xelys jobs

Staff Site Reliability Engineer

Thrive Market

lead, permanent, devops, security | Los Angeles, CA | 3 days ago via LinkedIn

Tags

Site Reliability Engineering (SRE), Kubernetes, AWS, Datadog, Prometheus, Grafana, SLOs & SLIs, Error Budgets, Incident Response, Chaos Engineering

About the role

Role Overview

Thrive Market is hiring a Staff Site Reliability Engineer to define and build the reliability foundation for its platform. You will establish the company’s SRE practice from the ground up—setting SLOs/SLIs and error budgets, improving observability, and creating frameworks to ensure systems scale reliably during rapid growth.

Responsibilities

Reliability & Observability

  • Define, implement, and own SLOs and SLIs for critical platform services.
  • Build and maintain monitoring/alerting/observability using tools such as Datadog, Prometheus, and Grafana.
  • Establish error budgets to balance feature velocity with reliability investments.
  • Lead incident response, run blameless postmortems, and drive systemic improvements to prevent recurrence.
  • Design and implement chaos engineering practices to proactively identify failure modes.

Infrastructure & Platform

  • Architect and optimize the Kubernetes-based container orchestration platform for reliability, performance, and cost efficiency.
  • Support infrastructure and platform migrations with minimal business disruption.
  • Contribute to the evaluation and execution of a potential migration to a next-generation ecommerce platform, including reliability planning and risk mitigation.
  • Design and implement automated deployment pipelines with feature flags and rollback/roll-forward capabilities.
  • Own disaster recovery, capacity planning, and system hardening initiatives.
  • Work with product engineering teams to help scale infrastructure in AWS and apply SRE best practices.

Culture & Process

  • Establish SRE as a practice: charter, processes, and engagement model with product engineering.
  • Promote operational excellence, continuous improvement, and data-driven reliability decisions.
  • Maintain technical documentation: architecture decisions, runbooks, incident procedures, and operational playbooks.
  • Participate in weekly on-call and help build sustainable on-call practices.
  • Identify systemic issues/inefficiencies and recommend improvements across engineering.

Requirements

  • B.S. in Computer Science (or equivalent experience).
  • 7+ years of hands-on experience in SRE, DevOps, or Infrastructure Engineering, with a strong track record of improving reliability in fast-growing environments.

Nice-to-Haves

  • Familiarity with The Google SRE Handbook, Accelerate, The DevOps Handbook, and related reliability/engineering practices.
  • Experience building SRE practices from scratch and leading reliability strategy alongside product engineering.

Location

  • Los Angeles, CA

About Thrive Market

Thrive Market is an online, membership-based market focused on making healthy and sustainable living easy and affordable. Founded in 2014, it delivers quality products at member-only prices and matches every paid membership with a free one for someone in need. The company is a Certified B Corporation and Climate Neutral Certified, serving 1.7M+ members.

Scraped 6/16/2026
