Staff Site Reliability Engineer
Thrive Market
About the role
Role Overview
Thrive Market is hiring a Staff Site Reliability Engineer to define and build the reliability foundation for its platform. You will establish the company’s SRE practice from the ground up: setting SLOs, SLIs, and error budgets, improving observability, and creating frameworks that keep systems reliable as they scale during rapid growth.
Responsibilities
Reliability & Observability
- Define, implement, and own SLOs and SLIs for critical platform services.
- Build and maintain monitoring, alerting, and observability using tools such as Datadog, Prometheus, and Grafana.
- Establish error budgets to balance feature velocity with reliability investments.
- Lead incident response, run blameless postmortems, and drive systemic improvements to prevent recurrence.
- Design and implement chaos engineering practices to proactively identify failure modes.
Infrastructure & Platform
- Architect and optimize the Kubernetes-based container orchestration platform for reliability, performance, and cost efficiency.
- Support infrastructure and platform migrations with minimal business disruption.
- Contribute reliability planning and risk mitigation to the evaluation and execution of a potential migration to a next-generation ecommerce platform.
- Design and implement automated deployment pipelines with feature flags and rollback and roll-forward capabilities.
- Own disaster recovery, capacity planning, and system hardening initiatives.
- Work with product engineering teams to help scale infrastructure in AWS and apply SRE best practices.
Culture & Process
- Establish SRE as a practice: charter, processes, and engagement model with product engineering.
- Promote operational excellence, continuous improvement, and data-driven reliability decisions.
- Maintain technical documentation: architecture decisions, runbooks, incident procedures, and operational playbooks.
- Participate in weekly on-call and help build sustainable on-call practices.
- Identify systemic issues and inefficiencies, and recommend improvements across engineering.
Requirements
- B.S. in Computer Science (or equivalent experience).
- 7+ years of hands-on experience in SRE, DevOps, or Infrastructure Engineering, with a strong track record of improving reliability in fast-growing environments.
Nice-to-Haves
- Familiarity with the Google SRE book, Accelerate, The DevOps Handbook, and related reliability and engineering practices.
- Experience building SRE practices from scratch and leading reliability strategy alongside product engineering.
Location
- Los Angeles, CA
About Thrive Market
Thrive Market is an online, membership-based market focused on making healthy and sustainable living easy and affordable. Founded in 2014, it delivers quality products at member-only prices and matches every paid membership with a free one for someone in need. The company is a Certified B Corporation and Climate Neutral Certified, serving 1.7M+ members.
Scraped 6/16/2026