About the role

Role Overview

As a Staff Site Reliability Engineer (SRE) on Babylist’s Platform team, you’ll keep Babylist’s infrastructure reliable, fast, and scalable for millions of users. This is an engineering-evolution role (not maintenance): you’ll actively improve how AWS infrastructure, CI systems, and developer tooling are built and operated.

Responsibilities

Own infrastructure and reliability practices that support 9M+ users and the engineers building for them
Evolve AWS infrastructure and reliability operations in ways that give teams across the company wide leverage
Drive improvements to Infrastructure as Code (IaC) using Terraform
Design, improve, and maintain CI/CD systems focused on developer velocity
Build and tune observability and alerting that is actionable and low-noise
Lead and participate in on-call and incident management processes
Operate and debug Kubernetes in production

Requirements

Deep, hands-on Terraform expertise, with the ability to own IaC end to end
Strong AWS experience at scale, including EKS, RDS, cloud networking, DNS, CDNs, and load balancers
Experience operating Kubernetes in production, including debugging hard issues
Comfort designing and improving CI/CD systems (e.g., CircleCI, GitHub Actions)
Solid observability instincts with tools such as Datadog, Sentry, PagerDuty, and Cronitor
Experience with on-call and incident management

Tech Stack (from posting)

Ruby on Rails, AWS, Sidekiq, MySQL, Redis

About Babylist

Babylist is a leading platform for expecting and new families, helping more than 10 million people shop with seamless purchasing, guidance, and expert recommendations. The company has grown from a baby registry into a broader ecosystem including the Babylist Shop, Health, Money, showrooms, and branded content, and is positioned as an AI-forward tech organization.