xelys jobs

Senior Site Reliability Engineer 5, CORE (Resilience Operations)

Netflix

hybrid, senior, permanent, backend, devops | United States | 45 days ago via LinkedIn

Tags

Site Reliability Engineering, SRE, Python, Go, Java, AWS, GCP, Distributed Systems, Observability, SLOs

About the role

Role Overview

The Critical Operations and Reliability Engineering team at Netflix is looking for a Senior Site Reliability Engineer to strengthen reliability, observability, and operational excellence across Netflix’s large-scale platform, including Streaming, Games, and Ads. You will partner with engineering teams to design resilient architectures, build automation, and improve how the organization learns from incidents.

Location/Work Model: UCAN Remote; 10–15% travel

Responsibilities

  • Design and evolve resilient infrastructure for Netflix Streaming services at global scale (scalable, fault-tolerant, operable).
  • Run resilience tests at scale by intentionally inducing failures to validate behavior and uncover weaknesses.
  • Lead incident support: when failures occur, diagnose, drive fixes, and re-validate via repeat testing; partner to safely ship production changes.
  • Embed reliability, observability, and security throughout the software development lifecycle (design/readiness reviews, rollout, and operations).
  • Define and track Service Level Objectives (SLOs) and reliability metrics to guide capacity planning, operational priorities, and tradeoffs.
  • Build/improve automation for deployment, monitoring, capacity management, and incident response.
  • Participate in on-call rotations to support 24/7 availability for critical Streaming services.
  • Own incident follow-ups from triage through mitigation to systemic fixes that prevent repeat issues.
  • Proactively identify and reduce instability in distributed systems by analyzing real production failures and driving architectural/operational improvements.
  • Promote a reliability culture via documentation, best-practice guides, and tooling that helps other teams adopt improvements.

Requirements

  • 5+ years in SRE, Production Engineering, or a similar role operating business-critical, high-traffic services in production.
  • Strong coding skills in Python, Go, or Java, focused on automating operations.
  • Hands-on cloud infrastructure experience with AWS/Azure/GCP, including large-scale environments and platform orchestration/compute abstraction.
  • Deep knowledge of large-scale distributed systems, including failure modes, performance bottlenecks, resilience, and graceful degradation.
  • Proven ability to identify reliability risks using metrics, incidents, architecture reviews, or resilience testing, then implement scalable fixes.
  • Strong observability and performance-tuning skills using metrics, logs, and (by implication) tracing.

Nice-to-haves

  • Experience spanning multiple Netflix domains (e.g., Streaming plus other verticals such as Games and Ads) and influencing reliability across them.

About Netflix

Netflix is a global streaming and entertainment platform focused on delivering high-quality experiences to members worldwide. Its Critical Operations and Reliability Engineering organization builds and evolves the systems and operational practices that keep Netflix resilient under failures, traffic spikes, and constant change.

Scraped 4/1/2026
