About the role

Role Overview

As a Site Reliability Engineer (SRE) on Runpod’s Reliability team, you’ll help ensure the stability and resilience of Runpod’s distributed platform by owning reliability standards, incident response processes, observability systems, and automation that reduces operational toil.

Responsibilities

Reliability Engineering

Define and implement SLIs/SLOs for critical services
Lead incident response and coordinate cross-team mitigation
Run blameless postmortems and ensure corrective actions are completed
Perform production readiness reviews for new services and features
Identify systemic risks and drive preventative improvements

Observability & Monitoring

Design and improve monitoring, alerting, and dashboards (e.g., Prometheus/Grafana)
Improve alert signal-to-noise ratio and reduce alert fatigue
Build internal tooling for reliability tracking and reporting
Improve visibility into GPU performance and distributed systems health

Automation & Toil Reduction

Automate recurring operational workflows
Build scripts and tools in Python, Go, or Bash to eliminate manual processes
Improve deployment safety via automation and guardrails
Strengthen CI/CD reliability and release processes

Cross-Functional Reliability Advocacy

Partner with engineering teams to improve system resilience
Provide guidance on fault tolerance, scalability, and failure handling
Contribute to architecture discussions with a reliability-first mindset

Requirements

5+ years in SRE, Reliability Engineering, or Production Engineering
Strong Linux systems and networking expertise
Experience managing containerized production systems
Strong understanding of distributed systems, including failure handling

Impact

Increase platform uptime and reduce incident frequency/duration
Operationalize SLIs/SLOs and improve MTTR via tooling, automation, and runbooks
Strengthen production readiness standards and drive systemic reliability improvements

About Runpod

Runpod is a developer-focused platform for building and running custom AI systems at production scale. It provides infrastructure purpose-built for modern AI workloads and helps teams move from experimentation to deployment across cloud, on-prem, and hybrid environments. The Reliability team ensures the platform is resilient, observable, and scalable under real-world production conditions.
