xelys jobs

Site Reliability Engineer

Runpod

full-remote, senior, permanent, devops, backend | United States | 6 days ago via LinkedIn

Tags

Site Reliability Engineering, SRE, SLIs, SLOs, Incident Response, Observability, Prometheus, Grafana, Linux, Distributed Systems

About the role

Role Overview

This is a Site Reliability Engineer (SRE) role on Runpod’s Reliability team. You’ll help ensure the stability and resilience of Runpod’s distributed platform by owning reliability standards, incident response processes, observability systems, and automation that reduces operational toil.

Responsibilities

Reliability Engineering

  • Define and implement SLIs/SLOs for critical services
  • Lead incident response and coordinate cross-team mitigation
  • Run blameless postmortems and ensure corrective actions are completed
  • Perform production readiness reviews for new services and features
  • Identify systemic risks and drive preventative improvements

Observability & Monitoring

  • Design and improve monitoring, alerting, and dashboards (e.g., Prometheus/Grafana)
  • Improve alert signal-to-noise ratio and reduce alert fatigue
  • Build internal tooling for reliability tracking and reporting
  • Improve visibility into GPU performance and distributed systems health

Automation & Toil Reduction

  • Automate recurring operational workflows
  • Build scripts and tools in Python, Go, or Bash to eliminate manual processes
  • Improve deployment safety via automation and guardrails
  • Strengthen CI/CD reliability and release processes

Cross-Functional Reliability Advocacy

  • Partner with engineering teams to improve system resilience
  • Provide guidance on fault tolerance, scalability, and failure handling
  • Contribute to architecture discussions with a reliability-first mindset

Requirements

  • 5+ years in SRE, Reliability Engineering, or Production Engineering
  • Strong Linux systems and networking expertise
  • Experience managing containerized production systems
  • Strong understanding of distributed systems and their failure handling

Impact

  • Increase platform uptime and reduce incident frequency/duration
  • Operationalize SLIs/SLOs and improve MTTR via tooling, automation, and runbooks
  • Strengthen production readiness standards and drive systemic reliability improvements

About Runpod

Runpod is a developer-focused platform for building and running custom AI systems at production scale. It provides infrastructure purpose-built for modern AI workloads and helps teams move from experimentation to deployment across cloud, on-prem, and hybrid environments. The Reliability team ensures the platform is resilient, observable, and scalable under real-world production conditions.

Scraped 6/15/2026
