Distinguished Site Reliability Engineer - Cloud
NVIDIA
About the role
Role Overview
You will be part of Site Reliability Engineering (SRE) at NVIDIA, helping design, build, and run large-scale production systems with high efficiency and availability. The work emphasizes Kubernetes-enabled GPU cloud reliability, proactive outage prevention, automation, and continuous improvement of service performance.
What You’ll Be Doing
- Lead the design, implementation, and support of the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale
- Improve the end-to-end service lifecycle: inception/design → deployment → operations → refinement
- Support services pre-launch via system design consulting, building tools/platforms/frameworks, capacity management, and launch reviews
- Operate live services by measuring and monitoring availability, latency, and overall system health
- Scale systems sustainably through automation, and improve reliability and delivery velocity
- Practice sustainable incident response and conduct blameless postmortems
- Participate in an on-call rotation for production systems
What We Need To See
- BS (CS or related) or equivalent technical experience
- 16+ years of experience with infrastructure automation and distributed systems design
- Experience designing and developing tools for running large-scale private or public cloud systems in production
- Programming experience in one or more of: Python, Go, Perl, Ruby
- Deep knowledge of Linux, networking, and containers
Ways To Stand Out
- Interest in analyzing and fixing large-scale distributed systems
- Systematic problem-solving with strong communication and ownership
- Ability to debug/optimize code and automate routine tasks
- Experience running private/public cloud systems based on Kubernetes, OpenStack, and Docker
Compensation
- Base salary range: $320,000–$488,750 USD (varies by location and experience)
- Eligible for equity and benefits
About NVIDIA
NVIDIA is a technology company focused on accelerated computing and GPU-based platforms for data centers, cloud, and developer ecosystems. The role highlights its internal and external GPU cloud services and the engineering practices used to run them with high reliability and availability.
Scraped 6/16/2026