Distinguished Site Reliability Engineer - Cloud
NVIDIA
Lead, Permanent, DevOps, Backend | Texas, United States | 3 days ago via LinkedIn
320,000 - 488,750 USD/annual
Tags
Site Reliability Engineering, SRE, Kubernetes, Linux, Networking, Distributed Systems, Infrastructure Automation, Monitoring, Logging, Python
About the role
Role Overview
As a Distinguished Site Reliability Engineer (SRE) – Cloud at NVIDIA, you will design, build, and run highly reliable, large-scale production systems. The role is focused on operating Kubernetes-based cloud services with strong attention to performance, latency, monitoring, and capacity.
What You'll Be Doing
- Lead design, implementation, and support of operational and reliability aspects of large-scale Kubernetes clusters, emphasizing performance at scale
- Own the end-to-end service lifecycle: from inception/design through deployment, operation, and refinement
- Support services pre-launch via system design consulting, building tools/platforms/frameworks, capacity management, and launch reviews
- Maintain live services by measuring and monitoring availability, latency, and overall health
- Scale sustainably using automation, and evolve systems by driving reliability and delivery velocity improvements
- Practice sustainable incident response and run blameless postmortems
- Participate in an on-call rotation to support production systems
Requirements
- BS in Computer Science or related technical field (or equivalent coding experience)
- 16+ years of experience with:
  - Infrastructure automation
  - Distributed systems design
  - Designing and building tools for running large-scale private/public cloud systems in production
- Experience with one or more languages: Python, Go, Perl, or Ruby
- In-depth knowledge of Linux, Networking, and Containers
Nice to Have / Stand Out From the Crowd
- Interest in crafting/analyzing/fixing large-scale distributed systems
- Systematic problem-solving with strong communication and ownership
- Ability to debug/optimize code and automate routine tasks
- Experience running private/public cloud systems using Kubernetes, OpenStack, and Docker
Compensation
- Base salary range: $320,000 – $488,750 USD (varies by location and experience)
- Eligible for equity and benefits
About NVIDIA
NVIDIA is a technology company focused on accelerated computing, including GPU-based platforms and cloud services. Its teams build and operate large-scale production systems—such as GPU cloud infrastructure—requiring high availability, performance, and reliability.
Scraped 6/20/2026