About the role

Role Overview

As a Distinguished Site Reliability Engineer (SRE) – Cloud at NVIDIA, you will design, build, and run highly reliable, large-scale production systems. The role is focused on operating Kubernetes-based cloud services with strong attention to performance, latency, monitoring, and capacity.

What You'll Be Doing

Lead design, implementation, and support of operational and reliability aspects of large-scale Kubernetes clusters, emphasizing performance at scale
Own the end-to-end service lifecycle: from inception/design through deployment, operation, and refinement
Support services pre-launch via system design consulting, building tools/platforms/frameworks, capacity management, and launch reviews
Maintain live services by measuring and monitoring availability, latency, and overall health
Scale systems sustainably through automation, and evolve them by driving improvements in reliability and delivery velocity
Practice sustainable incident response and run blameless postmortems
Participate in an on-call rotation to support production systems

Requirements

BS in Computer Science or related technical field (or equivalent coding experience)
16+ years of experience with:
- Infrastructure automation
- Distributed systems design
- Designing and building tools for running large-scale private/public cloud systems in production
Experience with one or more of the following languages: Python, Go, Perl, or Ruby
In-depth knowledge of Linux, networking, and containers

Nice to Have / Stand Out From the Crowd

Interest in crafting/analyzing/fixing large-scale distributed systems
Systematic problem-solving with strong communication and ownership
Ability to debug/optimize code and automate routine tasks
Experience running private/public cloud systems using Kubernetes, OpenStack, and Docker

Compensation

Base salary range: $320,000 – $488,750 USD (varies by location and experience)
Eligible for equity and benefits

About NVIDIA

NVIDIA is a technology company focused on accelerated computing, including GPU-based platforms and cloud services. Its teams build and operate large-scale production systems, such as GPU cloud infrastructure, that require high availability, performance, and reliability.

Distinguished Site Reliability Engineer - Cloud
