Role Overview

You will be part of Site Reliability Engineering (SRE) at NVIDIA, helping design, build, and run large-scale production systems with high efficiency and availability. The work emphasizes Kubernetes-enabled GPU cloud reliability, proactive outage prevention, automation, and continuous improvement of service performance.

What You’ll Be Doing

Lead the design, implementation, and support of the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale
Improve the end-to-end service lifecycle: inception/design → deployment → operations → refinement
Support services pre-launch via system design consulting, building tools/platforms/frameworks, capacity management, and launch reviews
Operate live services by measuring and monitoring availability, latency, and overall system health
Scale systems sustainably through automation, and improve reliability and delivery velocity
Practice sustainable incident response and conduct blameless postmortems
Participate in an on-call rotation for production systems

What We Need To See

BS (CS or related) or equivalent technical experience
16+ years of experience with infrastructure automation and distributed systems design
Experience designing and developing tools for running large-scale private or public cloud systems in production
Programming experience in one or more of: Python, Go, Perl, Ruby
Deep knowledge of Linux, networking, and containers

Ways To Stand Out

Interest in analyzing and fixing large-scale distributed systems
Systematic problem-solving with strong communication and ownership
Ability to debug/optimize code and automate routine tasks
Experience running private/public cloud systems based on Kubernetes, OpenStack, and Docker

Compensation

Base salary range: $320,000–$488,750 USD (varies by location and experience)
Eligible for equity and benefits

About NVIDIA

NVIDIA is a technology company focused on accelerated computing and GPU-based platforms for data centers, cloud, and developer ecosystems. The role highlights its internal and external GPU cloud services and the engineering practices used to run them with high reliability and availability.

Distinguished Site Reliability Engineer - Cloud
