xelys jobs

Distinguished Site Reliability Engineer - Cloud

NVIDIA

lead, permanent, devops, backend | Colorado, United States | 3 days ago via LinkedIn
320,000 - 488,750 USD/annual

Tags

Site Reliability Engineering (SRE), Kubernetes, Linux, Infrastructure Automation, Distributed Systems, Monitoring, Alerting, Blameless Postmortems, Python, Networking

About the role

Role Overview

You will be part of Site Reliability Engineering (SRE) at NVIDIA, helping design, build, and run large-scale production systems with high efficiency and availability. The work emphasizes Kubernetes-enabled GPU cloud reliability, proactive outage prevention, automation, and continuous improvement of service performance.

What You’ll Be Doing

  • Lead, design, implement, and support operational and reliability aspects of large-scale Kubernetes clusters with a focus on performance at scale
  • Improve the end-to-end service lifecycle: inception/design → deployment → operations → refinement
  • Support services pre-launch via system design consulting, building tools/platforms/frameworks, capacity management, and launch reviews
  • Operate live services by measuring and monitoring availability, latency, and overall system health
  • Scale sustainably using automation and improve reliability and delivery velocity
  • Practice sustainable incident response and conduct blameless postmortems
  • Participate in an on-call rotation for production systems

What We Need To See

  • BS (CS or related) or equivalent technical experience
  • 16+ years of experience with infrastructure automation and distributed systems design
  • Experience designing and developing tools for running large-scale private or public cloud systems in production
  • Programming experience in one or more of: Python, Go, Perl, Ruby
  • Deep knowledge of Linux, networking, and containers

Ways To Stand Out

  • Interest in analyzing and fixing large-scale distributed systems
  • Systematic problem-solving with strong communication and ownership
  • Ability to debug/optimize code and automate routine tasks
  • Experience running private/public cloud systems based on Kubernetes, OpenStack, and Docker

Compensation

  • Base salary range: $320,000–$488,750 USD (varies by location and experience)
  • Eligible for equity and benefits

About NVIDIA

NVIDIA is a technology company focused on accelerated computing and GPU-based platforms for data centers, cloud, and developer ecosystems. The role highlights its internal and external GPU cloud services and the engineering practices used to run them with high reliability and availability.

Scraped 6/16/2026
