xelys jobs

Distinguished Site Reliability Engineer - Cloud

NVIDIA

lead, permanent, devops, backend
Texas, United States
3 days ago via LinkedIn
320,000 - 488,750 USD per year

Tags

Site Reliability Engineering, SRE, Kubernetes, Linux, Networking, Distributed Systems, Infrastructure Automation, Monitoring, Logging, Python

About the role

Role Overview

As a Distinguished Site Reliability Engineer (SRE) – Cloud at NVIDIA, you will design, build, and run highly reliable, large-scale production systems. The role is focused on operating Kubernetes-based cloud services with strong attention to performance, latency, monitoring, and capacity.

What You'll Be Doing

  • Lead design, implementation, and support of operational and reliability aspects of large-scale Kubernetes clusters, emphasizing performance at scale
  • Own the end-to-end service lifecycle: from inception/design through deployment, operation, and refinement
  • Support services pre-launch via system design consulting, building tools/platforms/frameworks, capacity management, and launch reviews
  • Maintain live services by measuring and monitoring availability, latency, and overall health
  • Scale sustainably using automation, and evolve systems by driving reliability and delivery velocity improvements
  • Practice sustainable incident response and run blameless postmortems
  • Participate in an on-call rotation to support production systems

Requirements

  • BS in Computer Science or related technical field (or equivalent coding experience)
  • 16+ years of experience with:
    • Infrastructure automation
    • Distributed systems design
    • Designing and building tools for running large-scale private/public cloud systems in production
  • Experience with one or more languages: Python, Go, Perl, or Ruby
  • In-depth knowledge of Linux, Networking, and Containers

Nice to Have / Stand Out From the Crowd

  • Interest in crafting/analyzing/fixing large-scale distributed systems
  • Systematic problem-solving with strong communication and ownership
  • Ability to debug/optimize code and automate routine tasks
  • Experience running private/public cloud systems using Kubernetes, OpenStack, and Docker

Compensation

  • Base salary range: $320,000 – $488,750 USD (varies by location and experience)
  • Eligible for equity and benefits

About NVIDIA

NVIDIA is a technology company focused on accelerated computing, including GPU-based platforms and cloud services. Its teams build and operate large-scale production systems, such as GPU cloud infrastructure, that require high availability, performance, and reliability.

Scraped 6/20/2026
