xelys jobs

Site Reliability Engineer II

RemoteHunter

Hybrid | Mid | Permanent | DevOps | United States | Yesterday via LinkedIn
95,000 - 171,000 USD per year

Tags

Kubernetes, Python, Go, Prometheus, Grafana, Terraform, CI/CD, Linux, AI Infrastructure, Site Reliability Engineering

About the role

Role Overview

The Site Reliability Engineer II will focus on automating, monitoring, and maintaining the reliability of AI inference workloads within a cloud platform. This role is essential for reducing operational toil, improving system stability, and supporting continuous deployment processes.

Key Responsibilities

  • Build and maintain dashboards, alerts, and monitoring for inference workloads using the existing observability platform
  • Develop automation and tooling in Python or Go to enhance system reliability and reduce manual work
  • Create and improve runbooks for inference-specific operational procedures
  • Support SLO tracking and reporting to identify trends and improvement areas
  • Maintain CI/CD pipelines, deployment safety checks, and rollback processes
  • Collaborate with product engineering teams to troubleshoot complex issues across the stack
  • Participate in on-call rotations, respond to production incidents, and conduct blameless post-mortems

Requirements

  • 2+ years of Site Reliability Engineering experience
  • Bachelor's Degree or equivalent professional experience
  • Proficiency in Python or Go with experience in automation scripting
  • Linux systems administration and infrastructure troubleshooting experience
  • Familiarity with Kubernetes and containerization concepts
  • Experience with monitoring and observability tools such as Prometheus or Grafana
  • Exposure to CI/CD pipelines and infrastructure-as-code tools like Terraform or SaltStack
  • Willingness to learn and curiosity about AI infrastructure and distributed systems

About RemoteHunter

The client operates in the cloud computing and AI infrastructure space, providing platforms that enable customers to run AI inference models and developers to create AI applications. They design, implement, deploy, and operate scalable, serverless inference workloads using GPU infrastructure and Kubernetes to deliver reliable AI services at scale.

Scraped 4/1/2026
