About the role

Role Overview

The Site Reliability Engineer II will focus on the reliability of AI inference workloads within a cloud platform through automation, monitoring, and maintenance. This role is essential for reducing operational toil, improving system stability, and supporting continuous deployment processes.

Key Responsibilities

Build and maintain dashboards, alerts, and monitoring for inference workloads using the existing observability platform
Develop automation and tooling in Python or Go to enhance system reliability and reduce manual work
Create and improve runbooks for inference-specific operational procedures
Support SLO tracking and reporting to identify trends and improvement areas
Maintain CI/CD pipelines, deployment safety checks, and rollback processes
Collaborate with product engineering teams to troubleshoot complex issues across the stack
Participate in on-call rotations, respond to production incidents, and conduct blameless post-mortems

Requirements

2+ years of Site Reliability Engineering experience
Bachelor's Degree or equivalent professional experience
Proficiency in Python or Go with experience in automation scripting
Linux systems administration and infrastructure troubleshooting experience
Familiarity with Kubernetes and containerization concepts
Experience with monitoring and observability tools such as Prometheus or Grafana
Exposure to CI/CD pipelines and infrastructure-as-code tools like Terraform or SaltStack
Curiosity about AI infrastructure and distributed systems, and a willingness to learn

About RemoteHunter

The client operates in the cloud computing and AI infrastructure space, providing platforms that enable customers to run AI inference models and developers to create AI applications. They design, implement, deploy, and operate scalable, serverless inference workloads using GPU infrastructure and Kubernetes to deliver reliable AI services at scale.
