Site Reliability Engineer II
RemoteHunter
About the role
Role Overview
The Site Reliability Engineer II will focus on automating, monitoring, and maintaining the reliability of AI inference workloads within a cloud platform. This role is essential for reducing operational toil, improving system stability, and supporting continuous deployment processes.
Key Responsibilities
- Build and maintain dashboards, alerts, and monitoring for inference workloads using the existing observability platform
- Develop automation and tooling in Python or Go to enhance system reliability and reduce manual work
- Create and improve runbooks for inference-specific operational procedures
- Support SLO tracking and reporting to identify trends and improvement areas
- Maintain CI/CD pipelines, deployment safety checks, and rollback processes
- Collaborate with product engineering teams to troubleshoot complex issues across the stack
- Participate in on-call rotations, respond to production incidents, and conduct blameless post-mortems
Requirements
- 2+ years of Site Reliability Engineering experience
- Bachelor's degree or equivalent professional experience
- Proficiency in Python or Go with experience in automation scripting
- Linux systems administration and infrastructure troubleshooting experience
- Familiarity with Kubernetes and containerization concepts
- Experience with monitoring and observability tools such as Prometheus or Grafana
- Exposure to CI/CD pipelines and infrastructure-as-code tools like Terraform or SaltStack
- Willingness to learn and curiosity about AI infrastructure and distributed systems
About RemoteHunter
The client operates in the cloud computing and AI infrastructure space, providing platforms that enable customers to run AI inference models and developers to create AI applications. They design, implement, deploy, and operate scalable, serverless inference workloads using GPU infrastructure and Kubernetes to deliver reliable AI services at scale.
Scraped 4/1/2026