Site Reliability Engineer, Inference Infrastructure
Cohere
Senior | Permanent | DevOps | Backend | San Francisco, CA | 6 days ago via LinkedIn
Tags
Site Reliability Engineering, Kubernetes, Distributed Systems, Linux, Golang, C++, Cloud Infrastructure, GPU Workloads, Observability, SLOs
About the role
Role Overview
Cohere is hiring a Site Reliability Engineer (Inference Infrastructure) for the Model Serving team. The team develops, deploys, and operates the AI platform that delivers Cohere’s large language models through API endpoints, focusing on low latency, high throughput, and high availability.
Responsibilities
- Build self-service systems that automate the management, deployment, and operation of services.
- Develop and maintain custom Kubernetes operators for language model deployments.
- Automate environment observability and resilience.
- Enable developers to troubleshoot and resolve issues effectively.
- Take ownership of reliability targets by participating in an on-call rotation and helping meet defined SLOs.
- Collaborate with internal teams and influence the Infrastructure roadmap based on developer feedback.
- Improve team effectiveness through knowledge sharing and active review processes.
- Interface with customers to support customized deployments meeting specific needs.
Requirements
- 5+ years of engineering experience operating production infrastructure at large scale.
- Experience designing highly available distributed systems using Kubernetes, including GPU workloads.
- Kubernetes experience in both development and production support.
- Experience with one or more major cloud platforms: GCP, Azure, AWS, OCI, and/or multi-cloud or hybrid environments.
- Strong Linux-based infrastructure skills: designing, deploying, supporting, and troubleshooting complex systems.
- Ability to manage compute/storage/network resources and cost.
- Strong collaboration and troubleshooting skills for mission-critical operations.
- Familiarity with accelerator characteristics (GPUs/TPUs/custom accelerators) and how they affect latency and throughput.
- Strong working knowledge of distributed systems.
- Experience with Golang or C++ (or other languages for high-performance, scalable servers).
Nice to Haves
- Experience deploying and operating inference systems with detailed performance and cost optimization across environments.
About Cohere
Cohere is a security-first enterprise AI company that builds frontier foundation AI models and end-to-end products for real-world business problems. The company trains and deploys large language models for enterprises via APIs, with a global engineering team across major tech hubs.
Scraped 6/19/2026