Role Overview

Cohere is hiring a Site Reliability Engineer (Inference Infrastructure) for the Model Serving team. The team develops, deploys, and operates the AI platform that delivers Cohere’s large language models through API endpoints, focusing on low latency, high throughput, and high availability.

Responsibilities

Build self-service systems that automate the management, deployment, and operation of services.
Develop and maintain custom Kubernetes operators for language model deployments.
Automate environment observability and resilience.
Enable developers to troubleshoot and resolve issues effectively.
Take ownership of reliability targets by participating in an on-call rotation and helping meet defined SLOs.
Collaborate with internal teams and influence the Infrastructure roadmap based on developer feedback.
Improve team effectiveness through knowledge sharing and active review processes.
Work with customers to support custom deployments that meet their specific needs.

Requirements

5+ years of engineering experience operating production infrastructure at large scale.
Experience designing highly available distributed systems using Kubernetes, including GPU workloads.
Experience using Kubernetes in both development and production support.
Experience with one or more major cloud platforms (GCP, Azure, AWS, OCI), or with multi-cloud or hybrid environments.
Strong Linux-based infrastructure skills: designing, deploying, supporting, and troubleshooting complex systems.
Ability to manage compute, storage, and network resources and their cost.
Strong collaboration and troubleshooting skills for mission-critical operations.
Familiarity with accelerator characteristics (GPUs/TPUs/custom accelerators) and how they affect latency and throughput.
Strong working knowledge of distributed systems.
Experience with Golang or C++ (or other languages for high-performance, scalable servers).

Nice to Haves

Experience deploying and operating inference systems with detailed performance and cost optimization across environments.

About Cohere

Cohere is a security-first enterprise AI company that builds frontier foundation AI models and end-to-end products for real-world business problems. The company trains and deploys large language models for enterprises via APIs, with a global engineering team across major tech hubs.

Site Reliability Engineer, Inference Infrastructure