xelys jobs

Site Reliability Engineer

Baseten

Full remote | Mid | Permanent | DevOps | Other | Today via WTTJ

Tags

Site Reliability Engineering, SRE, Kubernetes, Grafana, Loki, Prometheus, Incident Response, Machine Learning Ops, Helm, CI/CD

About the role

Role Overview

Baseten is hiring a Site Reliability Engineer (SRE) to serve as the primary technical owner for its most strategic customers. You’ll ensure smooth deployments, strong performance, and reliable operation of machine learning workloads in production.

Responsibilities

  • Act as primary post-sales technical owner for top enterprise accounts, ensuring reliable deployment and operations of ML workloads.
  • Diagnose and resolve runtime and infrastructure issues (including production-grade debugging).
  • Lead incident response during outages/escalations and coordinate between Product, FDE, Sales, and Engineering.
  • Maintain and improve runbooks and proactively identify patterns/failure modes.
  • Translate customer-reported pain points into product improvements, roadmap insights, and documentation enhancements.
  • Manage escalations end-to-end: issue resolution, root-cause analysis, and communication.

Requirements

  • Deep Kubernetes troubleshooting expertise, including advanced resource debugging, pod/runtime analysis, and log-based diagnostics using observability tools (e.g., Grafana, Loki, Prometheus).
  • 3+ years in a fast-paced, high-growth, or customer-facing engineering environment.
  • Strong communication skills and executive presence during high-visibility incidents.
  • Strong infrastructure debugging skills across container orchestration, networking, and service dependencies.
  • Experience managing high-severity incidents with major customers, including SLAs and post-incident reviews.
  • Proven project management and organizational skills with an ownership mindset.

Nice to Have

  • Familiarity with running/troubleshooting high-performance AI models and ML pipelines (preprocessing → inference → serving).
  • Experience implementing or managing incident/ticketing systems such as Zendesk or Pylon.
  • Familiarity with Helm, Flux, CI/CD tooling, or scripted automation for deployment and operational workflows.

About Baseten

Baseten is a remote-first company focused on making machine learning accessible to everyone. It supports the deployment and operations of ML workloads for strategic customers, emphasizing reliability, performance, and production readiness.

Scraped 5/12/2026
