xelys jobs

Senior ML Infrastructure / DevOps Engineer

Pathway

full-remote, senior, permanent, devops, backend | Anywhere in the World | Today via WWR

Tags

Linux, DevOps, SRE, Kubernetes, Slurm, Docker, Terraform, CI/CD, Prometheus, Grafana

About the role

Role overview

Pathway is seeking a Senior ML Infrastructure / DevOps Engineer to own production infrastructure for ML training and low-latency inference. You’ll work closely with R&D while focusing on clusters, networks, storage, observability, and automation across multiple cloud providers.

Responsibilities

  • Operate and scale GPU-heavy clusters for daily R&D training and low-latency inference.
  • Design, build, and automate the ML platform (not just run playbooks).
  • Work across multiple cloud providers, addressing networking, scheduling, and cost/performance optimization.
  • Design, operate, and scale GPU/CPU clusters using Slurm and Kubernetes, including autoscaling, queueing, and quota management.
  • Automate provisioning and configuration with infrastructure-as-code (Terraform, CloudFormation) and cluster tooling.
  • Build and maintain ML pipelines for data ingestion, training, evaluation, and deployment with reproducibility, traceability, and rollback.
  • Implement and evolve ML-centric CI/CD for testing, packaging, and deployment of models/services.
  • Own monitoring, logging, and alerting for training and serving (GPU/CPU utilization, latency, throughput, failures, and data/model drift) using Grafana, Prometheus, Loki, and CloudWatch.
  • Handle terabyte-scale datasets and related storage/networking/performance challenges.
  • Partner with ML engineers/researchers to productionize experimental work.
  • Participate in on-call rotation; lead incident response and post-mortems.

Requirements

  • 5+ years in DevOps/SRE/Platform/Infrastructure roles running production systems, ideally for high-performance or ML workloads.
  • Strong Linux background (used as a daily driver), including shell scripting and cluster/service configuration.
  • Comfort debugging at OS and network layers (e.g., systemd, filesystems, iptables/security groups, DNS, TLS, routing).
  • Strong experience with workload management, containerization, and orchestration in production (Slurm, Docker, Kubernetes).

Nice to have

  • Deep experience scaling ML infrastructure across multiple cloud providers with networking/scheduling/cost optimization.

About Pathway

Pathway builds post-transformer AI technology, including a breakthrough BDH architecture designed to outperform Transformers and provide enterprises full visibility into model behavior. The company pairs this with a fast data processing engine to enable contextual, experience-driven intelligence for large organizations. It is trusted by customers such as NATO, La Poste, and Formula 1 teams.

Scraped 4/9/2026
