Senior ML Infrastructure / DevOps Engineer
Pathway
About the role
Role overview
Pathway is seeking a Senior ML Infrastructure / DevOps Engineer to own production infrastructure for ML training and low-latency inference. You’ll work closely with R&D while focusing on clusters, networks, storage, observability, and automation, across multiple cloud providers.
Responsibilities
- Operate and scale GPU-heavy clusters for daily R&D training and low-latency inference.
- Design, build, and automate the ML platform (not just run playbooks).
- Work across multiple cloud providers, addressing networking, scheduling, and cost/performance optimization.
- Design, operate, and scale GPU/CPU clusters using Slurm and Kubernetes, including autoscaling, queueing, and quota management.
- Automate provisioning and configuration with infrastructure-as-code (Terraform, CloudFormation) and cluster tooling.
- Build and maintain ML pipelines for data ingestion, training, evaluation, and deployment with reproducibility, traceability, and rollback.
- Implement and evolve ML-centric CI/CD for testing, packaging, and deployment of models/services.
- Own monitoring, logging, and alerting for training and serving (GPU/CPU utilization, latency, throughput, failures, and data/model drift) using Grafana, Prometheus, Loki, and CloudWatch.
- Handle terabyte-scale datasets and related storage/networking/performance challenges.
- Partner with ML engineers/researchers to productionize experimental work.
- Participate in on-call rotation; lead incident response and post-mortems.
Requirements
- 5+ years in DevOps/SRE/Platform/Infrastructure roles running production systems, ideally for high-performance or ML workloads.
- Strong Linux background from daily hands-on use, including shell scripting and cluster/service configuration.
- Comfort debugging at OS and network layers (e.g., systemd, filesystems, iptables/security groups, DNS, TLS, routing).
- Strong experience with workload management, containerization, and orchestration in production (Slurm, Docker, Kubernetes).
Nice to have
- Deep experience scaling ML infrastructure across multiple cloud providers with networking/scheduling/cost optimization.
About Pathway
Pathway builds post-transformer AI technology, including a breakthrough BDH architecture designed to outperform Transformers and give enterprises full visibility into model behavior. The company pairs this with a fast data processing engine to enable contextual, experience-driven intelligence for large organizations. It is trusted by customers such as NATO, La Poste, and Formula 1 teams.
Scraped 4/9/2026