xelys jobs

Software Engineer

Baseten

Full remote · mid · permanent · backend · devops · Today via WTTJ

Tags

Go, Kubernetes, Observability, Distributed Systems, ML Training, MLOps, AWS, GCP, Temporal, Airflow

About the role

Role Overview

Join Baseten as a Software Engineer on the Training Infrastructure team. You will architect and lead the development of Baseten’s training platform, making key technical decisions about the infrastructure that helps developers deploy, scale, and monitor workloads reliably.

Responsibilities

  • Architect and lead the training platform’s infrastructure and technical direction.
  • Design scalable infrastructure for ML training, including:
    • Scheduling
    • Storage
    • Networking
  • Partner closely with developers and research engineers.
  • Drive long-term improvements in:
    • System reliability
    • Development velocity
    • Technical strategy
  • Develop and own observability systems for ML infrastructure.
  • Improve distributed performance and reliability across training workloads.

Requirements

  • Proven experience designing observability systems.
  • Bachelor’s degree or higher in Computer Science (or related field).
  • Advanced knowledge of distributed systems and performance tuning.
  • Deep expertise with Kubernetes in production.
  • Proficiency in Go (with Python experience as a plus).
  • Extensive experience with major cloud providers (AWS, GCP).
  • Experience with distributed storage systems.
  • Experience with workload orchestration platforms such as Temporal or Airflow.

Nice to Haves

  • Experience with ML/AI workloads and MLOps platforms.
  • Experience with additional cloud/compute providers (e.g., Crusoe, DigitalOcean, Nebius).
  • Familiarity with open-source training stacks and frameworks (e.g., NCCL, PyTorch, Megatron, NemoRL, VeRL, Axolotl, HF Trainer).
  • Experience with distributed training techniques (FSDP, DeepSpeed).
  • Experience building AI products, tooling, or agents.

About Baseten

Baseten is a remote-first company focused on building infrastructure and platforms for machine learning developers. Its Training Infrastructure team works on scalable systems that enable efficient deployment, scaling, and monitoring of ML workloads with high performance and reliability.

Scraped 5/12/2026
