xelys jobs

Senior Machine Learning Engineer

Recruits Lab

full-remote · senior · permanent · backend · data | United States | 3 days ago via LinkedIn


Tags

Machine Learning, Large Language Models, PyTorch, JAX, Distributed Training, DeepSpeed, Megatron-LM, FSDP, MoE, RLHF

About the role

Role Overview

Join a core model engineering team as a Senior Machine Learning Engineer focused on building and scaling large language models (10B–100B+ parameters). This role bridges applied research and production-grade engineering, with ownership across the training lifecycle and system performance.

What You’ll Own

  • Foundation Model Engineering (End-to-End)
    • Implement large-scale pre-training, SFT, and alignment pipelines.
    • Optimize architectures and training strategies using scaling laws and product goals.
    • Improve performance, reasoning capability, and training efficiency.
  • Distributed Training & Performance Optimization
    • Design and optimize multi-node GPU distributed training (A100/H100/B200).
    • Use Data / Tensor / Pipeline / Sequence Parallelism.
    • Maximize MFU and cluster efficiency.
    • Improve stability, fault tolerance, and monitoring.
  • High-Throughput Data Systems
    • Build TB–PB scale data pipelines.
    • Implement ingestion, cleaning, deduplication (MinHash/LSH), safety filtering, and PII removal.
    • Support multimodal strategies, synthetic data generation, and curriculum learning.
  • Applied LLM Research Implementation
    • Productionize alignment methods: RLHF, DPO, KTO.
    • Work with Mixture-of-Experts (MoE) and routing optimization.
    • Improve reasoning, math, and coding performance.
    • Build/extend agent and tool-calling systems.
  • Engineering Excellence
    • Maintain strong code/system design standards.
    • Identify and eliminate performance bottlenecks.
    • Own major components end-to-end.
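The deduplication step named above (MinHash/LSH) is compact enough to illustrate. This is a minimal pure-Python sketch, not the team's actual pipeline: it builds character-shingle MinHash signatures and estimates Jaccard similarity from the fraction of matching signature slots.

```python
import hashlib

def shingles(text, k=5):
    """Character k-gram shingles of a document (falls back to the
    whole text when it is shorter than k)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)} or {text}

def minhash_signature(text, num_hashes=64, k=5):
    """MinHash signature: for each seeded hash function, keep the
    minimum hash value over the document's shingles."""
    sh = shingles(text, k)
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # distinct salt = distinct hash fn
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for s in sh))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots is an unbiased estimate of the
    Jaccard similarity of the underlying shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

With 64 hash functions the estimate has a standard error of roughly 0.06, enough to flag near-duplicates for a second-pass exact check; at corpus scale the signatures would be banded into LSH buckets rather than compared pairwise.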
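Of the alignment methods listed above (RLHF, DPO, KTO), DPO has the most compact objective: push the policy's log-probability margin between the chosen and rejected response above the reference model's margin. A minimal per-pair sketch in plain Python, with illustrative numbers rather than real model outputs:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair: -log sigmoid(beta * margin),
    where the margin is how much more the policy favors the chosen
    response than the frozen reference model does."""
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)), written stably with log1p
    return math.log1p(math.exp(-margin))

# Illustrative sequence log-probabilities (in practice these come from
# the policy and reference models over full responses).
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy == reference
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)  # policy favors chosen more
```

At initialization (policy equal to reference) the margin is zero and the loss is log 2; it falls as the policy widens its preference for the chosen response, which is the signal a production trainer batches and backpropagates.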

Requirements

  • MS/PhD in CS/AI/Math, or equivalent practical experience.
  • Hands-on engineering experience optimizing large-scale deep learning systems.
  • Strong Transformer knowledge (e.g., RoPE, FlashAttention, SwiGLU).
  • Experience with modern LLMs (open-source or proprietary).
  • Proficiency in PyTorch or JAX.
  • Distributed training frameworks such as Megatron-LM, DeepSpeed, or FSDP.
  • Knowledge of 3D parallelism and ZeRO optimization.
  • Training on large GPU clusters (100+ GPUs preferred).
  • Familiarity with InfiniBand, RDMA, and storage I/O optimization.
  • Ability to debug large distributed training runs.
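Several requirements above (3D parallelism, ZeRO, 100+ GPU clusters) ultimately roll up into one headline number, the MFU mentioned in the role description. A back-of-the-envelope sketch, using the common ~6 FLOPs-per-parameter-per-token estimate for dense Transformers; every number in the example is hypothetical:

```python
def mfu(params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    """Model FLOPs Utilization: achieved training throughput as a
    fraction of the cluster's peak.

    Uses the common ~6 * N FLOPs-per-token estimate for a dense
    Transformer (forward + backward), ignoring the attention term.
    """
    achieved_flops = 6 * params * tokens_per_second
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops

# Hypothetical example: a 70B-parameter dense model training at
# 1M tokens/s on 1024 GPUs, each with ~989 TFLOP/s of dense BF16
# peak (an H100-class figure).
util = mfu(params=70e9, tokens_per_second=1.0e6,
           num_gpus=1024, peak_flops_per_gpu=989e12)  # ≈ 0.41
```

Doubling sustained tokens/s on the same cluster doubles MFU, which is why throughput regressions from stragglers, I/O stalls, or poor parallelism layouts show up directly in this metric.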

Nice to Have

  • Open-source contributions in LLM ecosystem.
  • Experience with agentic systems or multi-step reasoning frameworks.
  • CUDA/Triton kernel optimization.
  • Published research or major production LLM deployments.

About Recruits Lab

Recruits Lab is recruiting on behalf of a well-funded AI research company building next-generation foundation models at massive scale. The work focuses on developing and optimizing large language models, including distributed training, data pipelines, and model alignment techniques.

Scraped 4/17/2026
