Senior Machine Learning Engineer
Recruits Lab
Full-remote · Senior · Permanent · Backend · Data · United States · 3 days ago via LinkedIn
Tags
Machine Learning · Large Language Models · PyTorch · JAX · Distributed Training · DeepSpeed · Megatron-LM · FSDP · MoE · RLHF
About the role
Role Overview
Join a core model engineering team as a Senior Machine Learning Engineer focused on building and scaling large language models (10B–100B+ parameters). This role bridges applied research and production-grade engineering, with ownership across the training lifecycle and system performance.
What You’ll Own
- Foundation Model Engineering (End-to-End)
  - Implement large-scale pre-training, supervised fine-tuning (SFT), and alignment pipelines.
  - Optimize architectures and training strategies guided by scaling laws and product goals.
  - Improve model performance, reasoning capability, and training efficiency.
- Distributed Training & Performance Optimization (minimal FSDP sketch after this group)
  - Design and optimize multi-node GPU distributed training (A100/H100/B200).
  - Apply data, tensor, pipeline, and sequence parallelism.
  - Maximize Model FLOPs Utilization (MFU) and cluster efficiency.
  - Improve stability, fault tolerance, and monitoring.
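To ground the distributed-training bullets above, here is a minimal PyTorch FSDP sketch: a toy Transformer sharded across GPUs and trained on dummy batches. The model size, shapes, and hyperparameters are illustrative assumptions rather than details from this posting; launch with `torchrun --nproc_per_node=<num_gpus> train_fsdp.py`.

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # torchrun starts one process per GPU and sets LOCAL_RANK for each.
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Toy stand-in for a large model; FSDP shards parameters, gradients,
    # and optimizer state across all ranks (ZeRO-3-style sharding).
    model = torch.nn.Transformer(d_model=512, num_encoder_layers=6,
                                 num_decoder_layers=6).cuda()
    model = FSDP(model)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    src = torch.randn(10, 8, 512, device="cuda")  # (seq_len, batch, d_model)
    tgt = torch.randn(10, 8, 512, device="cuda")
    for _ in range(3):
        optim.zero_grad()
        loss = model(src, tgt).pow(2).mean()  # stand-in loss on dummy data
        loss.backward()                       # all-gather/reduce-scatter happen here
        optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

In practice an `auto_wrap_policy` is passed so each Transformer block becomes its own FSDP unit, which keeps peak memory bounded.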
- High-Throughput Data Systems (MinHash deduplication sketch after this group)
  - Build TB–PB scale data pipelines.
  - Implement ingestion, cleaning, deduplication (MinHash/LSH), safety filtering, and PII removal.
  - Support multimodal strategies, synthetic data generation, and curriculum learning.
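To make the deduplication bullet concrete, below is a minimal near-duplicate filter using MinHash signatures with LSH, here via the `datasketch` library. The shingle size, Jaccard threshold, and toy documents are illustrative assumptions.

```python
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128) -> MinHash:
    # Hash the set of 5-character shingles into a MinHash signature.
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(max(len(text) - 4, 1))}:
        m.update(shingle.encode("utf8"))
    return m

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog!",  # near-duplicate of "a"
    "c": "large language models need deduplicated training data",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # approximate Jaccard cutoff
kept = []
for key, text in docs.items():
    sig = signature(text)
    if lsh.query(sig):   # collides with an already-kept document?
        continue         # drop as a near-duplicate
    lsh.insert(key, sig)
    kept.append(key)

print(kept)  # "b" should collide with "a", leaving ["a", "c"]
```

At TB–PB scale the same idea runs as a distributed job (e.g. banded LSH buckets as a shuffle key) rather than an in-memory index.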
- Applied LLM Research Implementation (DPO sketch after this group)
  - Productionize alignment methods: RLHF, DPO, KTO.
  - Work with Mixture-of-Experts (MoE) models and routing optimization.
  - Improve reasoning, math, and coding performance.
  - Build and extend agent and tool-calling systems.
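Of the alignment methods above, DPO is the simplest to show in code. Below is a minimal sketch of the DPO objective (Rafailov et al., 2023) over dummy sequence log-probabilities; a real pipeline would sum per-token log-probs of each response under the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
             beta: float = 0.1) -> torch.Tensor:
    # -log sigmoid(beta * [(log-ratio on chosen) - (log-ratio on rejected)])
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy batch of 4 preference pairs (summed log-probs per sequence).
policy_chosen   = torch.tensor([-12.0, -10.5,  -9.8, -11.2])
policy_rejected = torch.tensor([-14.1, -13.0, -12.5, -15.0])
ref_chosen      = torch.tensor([-13.0, -11.0, -10.0, -12.0])
ref_rejected    = torch.tensor([-13.5, -12.8, -12.0, -14.2])

print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

The loss falls as the policy raises the likelihood of chosen responses relative to rejected ones, measured against the reference model.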
- Engineering Excellence
  - Maintain strong code and system design standards.
  - Identify and eliminate performance bottlenecks.
  - Own major components end-to-end.
Requirements
- Education: MS/PhD in CS/AI/Math, or equivalent practical experience.
- Hands-on experience optimizing large-scale deep learning systems.
- Strong Transformer knowledge (e.g., RoPE, FlashAttention, SwiGLU).
- Experience with modern LLMs (open-source or proprietary).
- Proficiency in PyTorch or JAX.
- Distributed training frameworks such as Megatron-LM, DeepSpeed, or FSDP.
- Knowledge of 3D parallelism and ZeRO optimization (a minimal ZeRO sketch follows this list).
- Training on large GPU clusters (100+ GPUs preferred).
- Familiarity with InfiniBand, RDMA, and storage I/O optimization.
- Ability to debug large distributed training runs.
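As referenced above, here is a minimal DeepSpeed ZeRO stage-3 sketch. The toy model, config values, and batch are illustrative assumptions, not details from this posting; launch with `deepspeed train_zero.py`.

```python
import deepspeed
import torch

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,            # shard params, grads, and optimizer state
        "overlap_comm": True,  # overlap communication with backward compute
    },
}

# Toy stand-in for a large model.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

x = torch.randn(4, 1024, device=engine.device, dtype=torch.bfloat16)
loss = engine(x).pow(2).mean()  # stand-in loss
engine.backward(loss)           # ZeRO-aware backward
engine.step()
```

ZeRO stage 3 partitions what FSDP also shards; the practical difference is mostly ecosystem (DeepSpeed config and launcher versus native PyTorch).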
Nice to Have
- Open-source contributions in LLM ecosystem.
- Experience with agentic systems or multi-step reasoning frameworks.
- CUDA/Triton kernel optimization.
- Published research or major production LLM deployments.
About Recruits Lab
Recruits Lab is recruiting on behalf of a well-funded AI research company building next-generation foundation models at massive scale. The work focuses on developing and optimizing large language models, including distributed training, data pipelines, and model alignment techniques.