Senior Machine Learning Engineer
Recruits Lab
Full-remote · Senior · Permanent · Backend · Data · United States · 3 days ago via LinkedIn
Tags
Machine Learning · Large Language Models · PyTorch · JAX · Distributed Training · DeepSpeed · Megatron-LM · FSDP · MoE · RLHF
About the role
Role Overview
Join a core model engineering team as a Senior Machine Learning Engineer focused on building and scaling large language models (10B–100B+ parameters). This role bridges applied research and production-grade engineering, with ownership across the training lifecycle and system performance.
What You’ll Own
- Foundation Model Engineering (End-to-End)
  - Implement large-scale pre-training, supervised fine-tuning (SFT), and alignment pipelines.
  - Optimize architectures and training strategies guided by scaling laws and product goals.
  - Improve model performance, reasoning capability, and training efficiency.
- Distributed Training & Performance Optimization (minimal FSDP sketch after this group)
  - Design and optimize multi-node GPU distributed training (A100/H100/B200).
  - Apply data, tensor, pipeline, and sequence parallelism.
  - Maximize Model FLOPs Utilization (MFU) and cluster efficiency.
  - Improve stability, fault tolerance, and monitoring.
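To ground the distributed-training bullets above, here is a minimal PyTorch FSDP sketch: a toy Transformer sharded across GPUs and trained on dummy batches. The model size, shapes, and hyperparameters are illustrative assumptions rather than details from this posting; launch with `torchrun --nproc_per_node=<num_gpus> train_fsdp.py`.

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # torchrun starts one process per GPU and sets LOCAL_RANK for each.
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Toy stand-in for a large model; FSDP shards parameters, gradients,
    # and optimizer state across all ranks (ZeRO-3-style sharding).
    model = torch.nn.Transformer(d_model=512, num_encoder_layers=6,
                                 num_decoder_layers=6).cuda()
    model = FSDP(model)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    src = torch.randn(10, 8, 512, device="cuda")  # (seq_len, batch, d_model)
    tgt = torch.randn(10, 8, 512, device="cuda")
    for _ in range(3):
        optim.zero_grad()
        loss = model(src, tgt).pow(2).mean()  # stand-in loss on dummy data
        loss.backward()                       # all-gather/reduce-scatter happen here
        optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

In practice an `auto_wrap_policy` is passed so each Transformer block becomes its own FSDP unit, which keeps peak memory bounded.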
- High-Throughput Data Systems (MinHash deduplication sketch after this group)
  - Build TB–PB scale data pipelines.
  - Implement ingestion, cleaning, deduplication (MinHash/LSH), safety filtering, and PII removal.
  - Support multimodal strategies, synthetic data generation, and curriculum learning.
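To make the deduplication bullet concrete, below is a minimal near-duplicate filter using MinHash signatures with LSH, here via the `datasketch` library. The shingle size, Jaccard threshold, and toy documents are illustrative assumptions.

```python
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128) -> MinHash:
    # Hash the set of 5-character shingles into a MinHash signature.
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(max(len(text) - 4, 1))}:
        m.update(shingle.encode("utf8"))
    return m

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog!",  # near-duplicate of "a"
    "c": "large language models need deduplicated training data",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # approximate Jaccard cutoff
kept = []
for key, text in docs.items():
    sig = signature(text)
    if lsh.query(sig):   # collides with an already-kept document?
        continue         # drop as a near-duplicate
    lsh.insert(key, sig)
    kept.append(key)

print(kept)  # "b" should collide with "a", leaving ["a", "c"]
```

At TB–PB scale the same idea runs as a distributed job (e.g. banded LSH buckets as a shuffle key) rather than an in-memory index.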
- Applied LLM Research Implementation (DPO sketch after this group)
  - Productionize alignment methods: RLHF, DPO, KTO.
  - Work with Mixture-of-Experts (MoE) models and routing optimization.
  - Improve reasoning, math, and coding performance.
  - Build and extend agent and tool-calling systems.
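Of the alignment methods above, DPO is the simplest to show in code. Below is a minimal sketch of the DPO objective (Rafailov et al., 2023) over dummy sequence log-probabilities; a real pipeline would sum per-token log-probs of each response under the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
             beta: float = 0.1) -> torch.Tensor:
    # -log sigmoid(beta * [(log-ratio on chosen) - (log-ratio on rejected)])
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy batch of 4 preference pairs (summed log-probs per sequence).
policy_chosen   = torch.tensor([-12.0, -10.5,  -9.8, -11.2])
policy_rejected = torch.tensor([-14.1, -13.0, -12.5, -15.0])
ref_chosen      = torch.tensor([-13.0, -11.0, -10.0, -12.0])
ref_rejected    = torch.tensor([-13.5, -12.8, -12.0, -14.2])

print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

The loss falls as the policy raises the likelihood of chosen responses relative to rejected ones, measured against the reference model.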
- Engineering Excellence
  - Maintain strong code and system design standards.
  - Identify and eliminate performance bottlenecks.
  - Own major components end-to-end.
Requirements
- Education: MS/PhD in CS/AI/Math, or equivalent practical experience.
- Hands-on experience optimizing large-scale deep learning systems.
- Strong Transformer knowledge (e.g., RoPE, FlashAttention, SwiGLU).
- Experience with modern LLMs (open-source or proprietary).
- Proficiency in PyTorch or JAX.
- Distributed training frameworks such as Megatron-LM, DeepSpeed, or FSDP.
- Knowledge of 3D parallelism and ZeRO optimization (a minimal ZeRO sketch follows this list).
- Training on large GPU clusters (100+ GPUs preferred).
- Familiarity with InfiniBand, RDMA, and storage I/O optimization.
- Ability to debug large distributed training runs.
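As referenced above, here is a minimal DeepSpeed ZeRO stage-3 sketch. The toy model, config values, and batch are illustrative assumptions, not details from this posting; launch with `deepspeed train_zero.py`.

```python
import deepspeed
import torch

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,            # shard params, grads, and optimizer state
        "overlap_comm": True,  # overlap communication with backward compute
    },
}

# Toy stand-in for a large model.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

x = torch.randn(4, 1024, device=engine.device, dtype=torch.bfloat16)
loss = engine(x).pow(2).mean()  # stand-in loss
engine.backward(loss)           # ZeRO-aware backward
engine.step()
```

ZeRO stage 3 partitions what FSDP also shards; the practical difference is mostly ecosystem (DeepSpeed config and launcher versus native PyTorch).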
Nice to Have
- Open-source contributions in LLM ecosystem.
- Experience with agentic systems or multi-step reasoning frameworks.
- CUDA/Triton kernel optimization.
- Published research or major production LLM deployments.
About Recruits Lab
Recruits Lab is recruiting on behalf of a well-funded AI research company building next-generation foundation models at massive scale. The work focuses on developing and optimizing large language models, including distributed training, data pipelines, and model alignment techniques.