xelys jobs

Machine Learning Operations Engineer

The Associated Press

mid, permanent, backend, devops, United States, 2 days ago via LinkedIn

Tags

AWS SageMaker, Machine Learning Operations, ML Inference, PyTorch, TensorFlow, Monitoring, Model Deployment, Autoscaling, A/B Testing, Drift Detection

About the role

Role Overview

The Machine Learning Operations Engineer at The Associated Press is an individual-contributor production operations role focused on runtime behavior, infrastructure, reliability, and cost control. The role partners with Machine Learning Engineers, Data Scientists, and Platform Engineering to deploy, operate, scale, monitor, and govern ML workloads across Dev, QA, and Production.

Responsibilities

  • Design, deploy, and operate end-to-end production ML pipelines across Dev, QA, and Prod.
  • Set up and manage AWS SageMaker pipelines, endpoints, and monitoring for large-scale inference (e.g., embedding generation, named entity recognition, reranking, video processing).
  • Select, benchmark, and optimize GPU/CPU infrastructure, including autoscaling and load testing.
  • Deploy and run inference services supporting hundreds of thousands of queries per day across text, image, and video pipelines.
  • Establish standardized ML deployment patterns, including containerization and orchestration, environment isolation, and versioned promotion with rollback/recovery.
  • Implement production monitoring: latency, error rates, throughput, drift detection, and evaluation metrics.
  • Enable A/B testing and controlled rollout strategies for ML models.
  • Partner with engineering and platform teams to operationalize models and pipeline improvements safely.
  • Manage high-throughput I/O and data movement for large media asset collections while avoiding CPU, network, and storage bottlenecks.
  • Enforce reproducibility, observability, security, and cost controls to reduce operational risk.

Requirements

  • 5+ years deploying and operating ML inference systems in production.
  • Strong experience with AWS SageMaker (pipelines, endpoints, monitoring) and multi-environment deployments.
  • Operational serving expertise with PyTorch and TensorFlow.
  • Proven experience with model deployment and orchestration (e.g., containerized inference and autoscaling).
  • Experience selecting, evaluating, and optimizing compute resources (GPU/CPU) for production workloads.
  • Experience with ML monitoring/evaluation metrics and A/B testing frameworks.
  • Ability to collaborate effectively in a shared-ownership model with ML Engineers, Data Scientists, and platform teams.

Nice to Have

  • Operational experience with transformer-based NLP models (e.g., BERT-family), computer vision models, and ranking/reranking systems.
  • Familiarity with operating common ML model types and ranking/retrieval methods, including approximate nearest neighbor (ANN) search (e.g., HNSW).
  • Experience running ML workloads over large-scale data/media collections.

About The Associated Press

The Associated Press (AP) is an independent global news organization founded in 1846. It delivers trusted, factual journalism across all formats and also provides essential technology and services that support the news business.

Scraped 6/14/2026