xelys jobs

MLOps/AI Infrastructure Engineer

Centific

hybrid, senior, permanent, backend, devops, data | Redmond, WA | 46 days ago via LinkedIn

Tags

MLOps, Kubernetes, NVIDIA GPU Operator, CUDA, DCGM, MIG, Triton Inference Server, GitOps, ArgoCD, Ceph

About the role

Role overview

Centific is hiring an MLOps / AI Infrastructure Engineer to build and operate an on-premises, GPU-focused AI platform. The platform runs in environments such as air-gapped, physically secured, and government-compliant facilities, which require “bulletproof” infrastructure for high-throughput training and low-latency inference.

Responsibilities

GPU compute & hardware infrastructure

  • Deploy, configure, and maintain on-prem GPU servers (primarily NVIDIA H200 and A100) including:
    • Driver management, CUDA toolkit versioning
    • NVLink/NVSwitch topology and related tuning
    • Firmware updates
  • Implement and tune NVIDIA tooling:
    • DCGM for GPU health monitoring and telemetry
    • MIG for multi-tenant workload partitioning
    • NVIDIA Container Toolkit for GPU-aware containers
  • Manage bare-metal provisioning workflows (e.g., iPXE/PXE, MAAS/Foreman) for repeatable, auditable builds
  • Monitor hardware health, capacity, and thermal and power envelopes; define alerting and respond to failures with minimal disruption

Kubernetes & container orchestration

  • Build, upgrade, and maintain production Kubernetes clusters on bare metal using kubeadm or Rancher RKE2
  • Configure GPU node pools via NVIDIA GPU Operator
  • Design and operate networking for AI workloads using Calico, Cilium, or SR-IOV (RDMA when required)
  • Configure and manage MetalLB (or equivalent), ingress controllers, and service mesh components (Istio or Linkerd)
  • Enforce platform stability and isolation with:
    • Resource quotas, LimitRanges, PriorityClasses
    • Node affinity/taints to prevent resource contention
  • Maintain security and compliance with:
    • RBAC, Pod Security Admission, network policies
    • Secrets management (HashiCorp Vault or Sealed Secrets)
    • CIS Kubernetes Benchmark compliance

MLOps pipelines & AI workload management

  • Deploy and operate MLOps platforms such as MLflow and Kubeflow (or equivalent)
  • Configure NVIDIA Triton Inference Server for multi-model serving, dynamic batching, and ensembles
  • Build CI/CD pipelines for model deployment using GitOps (e.g., ArgoCD or Flux), including:
    • Automated model validation
    • Canary rollouts and rollback mechanisms
  • Optimize GPU utilization for training (Volcano or Kueue) and for latency-sensitive inference; use DCGM and Prometheus for efficiency metrics
  • Manage model artifacts and versioning with storage backends such as Ceph RBD/CephFS or MinIO integrated into the MLOps toolchain

Networking & storage architecture

  • Design and implement the high-bandwidth network fabric that interconnects GPU clusters from edge to core compute, suitable for high-throughput sensor data and AI workloads

About Centific

Centific provides an AI platform designed to run outside hyperscaler cloud environments, including on-premises, government facilities, and network edge locations. The platform focuses on reliable infrastructure for AI training and inference under stringent security and compliance requirements. The role supports end-to-end deployment through MLOps and scalable AI infrastructure engineering.

Scraped 4/10/2026
