MLOps/AI Infrastructure Engineer
Centific
About the role
Role overview
Centific is hiring an MLOps / AI Infrastructure Engineer to build and operate an on-premises, GPU-focused AI platform. The platform runs in air-gapped, physically secured, and government-compliant facilities, which require “bulletproof” infrastructure for high-throughput training and low-latency inference.
Responsibilities
GPU compute & hardware infrastructure
- Deploy, configure, and maintain on-prem GPU servers (primarily NVIDIA H200 and A100), including:
  - Driver management and CUDA toolkit versioning
  - NVLink/NVSwitch topology and related tuning
  - Firmware updates
- Implement and tune NVIDIA tooling:
  - DCGM for GPU health monitoring and telemetry
  - MIG for multi-tenant workload partitioning
  - NVIDIA Container Toolkit for GPU-aware containers
- Manage bare-metal provisioning workflows (e.g., iPXE/PXE, MAAS/Foreman) for repeatable, auditable builds
- Monitor hardware health, capacity, and thermal and power envelopes; define alerting and respond to failures with minimal disruption
Kubernetes & container orchestration
- Build, upgrade, and maintain production Kubernetes clusters on bare metal using kubeadm or Rancher RKE2
- Configure GPU node pools via NVIDIA GPU Operator
- Design and operate networking for AI workloads using Calico, Cilium, or SR-IOV (RDMA when required)
- Configure and manage MetalLB (or equivalent), ingress controllers, and service mesh components (Istio or Linkerd)
- Enforce platform stability and isolation with:
  - Resource quotas, LimitRanges, PriorityClasses
  - Node affinity/taints to prevent resource contention
- Maintain security and compliance with:
  - RBAC, Pod Security Admission, network policies
  - Secrets management (HashiCorp Vault or Sealed Secrets)
  - CIS Kubernetes Benchmark compliance
MLOps pipelines & AI workload management
- Deploy and operate MLOps platforms such as MLflow and Kubeflow (or equivalent)
- Configure NVIDIA Triton Inference Server for multi-model serving, dynamic batching, and ensembles
- Build CI/CD pipelines for model deployment using GitOps (e.g., ArgoCD or Flux), including:
  - Automated model validation
  - Canary rollouts and rollback mechanisms
- Optimize GPU utilization for training (Volcano or Kueue) and latency-sensitive inference; use DCGM and Prometheus for efficiency metrics
- Manage model artifacts and versioning with storage backends such as Ceph RBD/CephFS or MinIO integrated into the MLOps toolchain
Networking & storage architecture
- Design and implement the high-bandwidth network fabric that interconnects GPU clusters from edge to core compute, suitable for high-throughput sensor data and AI workloads
About Centific
Centific provides an AI platform designed to run outside hyperscaler cloud environments, including on-premises, government facilities, and network edge locations. The platform focuses on reliable infrastructure for AI training and inference under stringent security and compliance requirements. The role supports end-to-end deployment through MLOps and scalable AI infrastructure engineering.
Scraped 4/10/2026