xelys jobs

Site Reliability Engineer

CRG - People and Technology

Full-remote | Mid | Permanent | United States | Yesterday via LinkedIn

About the role

Role Overview

Product Reliability Engineer / SRE role at an innovative infrastructure platform company. The role bridges customer experience and engineering excellence, with a focus on keeping the platform reliable across diverse customer environments.

Key Responsibilities

  • Partner with customers and internal teams to manage L2/L3 escalations, resolving complex deployment, upgrade, and runtime issues
  • Perform deep root cause analysis and reproduce issues through to resolution
  • Collaborate with engineering teams to implement fixes and prevent recurrence
  • Develop and maintain diagnostic tooling (health checks, support bundles, environment validators)
  • Improve and scale test automation and CI environments, reducing flakiness and increasing reliability
  • Build reproducible environments for debugging and integration testing
  • Establish performance baselines and regression testing frameworks
  • Identify recurring failure patterns and improve installation and upgrade robustness
  • Write production-quality code in Python, Go, or Rust to enhance platform reliability
  • Transform customer issues into actionable improvements across testing, observability, and product design

Required Experience

  • 4–7 years in SRE, Platform Engineering, or Production Engineering roles
  • Strong hands-on experience with Kubernetes (troubleshooting networking, storage, RBAC, multi-environment setups)
  • Experience in startup environments
  • Solid programming skills in Python, Go, or Rust
  • Proven ability to diagnose complex distributed systems using logs, metrics, and tracing
  • Experience managing customer-facing escalations or high-severity incidents
  • Strong problem-solving skills for complex, ambiguous issues
  • Excellent written and verbal communication
  • Comfortable working remotely with high autonomy

Desirable Skills

  • Experience with containers, Helm, and deployment/upgrade workflows
  • Knowledge of CI/CD at scale, including test parallelization and reproducible environments
  • Familiarity with performance testing and profiling tools
  • Background in customer-facing engineering roles (support, solutions, escalation engineering)
  • Open-source contributions, particularly in infrastructure or observability

About CRG - People and Technology

CRG - People and Technology partners with an innovative technology business that develops a schema-driven infrastructure platform serving as a single source of truth for modern systems. The platform is designed for on-premise deployment where reliability is a core product feature.

Scraped 3/31/2026
