xelys jobs

Staff Site Reliability Engineer

Fieldguide

full-remote, lead, permanent, devops, backend | Full remote | Today via WTTJ

Tags

Site Reliability Engineering (SRE), Terraform, AWS, SLO/SLI, Observability, Datadog, Prometheus, Grafana, Incident Management, Distributed Systems

About the role

Role Overview

Join Fieldguide as a Staff Site Reliability Engineer (SRE) to lead the reliability, scalability, and observability strategy for the platform. You’ll influence system design and reliability practices across multiple teams, ensuring reliability is built into products from the ground up.

Responsibilities

  • Lead the design and evolution of highly scalable, fault-tolerant distributed systems across cloud infrastructure.
  • Define and drive adoption of Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets across engineering teams.
  • Establish and enforce best practices for incident response, on-call, and operational excellence.
  • Own root cause analysis and drive systemic reliability improvements.
  • Provide technical leadership, set engineering standards, and mentor engineers.

Requirements

  • 10+ years in software engineering, focused on distributed systems and production infrastructure.
  • Deep expertise in system reliability, scalability, and performance engineering at scale.
  • Strong software engineering fundamentals (ability to contribute to and review complex codebases).
  • Proficiency with Infrastructure as Code, especially Terraform (or equivalent).
  • Strong experience with observability (e.g., Datadog, Prometheus, Grafana).
  • Experience operating and scaling distributed systems in the cloud, with strong preference for AWS.
  • Proven ability to lead incident management, run post-mortems, and improve production operations.
  • Experience designing and operating multi-region and globally distributed systems.
  • Expertise in distributed tracing and performance analysis.
  • Hands-on experience with database scalability and performance tuning.

Nice-to-Haves

  • Familiarity with compliance-driven environments (e.g., SOC 2, FedRAMP).
  • Experience applying chaos engineering to validate and improve resilience.
  • Experience building or scaling an SRE function in a high-growth organization.
  • Strong written and verbal communication skills for translating complex ideas to diverse audiences.
  • Ability to balance tactical needs with strategic architectural improvements.

Scraped 5/12/2026
