Staff Site Reliability Engineer
Stellar Cyber
full-remote, lead, permanent, backend, devops
Full remote, 2 days ago via WTTJ
Tags
Site Reliability Engineering, Kubernetes, Terraform, CI/CD, ArgoCD, Observability, Prometheus, Grafana, Loki, Incident Management
About the role
Staff Site Reliability Engineer (SRE)
Join Stellar Cyber to drive reliability, scalability, and efficiency across production systems.
Responsibilities
- Administer and maintain Kubernetes/container orchestration platforms and containerized workloads to ensure high availability and resilience.
- Improve observability by enhancing monitoring, logging, and alerting across systems and data platforms.
- Build and maintain CI/CD pipelines for efficient and reliable deployments, applying Infrastructure as Code (IaC) practices.
- Lead or influence architecture, tooling, and SRE best practices as a senior member of the team.
- Own production on-call operations, incident management, and reliability-focused culture.
Requirements
- 5+ years in Site Reliability Engineering, DevOps, or Platform Engineering.
- Advanced Kubernetes administration and troubleshooting.
- Deep understanding of IaC (e.g., Terraform, Helm).
- Experience with CI/CD tools such as GitHub Actions, Bitbucket, and ArgoCD.
- Strong observability experience with Prometheus, Grafana, Loki, and Alertmanager.
- Strong production incident management/on-call experience.
- Expertise operating data platforms including Elasticsearch and MongoDB, plus other listed systems.
- Strong distributed systems, databases, networking, and Linux administration background.
- Automation/programming skills in Python and Bash.
- Proven success operating large-scale production systems in public cloud environments (AWS/GCP/Azure/OCI).
- Excellent problem-solving, communication, and leadership skills.
Nice-to-haves
- Knowledge of AI agents for auto-triaging alerts and correlating signals to form root-cause hypotheses.
- Experience with chat-based operations interfaces and/or auto-remediation controllers using AI agentic frameworks.
- Certifications in AWS/GCP/Observability/Linux/Kubernetes.
Location
- Full remote
About Stellar Cyber
Stellar Cyber is a technology company focused on operational excellence and reliable production systems. The role described centers on building and operating scalable cloud infrastructure, observability, and deployment pipelines for mission-critical platforms in the cyber/AI data space.