Site Reliability Engineer
Empower
About the role
Role overview
Site Reliability Engineer (SRE) responsible for the reliability, availability, and operational excellence of Empower’s AWS-based data platform. You’ll apply core SRE practices (production engineering, incident management, root-cause elimination, observability, automation, and capacity planning) to large-scale data infrastructure built on EMR, EMR Serverless, Redshift, DynamoDB, and S3.
Responsibilities
- Own and improve reliability, stability, scalability, and performance of core data platforms and services
- Provide operational support for large-scale, distributed data systems and ensure they meet strict SLAs
- Partner with full-stack, data, and platform engineering teams to drive continuous improvements
- Operate and support EMR / EMR Serverless workloads and data pipelines (Python/Spark)
- Support and optimize Amazon Redshift and DynamoDB in high-throughput production environments
- Design, build, and evolve monitoring, alerting, and observability frameworks (focus on symptoms, not just outages)
- Lead incident response and perform troubleshooting across the full stack; coordinate with internal and external stakeholders
- Conduct root cause analysis (RCA) and readiness reviews; implement durable fixes and automation
- Create and maintain runbooks, SOPs, and operational documentation
- Collaborate to optimize performance, reliability, and cost
- Participate in an on-call rotation for incidents impacting customer-facing systems
- Recommend AWS managed services and architectural patterns
- Continuously evaluate performance, capacity, and cost to scale efficiently
Requirements
- 4–6 years of experience building/operating systems across application, data, integration, infrastructure, and security domains
- 4+ years of hands-on AWS experience with strong production exposure to several of: Redshift, DynamoDB, EMR, EMR Serverless, EC2, S3, Lambda, Step Functions, EventBridge, RDS, IAM
- Proven experience operating data platforms such as data lakes and data warehouses in production
- Strong SQL skills with modern databases (e.g., Redshift, DynamoDB, Postgres, MySQL, Oracle)
- 4+ years of Python experience (scripting, automation, or data workloads)
- Experience with CloudWatch and monitoring/alerting
- Hands-on experience with incident management and uptime SLAs in customer-impacting environments
- Strong understanding of Git-based workflows (e.g., GitHub, Git Flow)
- Experience in Agile environments (Scrum/Kanban) using tools such as Jira and Confluence
- Bachelor’s degree in CS/Information Systems/Data/Analytics or related (or equivalent practical experience)
Additional notes
- Applicants must be authorized to work in the U.S.; the company is unable to sponsor or take over sponsorship of an employment visa (including CPT/OPT).
About Empower
Empower is a financial services company focused on helping customers achieve financial freedom. It emphasizes internal mobility, well-being, and building reliable, purpose-driven teams, with a product and platform ecosystem that includes cloud-based data infrastructure.
Scraped 4/3/2026