Software Engineering Manager
Affirm
About the role
Join Affirm as a Software Engineering Manager to lead the Resilience Engineering team. This critical role focuses on ensuring the safety and reliability of production systems through proactive validation techniques. You will define and drive the vision for resilience engineering, lead and mentor a team of engineers, and partner with infrastructure, product, and security leadership. Additionally, you will establish best practices for testing system limits and failure scenarios in production, own the design and evolution of platforms for safe production experimentation, and drive reliability improvements.
Key missions:
- Lead the development of systems and practices that allow engineers to safely test system behavior under stress and failure conditions in production.
- Define and drive the vision for resilience engineering at Affirm, with a focus on production load testing and chaos engineering as first-class engineering practices.
- Establish best practices for safely testing system limits and failure scenarios in production, and own the design and evolution of platforms that enable safe, controlled production load testing and fault injection.
Profile:
- Experience using a chaos engineering vendor such as Gremlin, Harness, or similar
- Strong programming background (e.g., Python, Kotlin, Java, or similar)
- Strong understanding of failure modes in distributed systems, including latency, partial failure, and cascading outages
- Strong communication and leadership skills, with a track record of influencing engineering practices across teams
- Experience building or operating systems with strong safety guarantees (isolation, rate limiting, guardrails, auditability)
- Hands-on experience with production load testing, chaos engineering, or large-scale system validation
- Excellent problem-solving skills and the ability to balance long-term resilience investments with immediate business needs
- Proven experience leading engineering teams in reliability, infrastructure, or distributed systems
- Familiarity with cloud-native environments (AWS, Kubernetes) and observability tooling
Scraped 5/12/2026