xelys jobs

Software Architect (Reliability Engineering)

Twilio

full-remote, architect, permanent, backend, engineering-management

Full remote, 2 days ago via WTTJ

Tags

SRE, Reliability Engineering, Kubernetes, AWS, Terraform, Observability, Prometheus, Grafana, Datadog, Distributed Systems

About the role

Role overview

Join Twilio as a Software Architect (Reliability Engineering) in a fully remote role. You’ll lead the technical strategy and vision for Twilio’s reliability initiatives, ensuring products are reliable worldwide. You will influence architectural decisions, design scalable reliability solutions, mentor engineers and technical leaders, and help drive adoption of SRE and cloud best practices.

Key missions

  • Define and lead reliability solutions and initiatives to ensure global product reliability.
  • Partner with technical and product teams to identify reliability risks and turn them into actionable designs, programs, and tools.
  • Establish and promote reliability practices and drive systemic improvements across the organization.

Responsibilities

  • Lead design and implementation of scalable reliability architectures.
  • Drive technical decision-making and long-term reliability strategy.
  • Mentor and grow engineers and technical leaders.
  • Track and apply emerging SRE and cloud best practices.
  • Run cross-functional post-incident reviews and drive improvements.

Requirements

  • 15+ years of experience in Reliability Engineering, Software Engineering, and/or DevOps roles, including principal/architect-level experience.
  • Strong production experience (operational management, scaling, partitioning strategies, performance and reliability tuning) in high-scale environments.
  • Deep understanding of the role of Reliability Engineering in a large, diverse SaaS organization.
  • Hands-on experience with Kubernetes (e.g., EKS), deploying/managing stateful services, and cloud services such as AWS.
  • Knowledge of cloud architecture, DevOps practices, and large-scale systems design with microservices.
  • Expertise in observability and monitoring/alerting for distributed systems (e.g., Prometheus, Grafana, Datadog).
  • Proven ability to influence cross-org architectural outcomes and build effective working relationships.
  • Experience owning and operating large AWS environments.
  • Ability to design and implement incident response processes, SLOs/SLIs, and runbooks, and to participate in on-call rotations.
  • Strong distributed systems fundamentals (consensus, durability, throughput, availability tradeoffs).
  • Experience with reliable data-streaming technologies such as Apache Kafka or AWS MSK.
  • Proficiency in at least one programming language for automation and tooling (e.g., Go, Python, Java).

Nice to have / additional

  • Experience driving reliability improvements in data-intensive or mission-critical systems.
  • Experience with infrastructure as code (e.g., Terraform or CloudFormation).
  • Bachelor’s or Master’s degree in CS/Engineering (or equivalent experience).

About Twilio

Twilio is a cloud communications platform company that provides reliable APIs and services for messaging, voice, and related communications. The role sits within Twilio’s Reliability Engineering organization, focused on ensuring Twilio’s products remain dependable worldwide across large-scale cloud systems.

Scraped 5/15/2026
