Who you are
* Have experience managing and scaling reliability or infrastructure engineering teams
* Possess deep technical knowledge of distributed systems observability and monitoring at scale
* Understand the unique challenges of operating AI infrastructure and can guide technical decisions
* Have successfully implemented SLO/SLA frameworks and can drive adoption across organizations
* Bring experience with both traditional infrastructure metrics and AI-specific performance indicators
* Can effectively lead technical discussions while translating between ML engineers and infrastructure teams
* Have excellent leadership and communication skills, with the ability to influence at all levels
* Demonstrate strong hiring and talent development capabilities
* Have managed teams operating large-scale model training or serving infrastructure (>1000 GPUs)
* Bring hands-on experience with ML hardware accelerators (GPUs, TPUs, Trainium, etc.)
* Understand ML-specific networking optimizations and their operational implications
* Have led teams through major reliability transformations or infrastructure migrations
* Possess experience building reliability engineering practices from the ground up
* Have contributed to or led open-source infrastructure or ML tooling initiatives
* Demonstrate thought leadership in the reliability engineering community
What the job involves
Anthropic is seeking an experienced engineering leader to manage our Reliability Engineering team. This team includes Software Engineers and Systems Engineers focused on defining and achieving reliability metrics for all of Anthropic's internal and external products and services. As a manager, you'll lead the team that's significantly improving reliability for Anthropic's services while pioneering the use of modern AI capabilities to reengineer how we approach reliability engineering. This leadership role is critical to Anthropic's mission to bring groundbreaking AI technologies to benefit humanity in a safe and reliable way.
* Lead and grow a team of reliability engineers responsible for large language model serving and training systems
* Drive the development of service level objectives (SLOs) that balance availability/latency with development velocity across the organization
* Oversee the design and implementation of comprehensive monitoring systems for availability, latency, and other critical metrics
* Guide your team in architecting high-availability language model serving infrastructure capable of supporting millions of external customers and high-traffic internal workloads
* Lead the strategy for automated failover and recovery systems across multiple regions and cloud providers
* Establish and manage incident response processes for critical AI services, ensuring your team drives rapid recovery and systematic improvements
* Direct cost optimization initiatives for large-scale AI infrastructure, with a focus on accelerator (GPU/TPU/Trainium) utilization and efficiency
* Partner with cross-functional teams to align reliability engineering efforts with broader company objectives
* Build a strong engineering culture focused on reliability, operational excellence, and innovation
Benefits
* Comprehensive health, dental, and vision insurance for you and your dependents
* Inclusive fertility benefits via Carrot Fertility
* Generous subsidy for One Medical
* 21 weeks of paid parental leave
* Unlimited PTO
* Optional equity donation matching at a 3:1 ratio, up to 50% of your equity grant
* 401(k) plan with 4% matching
* $500/month flexible wellness stipend
* Commuter coverage
* Annual education stipend
* A home office improvement stipend when you first join
* Relocation support for those moving to the Bay Area