Key to delivering the highest-availability, lowest-latency cloud platform is building systems that detect and mitigate operational issues before they impact customers. The team is responsible for designing and implementing systems that automate fault containment, problem diagnosis, and issue resolution across multiple highly distributed, always-on architectures. These systems take metric and dependency data from multiple sources, analyse it, and correlate it with customer impact to determine the root cause of an issue without human intervention. They also create engagements and facilitate communication and coordination of the response and mitigation. In this role, you will join the team driving adoption of this software and influencing systems development practices for new and existing products. You will define availability goals for service teams, along with strategies that make those goals attainable with minimal effort. Your goal will be to remove human error from the day-to-day operations of these massive, always-on, distributed systems.

Responsibilities: