Job Title: Cloud Reliability Expert
We are looking for a self-motivated, driven and creative professional to join our Site Reliability team. As a key member of our engineering team, you will be responsible for designing, implementing and operating large-scale distributed systems.
Responsibilities:
* Build and automate tools for incident impact analysis and effective mitigation
* Write high-quality code that is easy to maintain and test
* Craft and build platform components needed to power the observability stack
* Ensure design and architecture are extensible across projects, and participate in technical design and code reviews
* Work with Product Management, collaborators and other developers to understand design requirements and provide estimates for development
* Stay up to date with the latest observability standard methodologies and share your findings with the team
* Work as part of a cross-site development team
* Raise issues proactively that might impact delivery commitments
* Work with Operations and Incident Command teams during and after incidents to drive excellence in the incident management process
* Build and analyze dashboards to highlight areas of the business that need attention and help drive organizational KPIs
* Create and respond to system-generated alerts to maintain system health
* Work with Operations and Engineers to fill any gaps in alerting and telemetry
* Coach and mentor other team members on new technologies