Reliability Specialist
About the Role:
This position plays a pivotal part in overseeing and improving Udemy's infrastructure, including content delivery networks and databases.
The Reliability Specialist is responsible for improving tooling built on Helm and Terraform, developing environments that empower engineering teams, and raising reliability standards across the organization.
Working closely with development teams, the specialist designs internal tools in Python and Golang, responds to incidents, and promotes reliability best practices.
Responsibilities:
* Working with the reliability team on projects to improve and optimize infrastructure and tooling.
* Championing reliability best practices throughout Udemy's engineering organization.
* Designing and implementing powerful, scalable tools to meet internal customer demands.
* Supporting and maintaining platforms like Kubernetes clusters and CI/CD pipelines.
* Contributing to incident management, identifying root causes, and driving continuous reliability improvements.
* Participating in the on-call rotation to support mission-critical systems.