Where we work

Udemy is a global company headquartered in San Francisco, with additional U.S. offices in Denver and Austin, and international hubs in Australia, India, Ireland, Mexico, and Türkiye. This is an in-office position, requiring three days a week in the office (Tuesday, Wednesday, Thursday) and flexibility on Mondays and Fridays.

About your skills

- Extensive knowledge of cloud technologies, with AWS experience being highly advantageous.
- Proven expertise in managing containerized workloads using Kubernetes in production environments.
- Proficiency in programming languages such as Python, Golang, or Kotlin.
- Strong familiarity with infrastructure-as-code (IaC) tools like Terraform and Helm.

About this role

As a Senior Site Reliability Engineer at Udemy, you'll play a critical role in managing and evolving our infrastructure, from our CDN to our databases. You'll oversee and improve tools like Helm and Terraform, build development environments that empower our engineering teams, and enhance reliability standards across the organization. Collaborating closely with dev teams, you'll also design internal tools in Python and Golang while responding to incidents and driving best practices in reliability.

What you'll be doing

- Working with the SRE team and engineering teams across Udemy on projects to enhance and optimize our infrastructure and tooling.
- Championing SRE best practices throughout Udemy's engineering organization.
- Designing and implementing powerful, scalable tools to meet internal customer demands.
- Supporting and maintaining platforms like Kubernetes clusters and CI/CD pipelines.
- Contributing to incident management, identifying root causes, and driving continuous reliability improvements.
- Participating in the on-call rota to support mission-critical systems.

What you'll have

- Hands-on experience managing Kubernetes clusters and cloud environments at scale.
- Solid expertise in deploying infrastructure using infrastructure-as-code tools.
- Strong proficiency in writing tools and applications using languages such as Python, Golang, or Kotlin.
- Proven ability to take part in an on-call rotation and manage incidents effectively.
- A track record of working with diverse engineering teams, providing guidance on best practices for reliability and scalability.
- Excellent communication skills with a collaborative mindset, including the ability to both give and receive feedback constructively.

We understand that not everyone will match each of the above qualifications. However, we also realize that everyone has unique experiences that can add value to our company. Even if you think your background might not perfectly align, we'd love to hear from you.