Build large-scale, massively distributed systems to ensure reliability and uptime. Site Reliability Engineering (SRE) combines software and systems engineering to build and run our services.
Requirements:
* Bachelor's degree in Computer Science or related field
* 5 years of experience with software development in one or more programming languages
* 3 years of experience designing, analyzing, and troubleshooting large-scale distributed systems
* 2 years of experience leading projects and providing technical leadership
Preferred Qualifications:
* Master's degree in Computer Science or Engineering
* Experience with Large Language Models (LLMs) and Generative AI agents
As an SRE on our team, you'll work closely with developers to roll out the latest ML/LLM technologies to a wide range of users and products. You'll help implement platforms that power agentic capabilities for Gemini and other major product areas across Google. We're looking for someone who prioritizes understanding and meeting the needs of real users, especially emerging users.
Your Responsibilities:
1. Improve the efficiency of services by optimizing resource allocation so that they achieve more with fewer resources
2. Decide when and how to break down large, complex systems into manageable components
3. Collaborate with the development team to improve system performance and reliability