Transformative technology is at the heart of our mission to revolutionize cloud infrastructure. We are pioneers in vertically integrated, purpose-built AI solutions trusted by Fortune 500 companies to power their most advanced AI applications.
About This Role:
SREs play a pivotal role in ensuring the reliability and performance of our infrastructure. Our SRE team detects, analyzes, and prevents issues to maintain high Service Level Agreements through Service Level Indicators and Service Level Objectives. Through automation and proactive remediation, our SREs not only resolve common errors automatically but also advise various engineering teams in building resilient code.
What You'll Be Working On:
Your day begins with a review of overnight alerts and system performance metrics to ensure everything is running smoothly. We operate close to the hardware, so many of these will be novel hardware failures on cutting-edge equipment that are challenging to diagnose and automate. Your tasks might include automating routine processes, analyzing system logs, and developing tools to enhance our monitoring capabilities. You'll spend part of your day working closely with software engineers, advising on best practices for resilient code and reviewing changes before deployment. Regularly, you will engage in incident response drills, post-mortems, and root cause analysis sessions to learn from past issues and prevent future ones. Throughout the day, you will stay focused on maintaining high SLIs and SLOs, ensuring that our infrastructure remains robust and reliable for our customers.
We prioritize anticipating and resolving issues before they impact our customers. Our customer-centric approach ensures that clients always have access to the virtual machines they depend on. By day's end, you will document your work, share insights with your team, and plan for the next day's challenges, always with a customer-centric mindset.