Job Description
We are seeking a highly skilled Senior Site Reliability Engineer to join our team. As a key member of our engineering organization, you will play a crucial role in ensuring the reliability and availability of our products and services.
Your primary responsibility will be to standardize key Site Reliability Engineering principles across our products, focusing on improving the overall quality and performance of our services. You will partner closely with engineering teams to ensure that our products meet architecture and observability design requirements.
Main Responsibilities:
* Monitor, measure, and improve the reliability, availability, and scalability of our products and infrastructure.
* Collaborate with engineering teams to assess operational readiness, ensuring that our products meet architecture and observability design requirements.
* Lead the new product introduction process for SRE, ensuring that our SRE team can provide operational support to our products.
* Embed across multiple engineering teams, participating in early-stage design discussions and ensuring that high-availability and scalability criteria are considered in product design.
* Identify manual routine operational practices and build robust automation capabilities using code and modern tools.
* Work with product developers and business stakeholders to gather requirements for enabling and improving performance monitoring for applications and services.
* Participate in incident response and post-mortem analysis to investigate root causes and capture contributing factors for remediation.
* Analyze previous incidents and trend/usage patterns to better predict issues and take proactive actions.
* Design and build custom tools as needed to support process optimization and improve operational efficiency.
* Participate in 24x7 rotational shifts and on-call duty to handle production operations issues.
* Engage in service capacity planning and demand forecasting, software performance analysis, and system tuning.
* Create meaningful dashboards/reports for application telemetry and infrastructure health to proactively identify performance constraints and bottlenecks.
Requirements
To succeed in this role, you will need to have the following skills and qualifications:
* A minimum of 7 years of experience, with a strong understanding of cloud-based architecture and operations.
* Hands-on experience with Amazon Web Services is preferred.
* Experience building, administering, and managing Linux systems.
* A foundational understanding of infrastructure and platform technology stacks.
* A strong understanding of networking concepts and theories, including different protocols, VLAN configuration, DNS, OSI layers, and load balancing.
* An understanding of security architecture and certificate management.
* Working knowledge of infrastructure and application monitoring platforms such as Grafana Cloud, Xymon, and LibreNMS.
* Working knowledge of incident response and alerting platforms such as PagerDuty, Opsgenie, and xMatters.
* An understanding of core DevOps practices, including CI/CD pipelines and release management.
* The ability to write code in at least one modern programming language (such as Python, JavaScript, or Ruby). Additional scripting skills are preferred.
* An understanding of, and experience with, configuration management platforms (Chef/Puppet/Ansible).
* Prior experience with cloud management automation tools (such as Terraform or CloudFormation) is crucial.
* Experience with source code management software and API automation is crucial.
* Cloud certifications or equivalent experience are highly regarded.
* A service availability-oriented mindset with a proactive approach to problem-solving.
* The ability to develop automated solutions to prevent recurring problems.
* A willingness to challenge the status quo and optimize current procedures and processes.
* A strong sense of ownership and an ability to drive cross-functional process improvement.
* Excellent interpersonal, written, and verbal communication skills.
* An analytical and logical approach to problem-solving and a willingness to automate repetitive tasks and reduce manual/reactive workload.
* The ability and willingness to coach and mentor team members and colleagues.