About the Role
We are seeking a highly skilled Site Reliability Engineer to join our team. This individual will be responsible for designing and maintaining monitoring solutions, implementing automation tools, and ensuring application reliability and performance.
Key Responsibilities:
* Design and maintain monitoring solutions for infrastructure, application performance, and user experience.
* Implement automation tools to streamline tasks, scale infrastructure, and ensure seamless deployments.
* Ensure application reliability, availability, and performance, minimizing downtime and optimizing response times.
* Lead incident response, including identification, triage, resolution, and post-incident analysis.
* Conduct capacity planning, performance tuning, and resource optimization.
* Collaborate with security teams to implement best practices and ensure compliance.
* Manage deployment pipelines and configuration management for consistent and reliable application deployments.
* Develop and test disaster recovery plans and backup strategies.
* Collaborate with development, QA, DevOps, and product teams to align on reliability goals and incident response processes.
Requirements:
* Proficiency with web and API development technologies, architectures, and platforms.
* Experience with cloud platforms (AWS, Azure, Google Cloud) and infrastructure-as-code (IaC) tools.
* Knowledge of monitoring tools (Prometheus, Grafana, Datadog) and logging platforms (Splunk, ELK Stack).
* Experience in incident management and post-mortem reviews.
* Strong troubleshooting skills for complex technical issues.
* Proficiency in scripting languages (Python, Bash) and automation tools (Terraform, Ansible).
* Experience with CI/CD pipelines (Jenkins, GitLab CI/CD, Azure DevOps).
* A strong sense of ownership over engineering and product outcomes.
* Excellent interpersonal communication, negotiation, and influencing skills.