We are seeking a highly skilled Reliability Engineering Leader to drive innovation and growth in our banking, payments, and capital markets business.
Job Description:
The successful candidate will lead the design and evolution of observability, monitoring, and alerting systems that provide end-to-end visibility and detect issues proactively. They will also implement scalable automation frameworks for infrastructure provisioning, deployment pipelines, and operational tasks to ensure application reliability, availability, and performance.
Key Responsibilities:
* Lead the design and evolution of observability, monitoring, and alerting systems
* Implement scalable automation frameworks for infrastructure provisioning, deployment pipelines, and operational tasks
* Ensure application reliability, availability, and performance
* Own incident management processes
* Mentor and guide colleagues
Requirements:
* Proven experience in a Principal or Lead SRE/DevOps/Infrastructure Engineering role within complex, high-availability environments
* Deep expertise in cloud platforms (AWS, Azure, or GCP) and Infrastructure as Code (Terraform, CloudFormation, etc.)
* Strong background in monitoring tools (Prometheus, Grafana, Datadog) and logging frameworks (Splunk, ELK Stack)
* Advanced proficiency in scripting and automation (Python, Bash, Ansible)
* Demonstrated leadership in incident response and post-mortem culture
What We Offer:
* A collaborative work environment with flexible working hours
* A competitive salary and attractive range of benefits designed to support your lifestyle and wellbeing
* The opportunity to grow your technical skillset in a challenging and varied work environment