Job Title:
Cloud System Reliability Engineer
About the Role:
We are seeking a skilled Cloud System Reliability Engineer to join our team. In this role, you will be responsible for the high availability, reliability, and performance of our Azure-hosted services.
Key Responsibilities:
* Evaluate system design and architecture to ensure scalability, reliability, and security.
* Design, implement, and maintain robust monitoring, alerting, and observability tooling to detect and resolve issues promptly.
* Develop and execute automated provisioning, deployment, scaling, and incident response processes using infrastructure-as-code tools such as ARM templates, Bicep, or Terraform.
* Use data-driven approaches to optimize system performance and capacity in line with business objectives.
* Maintain a strong security and compliance posture, adhering to standards and regulations such as ISO 27001, SOC 2, and GDPR.
Requirements:
* Proven experience in a SaaS or software product environment, with a focus on Microsoft Azure infrastructure and services.
* Strong scripting and automation skills, preferably in PowerShell.
* Familiarity with monitoring platforms such as Azure Monitor, Grafana, Prometheus, or Datadog.
* Knowledge of containerization (Docker/Kubernetes) and CI/CD tooling.
* Skilled in incident response, root cause analysis, and system resilience.
* Understanding of cloud security and regulatory compliance standards.
Desirable Qualifications:
* Azure certifications (e.g., Azure Administrator, Azure Solutions Architect).
* Experience in the London Insurance Market.
* Familiarity with ServiceNow, PagerDuty, or other incident management tools.
Work Setup:
* Hybrid – 2 days onsite.