Role Overview
We are seeking a Senior Site Reliability Engineer (SRE) with strong expertise in AWS cloud infrastructure, containerised platforms, and Azure DevOps CI/CD pipelines. The successful candidate will focus on improving system reliability, availability, performance, and scalability while enabling engineering teams to deliver high‑quality services efficiently.
This role combines engineering and operational excellence, with a focus on automation, observability, scalability, and resilience across cloud‑native environments. As a senior engineer, you will drive engineering‑led solutions to reduce operational toil, enhance system reliability, and promote DevOps and SRE best practices.
Note: This is a reliability‑focused engineering role with on‑call responsibilities and involvement in platform modernisation initiatives.
Key Responsibilities
Design, implement, and manage highly available and scalable infrastructure on AWS.
Build, maintain, and optimise Azure DevOps Pipelines (CI/CD) for automated build, test, and deployment processes.
Implement end‑to‑end CI/CD workflows, including multi‑stage pipelines, approvals, and release strategies.
Manage and support Windows (IIS, .NET) and Linux‑based production systems.
Deploy, manage, and optimise containerised applications using Docker and Kubernetes (EKS/AKS).
Implement Infrastructure as Code (IaC) using Terraform, CloudFormation, or ARM templates.
Develop and maintain automation scripts using PowerShell, Bash, or Python.
Define and monitor SLIs, SLOs, and SLAs to ensure system reliability.
Implement robust monitoring, logging, and alerting solutions (CloudWatch, Prometheus, Grafana, Azure Monitor).
Lead incident management, troubleshooting, and root cause analysis (RCA) for production issues.
Drive performance tuning and capacity planning for applications and infrastructure.
Collaborate with development teams to improve deployment strategies (blue‑green, canary releases).
Ensure security, compliance, and best practices across CI/CD pipelines and infrastructure.
Qualifications
Required Skills & Experience
8+ years of experience in Site Reliability Engineering / DevOps / Infrastructure Engineering
Strong hands‑on experience with AWS services (EC2, S3, RDS, VPC, IAM, ELB, Auto Scaling, CloudWatch)
Deep expertise in Azure DevOps Pipelines (CI/CD), including YAML pipelines and release automation
Experience designing multi‑stage pipelines and deployment strategies
Expertise in Windows Server administration, including IIS and .NET application support
Strong experience with Linux system administration
Hands‑on experience with Docker and Kubernetes (EKS/AKS)
Experience with Infrastructure as Code (Terraform, CloudFormation, or ARM templates)
Strong scripting skills in PowerShell (mandatory) and Bash/Python
Experience with monitoring and logging tools (Prometheus, Grafana, ELK, CloudWatch)
Solid understanding of networking, security, and cloud architecture principles
Preferred Qualifications
Experience with hybrid cloud or multi‑cloud environments
Knowledge of Active Directory, Group Policy, and enterprise Windows environments
Familiarity with Helm, GitOps practices, or service mesh technologies
Experience with performance testing and tuning
Relevant certifications (AWS, Kubernetes, Azure DevOps)
Key Competencies / Characteristics
Reliability‑driven: focused on uptime, performance, and system resilience
Automation‑first mindset: continuously reduces manual effort and operational toil
Ownership mentality: takes end‑to‑end responsibility from design through production
Strong communicator: clearly articulates incidents, RCA outcomes, and technical concepts
Collaborative: works effectively with platform, security, and application teams
Mentorship mindset: actively supports and develops junior team members
Continuous learner: keeps up with evolving SRE practices and cloud‑native technologies