Senior Site Reliability Engineer
* Remote role with cutting-edge, expanding organisation
* 60-70K plus 5% Bonus, 6% Pension, Healthcare for you and family, Life Cover
Key Responsibilities:
* Collaborate with engineering teams to enhance service quality through robust testing, performance tuning, and fault identification.
* Develop automated solutions for maintaining systems and services, working closely with internal engineering teams to keep projects running smoothly.
* Oversee system performance through continuous monitoring, balancing feature development against system reliability in line with established service level objectives.
* Contribute to the practices, technologies, and procedures used to meet security, compliance, and availability requirements across system landscapes.
* Manage, plan, and execute system upgrades to ensure minimal downtime and optimal system availability.
Required Skills & Qualifications:
* Kubernetes:
Extensive expertise in managing, deploying, and troubleshooting production Kubernetes clusters, with experience in container orchestration. Familiarity with Amazon EKS is an advantage.
* Automation & Configuration Management:
Proficiency with Ansible, Helm, and Kustomize for automating infrastructure provisioning and deployment. Skilled at managing Kubernetes manifests and ensuring streamlined application releases across different environments.
* Monitoring Tools:
Hands-on experience with systems like Prometheus and Grafana to monitor system health, identify issues, and optimize performance.
* Cloud Infrastructure (AWS):
Strong knowledge of AWS services such as EC2, S3, IAM, VPC, and associated tools for managing scalable cloud infrastructure.
* Infrastructure as Code (IaC):
Experience with Terraform for provisioning and maintaining cloud resources, ensuring repeatability and version control in cloud deployments.
* Messaging & Queuing Systems:
Familiarity with message brokers such as RabbitMQ or Kafka, or managed services like Amazon MQ, with experience in ensuring reliable communication between distributed systems.
* Database Expertise:
Strong background in managing cloud-based MySQL databases, particularly with Amazon RDS, focusing on high availability, security, and performance.
* Networking & Security:
Solid understanding of network design and security to ensure system protection, compliance, and readiness for industry-standard audits.
* High Availability Systems:
Demonstrated experience in keeping critical systems available through fault tolerance, disaster recovery, and proactive monitoring.
* Collaboration & Cross-functional Teamwork:
Proven ability to work effectively across multiple teams, departments, and stakeholders to execute project plans efficiently.
* Programming:
Competency in high-level programming languages such as Python, Go, or JavaScript, with a strong grasp of modern development tools and CI/CD pipelines for automating testing, deployment, and monitoring.
* Problem Solving & Optimization:
Strong problem-solving skills with a proactive approach to identifying bottlenecks, system issues, and opportunities for performance improvements.