Job Description
A Site Reliability Engineer is required to join the team on a permanent basis in a hybrid role with 3 days onsite. The ideal candidate will deliver cloud-based ERP solutions and ensure the reliability, scalability, and security of mission-critical systems, working closely with the software engineering, DevOps, and product teams.
Key Responsibilities
* Infrastructure Development: Design and maintain scalable and reliable infrastructure systems.
* Cloud Architecture: Build and manage resilient cloud-based solutions to support enterprise applications.
* Automation: Create scripts using Python, Unix Shell, and PowerShell to automate deployments, monitoring, and troubleshooting.
* Observability: Implement logging, monitoring, alerting, and performance tuning practices to proactively identify and resolve issues.
* Containerisation: Deploy and manage containerised applications using Docker, Kubernetes, and other orchestration tools.
* Incident Management: Respond to infrastructure and network-related incidents, ensuring minimal impact on users.
* Security & Scalability: Collaborate with development teams to embed reliability, security, and scalability into the development lifecycle.
* Continuous Improvement: Participate in on-call rotations and drive initiatives to reduce operational overhead and increase system resilience.