Overview
We are seeking a highly skilled Platform Automation Engineer with a strong software engineering background to join our Site Reliability Engineering (SRE) team. This role is coding-heavy, focused on developing automation, building resilient services, and ensuring observability and reliability at scale.
Location: Dublin (Hybrid)
Key Responsibilities
* Design, code, test, and deploy software to automate manual operational tasks.
* Develop APIs and services that enhance reliability, scalability, and observability.
* Build and manage software-based infrastructure components across cloud and hybrid environments.
* Reliability & Incident Management
  * Troubleshoot priority incidents, lead post-mortems, and drive permanent resolutions.
  * Balance operational support with engineering initiatives for optimal efficiency.
  * Participate in rotational on-call support as needed.
* Observability Engineering
  * Develop best-in-class monitoring frameworks with Prometheus, Grafana, CloudWatch, Azure Monitor, Honeycomb, or similar tools.
  * Implement low-noise alerting, end-to-end telemetry, and data-driven SLO improvements.
* Create automated solutions for upgrades, release management, and change processes.
* Backup & Recovery Automation
  * Engineer cloud-native automated backup and recovery pipelines.
  * Implement advanced data protection solutions including cyber vault isolation and code-driven recovery.
  * Safeguard data integrity across cloud and on-premises environments.
* Work closely with Cloud Centre of Excellence (CCoE) and Development teams across the lifecycle.
* Coach and mentor team members; lead delivery on complex engineering tasks.
* Contribute to a culture of continuous improvement, reliability, and resilience.
Skills & Experience Required
* Software Engineering & Automation
  * Proficiency in Python and/or Go for automation and service development.
  * Strong experience with API design, development, and integration.
  * Hands-on expertise in automation frameworks, CI/CD, and Infrastructure as Code (e.g., Terraform).
* SRE & Observability
  * Experience designing monitoring and observability solutions (Prometheus, Grafana, CloudWatch, Azure Monitor, etc.).
  * Knowledge of performance monitoring, capacity management, and telemetry pipelines.
  * Exposure to system troubleshooting, stability engineering, and incident response.
* Backup, Recovery & Data Protection
  * Proven experience with automated backup and recovery solutions in cloud environments.
  * Familiarity with data integrity, vaulting mechanisms, and code-driven resilience processes.
* Experience with container orchestration, compute, storage, and network services in cloud platforms.
* Understanding of security principles: SSO, Kerberos, LDAP, Active Directory, etc.
* Business Awareness
  * Risk-aware mindset, with experience in production environments (financial services background a plus).
  * Strong understanding of high-availability and resilience principles.
What We're Looking For
* A software-minded SRE who codes first, automates everything, and thrives on solving operational challenges through engineering.
* Someone who can balance coding, automation, and reliability with hands-on incident troubleshooting.
* A proactive engineer who drives improvements, shares knowledge, and helps shape a modern automation-driven SRE practice.
Seniority level
* Mid-Senior level
Employment type
* Full-time
Job function
* Information Technology
Industries
* Technology, Information and Media