Overview

We are seeking a highly skilled Platform Automation Engineer with a strong software engineering background to join our Site Reliability Engineering (SRE) team. This role is coding-heavy, focused on developing automation, building resilient services, and ensuring observability and reliability at scale.

Location: Dublin (Hybrid)

Key Responsibilities

Design, code, test, and deploy software to automate manual operational tasks.
Develop APIs and services that enhance reliability, scalability, and observability.
Build and manage software-based infrastructure components across cloud and hybrid environments.

Reliability & Incident Management

Troubleshoot priority incidents, lead post-mortems, and drive permanent resolutions.
Balance operational support with engineering initiatives for optimal efficiency.
Participate in rotational on-call support as needed.

Observability Engineering

Develop best-in-class monitoring frameworks with Prometheus, Grafana, CloudWatch, Azure Monitor, Honeycomb or similar.
Implement noiseless alerting, end-to-end telemetry, and data-driven SLO improvements.
Create automated solutions for upgrades, release management, and change processes.

Backup & Recovery Automation

Engineer cloud-native automated backup and recovery pipelines.
Implement advanced data protection solutions including cyber vault isolation and code-driven recovery.
Safeguard data integrity across cloud and on-premises environments.

Work closely with Cloud Centre of Excellence (CCoE) and Development teams across the lifecycle.
Coach and mentor team members; lead delivery on complex engineering tasks.
Contribute to a culture of continuous improvement, reliability, and resilience.

Skills & Experience Required

Software Engineering & Automation

Proficiency in Python and/or Go for automation and service development.
Strong experience with API design, development, and integration.
Hands-on expertise in automation frameworks, CI/CD, and Infrastructure as Code (e.g., Terraform).

SRE & Observability

Experience in designing monitoring and observability solutions (Prometheus, Grafana, CloudWatch, Azure Monitor, etc.).
Knowledge of performance monitoring, capacity management, and telemetry pipelines.
Exposure to system troubleshooting, stability engineering, and incident response.

Backup, Recovery & Data Protection

Proven experience in automated backup and recovery solutions in cloud environments.
Familiarity with data integrity, vaulting mechanisms, and code-driven resilience processes.

Experience with container orchestration, compute, storage, and network services in cloud platforms.
Understanding of security principles: SSO, Kerberos, LDAP, Active Directory, etc.

Business Awareness

Risk-aware mindset, with experience in production environments (financial services background a plus).
Strong understanding of high-availability and resilience principles.

What We're Looking For

A software-minded SRE who codes first, automates everything, and thrives on solving operational challenges through engineering.
Someone who can balance coding, automation, and reliability with hands-on incident troubleshooting.
A proactive engineer who drives improvements, shares knowledge, and helps shape a modern automation-driven SRE practice.

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Information Technology
Industries: Technology, Information and Media