Major Incident Lead | SRE, Cloud & Incident Management | Mission-Critical Platforms | Global HealthTech
Dublin, Ireland (initially remote, moving to 4 days onsite)
Permanent, Full-time
The role:
You will lead the response to high-severity, customer-impacting incidents across a global managed services platform.
This is a hands-on incident leadership role, acting as the central point of coordination during major outages to ensure fast resolution, clear communication and minimal impact. You will operate in an SRE-led environment, combining real-time incident management with ongoing reliability improvements.
You will work across engineering, cloud and support teams to manage incidents end-to-end and drive post-incident improvements to prevent recurrence.
The role includes a paid on-call rotation. Incidents are relatively infrequent, so the focus is on being ready when it matters.
Non-Negotiables:
SRE, production operations or service reliability background
Experience in 24/7, mission-critical environments
Strong understanding of incident, problem and change management (ITIL)
Experience with cloud platforms (AWS, Azure or GCP)
Experience with incident tooling (ServiceNow, Jira, PagerDuty)
Strong stakeholder communication (including senior leadership)
Ability to lead under pressure and make clear, structured decisions
What You’ll Work With
ServiceNow / Jira / PagerDuty
Observability tooling (Grafana, Prometheus, Datadog, Splunk)
Cloud monitoring tools (CloudWatch)
Runbooks and incident playbooks
High-availability, distributed systems
Core Responsibilities:
Lead end-to-end management of major incidents (P1/P2)
Act as Incident Commander during live incidents
Coordinate across engineering, SRE, cloud and support teams
Deliver clear, timely updates to stakeholders and customers
Drive post-incident reviews and root cause analysis (RCA)
Ensure corrective and preventative actions are defined and tracked
Identify patterns, trends and repeat issues for improvement
Improve incident processes, tooling and runbooks
Ensure incidents meet SLA and regulatory requirements
Maintain audit-ready documentation and reporting
Participate in on-call rotation and incident simulations
Examples of the work:
Leading response to critical outages across cloud platforms
Coordinating multi-team incident resolution under time pressure
Communicating with senior stakeholders during live incidents
Running post-mortems to prevent repeat failures
Improving incident response processes and automation
Identifying reliability gaps across a 24/7 platform
Nice to Haves
Background in Site Reliability Engineering (SRE) roles
Experience in MSP or managed services environments
Experience in regulated or healthcare systems
Exposure to data platforms or high-throughput systems
Interest in AI or automation in incident management
Why Join
You will join a team responsible for mission-critical platforms where reliability is non-negotiable.
This role sits at the centre of operations, giving you visibility across engineering, cloud and customer environments. You will have real ownership over how incidents are handled and how the platform improves over time.
The business is also investing heavily in its cloud and managed services capability, making this a strong opportunity to step into a role with long-term growth and impact.
Employee Benefits
12.5% annual bonus
Paid on-call allowance and call-out compensation
Clear progression as the Ireland function scales
High-impact role with exposure to global teams and leadership