Incident Management Specialist
The Role:
We are seeking a highly skilled Incident Management Specialist to join our team. In this role, you will be responsible for responding to and resolving major incidents in a high-availability, high-transaction environment. Your expertise will be critical in ensuring the best customer and colleague experience.
Key Responsibilities:
* Respond to escalations from squads and vendors, including alerts from monitoring tools, using strong facilitation, planning, and time management skills.
* Assess and prioritize multiple incidents based on their impact on customers, business, regulators, reputation, and finances, knowing when to escalate without compromising Service Level Agreements (SLAs).
* Communicate incident status, resolution, and impact clearly and concisely to internal and external stakeholders; gather relevant information for regulators if needed.
* Use our communication tools to keep customers informed in a timely manner and help manage their experience.
* Host or participate in post-mortem meetings with key stakeholders to identify root causes and ensure corrective actions are assigned and followed up.
* Create and progress problem tickets for recurrent issues, ensuring timely resolution in accordance with problem management processes.
* Promote a culture of learning that reduces repeat incidents through shared knowledge and best practices.
* Review incidents across all priorities to identify root causes and impacts, providing accurate reports to key forums for decision-making.
Requirements:
* Proven experience handling complex, major, and crisis incidents in high-availability, high-transaction environments, preferably including environments hosted on AWS.
* Experience working within agile, DevOps, or SRE models.
* Working knowledge of cloud-native monitoring platforms such as Prometheus, Thanos, Grafana, Elasticsearch, and Kibana.
* Ability to lead, influence, work calmly under pressure, and collaborate effectively to achieve the right outcomes.
* Experience with ITIL disciplines including Event, Incident, Problem, Change, and Continual Service Improvement (CSI).