Job Title
Serve as a highly skilled technical professional responsible for ensuring the reliability and performance of our cloud-based platform.
* This is a swing shift role (4 days a week) and requires you to be based in the Republic of Ireland.
About Us
We serve over 8,100 customers, including 85% of the Fortune 500. Our intelligent platform connects people, systems, and processes to empower organizations to work smarter, faster, and better.
Your Role
You will be responsible for managing and resolving challenging issues for our ServiceNow SRE team, focusing on instance performance, reliability, and availability.
Key Responsibilities
* Provide relief and sustainable resolution to infrastructure issues.
* Conduct root cause analysis of incidents and implement preventive measures.
* Participate in troubleshooting bridges and provide support during critical incidents.
* Use software development, systems engineering, and networking expertise to prevent recurring issues.
* Drive initiatives with partner teams to improve infrastructure reliability and performance through better system design.
* Design, develop, and maintain scalable and reliable systems.
* Implement and manage monitoring, alerting, and incident response processes.
* Collaborate with development teams to ensure the reliability and performance of new features.
* Automate repetitive tasks to improve efficiency and reduce human error.
* Innovate and continuously improve system reliability, performance, and capacity.
Requirements
* Experience integrating AI into work processes, decision-making, or problem-solving.
* 8+ years of experience in a Site Reliability Engineering or similar role.
* Degree in Computer Science, Engineering, or related field.
* Self-motivated go-getter attitude with proven ability to lead and drive initiatives across the organization.
* Ability to inspire collaboration, navigate ambiguity, and drive initiatives from concept to execution.
* Extensive experience with ITIL-based IT operations, including incident, problem, and change management.
* Advanced expertise in Unix/Linux system administration.
* Proficiency with automation tools and security best practices.
* Comprehensive knowledge of networking protocols and relational databases.
* Experience with infrastructure-as-code and configuration management tools.
* Strong programming skills in languages such as Python, Go, or Java.
* Cloud experience with AWS, Azure, or GCP.
* Proficiency in using monitoring and logging tools like Splunk, Prometheus, Grafana, or ELK stack.
* Experience with Kubernetes to orchestrate container deployment and management.
* Excellent problem-solving skills and attention to detail.
* Excellent written and verbal communication skills.