We are seeking a Senior Network Site Reliability Engineer (NetSRE) in Ireland to help ensure the reliability, scalability, and operational excellence of mission‑critical network systems that power advanced AI workloads and distributed platforms. The role blends networking, automation, and site reliability engineering to design resilient systems and optimize performance in a fast‑scaling global AI cloud infrastructure environment.
Accountabilities
Define and manage reliability objectives for critical network services, including SLIs, SLOs, availability targets, and operational performance standards.
Lead initiatives to improve overall network reliability across infrastructure, inter‑site connectivity, and operational workflows.
Own incident response processes for networking environments, conduct root‑cause investigations, and implement long‑term corrective solutions.
Design and enhance observability systems through metrics, logging, tracing, alerting, and monitoring improvements to accelerate troubleshooting and recovery.
Build and maintain automation, CI/CD pipelines, testing environments, rollback mechanisms, and safe deployment processes for network changes.
Collaborate with platform engineering and infrastructure teams to improve operability, scalability, and reliability of networking systems.
Develop tooling and automation solutions using modern programming languages and infrastructure management practices.
Support operational readiness and scalability initiatives for high‑availability and high‑throughput networking environments.
Requirements
Strong experience in Site Reliability Engineering, Network Engineering, or Infrastructure Engineering roles within large‑scale production environments.
Solid Linux systems administration expertise and proven ability to troubleshoot complex distributed systems.
Strong understanding of networking fundamentals, including failure domains, latency, packet loss, control plane/data plane concepts, and high‑availability architectures.
Hands‑on experience operating and improving reliable production systems through automation and engineering best practices.
Proficiency in software development or scripting using Go, Python, or similar programming languages.
Experience with infrastructure‑as‑code, CI/CD pipelines, containerized environments, and operational automation tools.
Familiarity with observability, telemetry, monitoring systems, and incident management practices.
Ability to collaborate across engineering teams, with strong ownership and clear communication.
Additional experience with eBPF/XDP, DPDK, large‑scale network telemetry, NAT64, load balancing, or advanced networking performance optimization is a strong plus.
Benefits
Competitive compensation package.
Flexible remote work options across Europe.
Career development and continuous learning opportunities.
Collaborative and engineering‑driven work environment.
Opportunity to contribute to cutting‑edge AI infrastructure projects.
Exposure to international teams and large‑scale distributed systems.
High‑impact role with strong ownership and technical influence.
Supportive culture focused on innovation, growth, and work‑life balance.