Senior SRE, Site Reliability Engineer

Dublin

Klaviyo

Site reliability engineer

Posted: 2 February

Offer description

Senior Site Reliability Engineer – Site Reliability Engineering (Dublin)

Team Overview

As a Senior Site Reliability Engineer, you'll ensure Klaviyo's critical platforms are reliable, scalable, and sustainable while enabling rapid product development. We treat reliability as a core product feature and use software engineering to solve complex systems and operational challenges.

Our work spans security, infrastructure, and software development, requiring us to understand systems and engineering. We build complex, foundational solutions that must be extremely reliable, secure, and performant at global scale.

Our charter is to build and operate foundational services and infrastructure, define clear reliability objectives, reduce operational toil through automation, and continuously improve systems based on real production learnings. The work is highly visible and directly impacts how Klaviyos build software and how customers experience Klaviyo every day.

How you'll make an impact

As a Senior Site Reliability Engineer, you will build and operate the platforms, systems, and services that underpin Klaviyo's reliability and operational excellence. You will:

- Build and operate foundational, security-critical services with a strong emphasis on availability, scalability, latency, and fault tolerance
- Apply software engineering principles to automate infrastructure, reduce operational toil, and improve system reliability at scale
- Design, implement, and evolve systems using SRE best practices
- Define and refine SLIs, SLOs, and error budgets to guide engineering decisions
- Improve observability, alerting, and incident response to reduce mean time to detection and recovery
- Participate in on-call rotations with a focus on sustainable operations and automatic remediations
- Perform quantitative analysis to understand system behavior, capacity constraints, and scaling limits
- Identify systemic risks and reliability bottlenecks and drive long-term, preventative solutions
- Collaborate closely with product, platform, and security engineers to influence architecture early and ship reliable systems
- Mentor and pair with other engineers, helping raise the bar for reliability, operational maturity, and engineering excellence

Who you are

You are a cloud-native, platform-focused SRE who uses software to build and operate reliable production systems at scale.

- You write and maintain production-quality code (e.g. Python, Go, or similar) to build internal platforms, automate operations, and improve system reliability
- You have built, deployed, and operated distributed, cloud-native systems and understand failure modes such as partial outages, dependency failures, resource saturation, and cascading impact
- You have experience operating containerized workloads and platforms (e.g. Kubernetes) in production, including deployment strategies, scaling behavior, and service networking
- You are comfortable participating in on-call rotations and diagnosing production issues
- You have designed and operated observability systems and know how to build actionable alerts that reflect real user and service impact
- You apply SRE concepts such as SLIs, SLOs, error budgets, and burn-rate–based alerting to guide engineering decisions and operational response
- You have hands-on experience with infrastructure as code and declarative configuration (e.g. Terraform, Kubernetes manifests, policy-as-code)
- You have performed capacity planning, load testing, and performance analysis for distributed services and platforms
- You routinely contribute to post-incident reviews and drive concrete, code-focused follow-up actions that prevent recurrence
- You are comfortable reviewing and contributing to technical designs, platform APIs, operational runbooks, and system documentation
- You've already experimented with AI in work or personal projects, and you're excited to dive in and learn fast. You're hungry to responsibly explore new AI tools and workflows, finding ways to make your work smarter and more efficient.

Nice to have

- Experience supporting security-critical platforms or building internal security tooling
- Familiarity with identity, access management, secrets management, or policy enforcement systems
- Experience operating systems at scale in cloud environments (AWS preferred)
- Background in resilience testing, fault injection, or chaos engineering
- A strong comprehension of algorithms and data structures at scale

Tech Stack

Klaviyo's platform is primarily built with Python and React and runs on AWS. Engineers join us from a wide range of technical backgrounds and are supported in learning our stack.

Core technologies include:

- Python / Django / FastAPI
- MySQL / Redis / Memcached
- RabbitMQ / Celery / Apache Kafka / Apache Pulsar
- AWS / Terraform / Kubernetes

Location & Work Model

This role is based in Dublin, Ireland and follows a hybrid working model. Klaviyo supports work authorization and relocation for this position.

At Klaviyo, we enjoy tackling meaningful engineering challenges and value people who take ownership, learn continuously, and collaborate openly. We are committed to building inclusive teams and encourage applications from candidates of all backgrounds.

Klaviyo is growing fast and we have openings for all skill levels across all of our teams. Learn more about our engineering culture at

We use Covey as part of our hiring and / or promotional process. For jobs or candidates in NYC, certain features may qualify it as an AEDT. As part of the evaluation process we provide Covey with job requirements and candidate submitted applications. We began using Covey Scout for Inbound on April 3, 2025.

Please see the independent bias audit report covering our use of Covey here

