Senior SRE, Site Reliability Engineer

Klaviyo Inc.

Site reliability engineer

Posted: 6 February

Offer description

Senior Site Reliability Engineer – Site Reliability Engineering (Dublin)
Team Overview
As a Senior Site Reliability Engineer, you'll ensure Klaviyo's critical platforms are reliable, scalable, and sustainable while enabling rapid product development.
We treat reliability as a core product feature and use software engineering to solve complex systems and operational challenges.
Our work spans security, infrastructure, and software development, requiring us to understand both systems and software engineering.
We build complex, foundational solutions that must be extremely reliable, secure, and performant at global scale.
Our charter is to build and operate foundational services and infrastructure, define clear reliability objectives, reduce operational toil through automation, and continuously improve systems based on real production learnings.
The work is highly visible and directly impacts how Klaviyo's engineers build software and how customers experience Klaviyo every day.
How you'll make an impact
As a Senior Site Reliability Engineer, you will build and operate the platforms, systems, and services that underpin Klaviyo's reliability and operational excellence.
You will:
Build and operate foundational, security‑critical services with a strong emphasis on availability, scalability, and fault tolerance
Apply software engineering principles to automate infrastructure, reduce operational toil, and improve system reliability at scale
Design, implement, and evolve systems using SRE best practices
Define and refine SLIs, SLOs, and error budgets to guide engineering decisions
Improve observability, alerting, and incident response to reduce mean time to detection and recovery
Participate in on‑call rotations with a focus on sustainable operations and automatic remediations
Perform quantitative analysis to understand system behavior, capacity constraints, and scaling limits
Identify systemic risks and reliability bottlenecks and drive long‑term, preventative solutions
Collaborate closely with product, platform, and security engineers to influence architecture early and ship reliable systems
Mentor and pair with other engineers, helping raise the bar for reliability, operational maturity, and engineering excellence
Who you are
You are a cloud‑native, platform‑focused SRE who uses software to build and operate reliable production systems at scale.
You write and maintain production‑quality code (e.g. Python, Go, or similar) to build internal platforms, automate operations, and improve system reliability
You have built, deployed, and operated distributed, cloud‑native systems and understand failure modes such as partial outages, dependency failures, resource saturation, and cascading impact
You have experience operating containerized workloads and platforms (e.g. Kubernetes) in production, including deployment strategies, scaling behavior, and service networking
You are comfortable participating in on‑call rotations and diagnosing production issues
You have designed and operated observability systems and know how to build actionable alerts that reflect real user and service impact
You apply SRE concepts such as SLIs, SLOs, error budgets, and burn‑rate–based alerting to guide engineering decisions and operational response
You have hands‑on experience with infrastructure as code and declarative configuration (e.g. Terraform, Kubernetes manifests, policy‑as‑code)
You have performed capacity planning, load testing, and performance analysis for distributed services and platforms
You routinely contribute to post‑incident reviews and drive concrete, code‑focused follow‑up actions that prevent recurrence
You are comfortable reviewing and contributing to technical designs, platform APIs, operational runbooks, and system documentation
You've already experimented with AI in work or personal projects, and you're excited to dive in and learn fast.
You're hungry to responsibly explore new AI tools and workflows, finding ways to make your work smarter and more efficient.
Nice to have
Experience supporting security‑critical platforms or building internal security tooling
Familiarity with identity, access management, secrets management, or policy enforcement systems
Experience operating systems at scale in cloud environments (AWS preferred)
Background in resilience testing, fault injection, or chaos engineering
A strong comprehension of algorithms and data structures at scale
Tech Stack
Klaviyo's platform is primarily built with Python and React and runs on AWS.
Engineers join us from a wide range of technical backgrounds and are supported in learning our stack.
Core technologies include:
Python / Django / FastAPI
MySQL / Redis / Memcached
RabbitMQ / Celery / Apache Kafka / Apache Pulsar
AWS / Terraform / Kubernetes
Location & Work Model
This role is based in Dublin, Ireland and follows a hybrid working model.
Klaviyo supports work authorization and relocation for this position.
At Klaviyo, we enjoy tackling meaningful engineering challenges and value people who take ownership, learn continuously, and collaborate openly.
We are committed to building inclusive teams and encourage applications from candidates of all backgrounds.
Klaviyo is growing fast and we have openings for all skill levels across all of our teams.
Learn more about our engineering culture at https://klaviyo.tech
Please see the independent bias audit report covering our use of Covey here.

