Overview
As a Lead Site Reliability Engineer, you’ll be at the forefront of building scalable, resilient, and observable systems that power Tricentis SaaS products globally. This is a hands-on engineering leadership role that balances technical delivery, process ownership, and team mentorship.
You will drive initiatives across multiple products, shape SRE standards, and serve as a trusted partner to both engineering and product leaders. You will be responsible for elevating engineering quality and reliability while enabling scale and speed.
Responsibilities
* Lead and deliver cross-cutting initiatives to improve platform scalability, resilience, and cost efficiency.
* Architect and implement cloud-native infrastructure that supports multi-region, multi-tenant deployments.
* Improve the observability strategy across systems and teams, including SLOs, error budgets, and alerting standards.
* Coach and mentor engineers, guiding technical design reviews and promoting engineering excellence.
* Own post-incident analysis and ensure learning loops are completed with preventive action.
* Influence product reliability from early-stage design to production readiness reviews.
* Establish and evolve standards for deployments, operational readiness, and incident response.
* Serve as a technical advisor for engineering and product managers across the org.
* Drive architectural discussions and make decisions that influence the SRE org and wider engineering teams.
* Define and evolve technical roadmaps and execution plans aligned with company goals.
* Partner with peers in security, infrastructure, and product to drive platform-wide improvements.
* Lead incident response for high-impact outages and continuously reduce incident recurrence.
* Contribute to SRE hiring through interviews, onboarding, and process refinement.
* Guide the adoption of modern tooling and practices across teams (e.g., GitOps, self-service platforms, chaos engineering).
* Represent SRE in leadership forums, bringing insights, trade-offs, and forward-looking strategies.
Our Tech Stack
Azure, AWS, Terraform, GitHub Actions, Kubernetes, Datadog, Prometheus, Grafana, Better Stack, incident.io, Jira, and more
Our Culture
We don't just preach our values; we embody them in everything we do. We are committed to creating an environment that empowers, supports, and includes every individual, and where trust, transparency, creativity, curiosity, and continuous improvement thrive every day.
About You
* 6+ years of experience in SRE, Infrastructure, or DevOps roles, including technical leadership.
* Expertise in building and operating production systems in public cloud (AWS or Azure).
* Deep understanding of observability principles (SLOs, SLIs, metrics, traces, logs).
* Strong experience with infrastructure-as-code, container orchestration, and CI/CD (Terraform, K8s, GitHub Actions).
* Proven track record in leading technical projects, influencing architecture, and mentoring engineers.
* Excellent communication and cross-functional collaboration skills.
* Proactive, ownership-driven mindset with a passion for reliability and continuous improvement.
Seniority level
* Mid-Senior level
Employment type
* Full-time
Job function
* Engineering and Information Technology
Industries
* Software Development