Role Overview:
We are seeking a highly experienced Principal Platform Engineer to design, build, and operate secure, scalable, and highly reliable cloud platforms. This role sits at the intersection of platform engineering, site reliability engineering (SRE), and infrastructure security, supporting mission‑critical distributed systems including financial services and blockchain‑based platforms. You will lead the development of resilient multi‑cloud infrastructure, drive reliability and observability standards, and enable engineering teams through self‑service platforms, automation, and GitOps‑based delivery models.

Key Responsibilities:

Platform Engineering & Architecture
- Design and operate large‑scale, multi‑region infrastructure across AWS, GCP, and Azure
- Build and evolve Kubernetes platforms (EKS, AKS, GKE) for high‑availability production workloads
- Define platform standards, golden paths, and reusable infrastructure patterns
- Architect secure environments, including confidential computing and enclave‑based systems
- Perform deep troubleshooting across Linux kernel, networking stack, storage, and system performance layers
- Optimize systems for low‑latency and high‑throughput workloads (CPU pinning, NUMA awareness, IRQ tuning, disk I/O optimization)
- Diagnose and resolve complex production issues using system‑level tools (e.g., perf, eBPF, strace, tcpdump)
- Tune OS‑level parameters for containerized and distributed environments

Reliability Engineering (SRE)
- Define and implement SLOs/SLIs and drive reliability improvements across services
- Lead incident response, post‑incident reviews, and systemic resilience improvements
- Improve MTTR through observability, automation, and operational excellence practices
- Conduct failure‑mode analysis, chaos testing, and capacity planning

Infrastructure as Code & Delivery
- Build fully automated infrastructure using Terraform, Terragrunt, and related tooling
- Implement GitOps workflows using tools like Argo CD
- Develop secure CI/CD pipelines with policy enforcement, provenance, and gated releases
- Enable zero‑touch deployments and self‑service developer platforms

Observability & Monitoring
- Define and implement observability strategies across metrics, logs, and traces
- Work with tools such as Datadog, Prometheus, and OpenTelemetry
- Improve alert quality, reduce noise, and build actionable runbooks
- Drive adoption of distributed tracing and end‑to‑end visibility

Required Skills & Experience
- 10+ years in Platform Engineering, SRE, DevOps, or Linux Systems Engineering roles
- Deep expertise in Kubernetes (EKS, AKS, GKE) and cloud‑native architectures
- Strong Linux systems knowledge, including kernel behavior, networking, and performance tuning
- Proven experience in multi‑cloud environments (AWS, GCP, Azure)
- Proven track record operating production systems with high availability (99.9%+)
- Hands‑on experience with Infrastructure as Code (Terraform, Terragrunt)
- Strong understanding of observability, monitoring, and incident response
- Experience implementing GitOps and modern CI/CD pipelines
- Programming/scripting experience (Go, Python, or Bash)