Overview
We are seeking a highly technical Senior Platform Engineer with deep expertise in Linux engineering, OpenStack development, Kubernetes, and GPU‑enabled infrastructure to design, build, and operate SIG’s next‑generation infrastructure platforms supporting trading and core technology environments.
This is a hands‑on engineering role focused on building and tuning scalable, resilient, and high‑performance infrastructure systems across CPU and GPU workloads. The ideal candidate will have strong Linux internals knowledge, experience developing and operating cloud‑native platforms, and a deep understanding of distributed systems architecture, including the efficient provisioning, isolation, and performance tuning of accelerator‑based compute resources.
What We're Looking For
Linux Systems Engineering
Deep troubleshooting across kernel, networking stack, storage, and performance layers.
Performance tuning for low‑latency systems (CPU pinning, NUMA, IRQ balancing, kernel tuning).
Develop automation using Python, Go, or similar languages.
Build and maintain infrastructure tooling and internal platform services.
Implement high‑availability solutions and disaster recovery strategies.
Perform root cause analysis for production incidents affecting distributed systems.
GPU Infrastructure Engineering
Design, deploy, and operate GPU‑enabled infrastructure.
Optimize GPU utilization (memory bandwidth, PCIe throughput, Multi‑Process Service, MIG partitioning where applicable).
Tune workloads to efficiently leverage NVIDIA GPUs (or equivalent accelerators) for compute‑intensive applications.
Troubleshoot GPU driver, CUDA, kernel module, and firmware‑related issues in production environments.
OpenStack Development & Cloud Infrastructure
Develop and extend OpenStack services (Nova, Neutron, Cinder, Keystone, etc.).
Build custom integrations and automation around OpenStack APIs.
Optimize compute, networking, and storage performance for high‑performance workloads.
Design multi‑tenant OpenStack architectures with strong isolation and security.
Contribute to infrastructure‑as‑code frameworks managing OpenStack environments.
Debug and resolve deep issues across hypervisors (KVM), networking layers, and control plane services.
Integrate OpenStack environments with Kubernetes platforms (hybrid cloud architectures).
Kubernetes Platform Engineering
Design, build, and operate highly available, production‑grade Kubernetes clusters.
Develop and maintain Kubernetes operators, controllers, and custom resource definitions (CRDs).
Implement advanced scheduling, multi‑tenancy, and workload isolation strategies.
Optimize cluster performance for low‑latency and high‑throughput workloads.
Integrate Kubernetes with CI/CD pipelines and GitOps workflows.
Implement cluster observability using Prometheus, Grafana, OpenTelemetry, etc.
Design and enforce network policies (CNI) and ingress architecture.
Implement secure cluster design including RBAC, OPA/Gatekeeper, secrets management, and runtime security.
Automation & Infrastructure as Code
Design and maintain infrastructure using Terraform, Ansible, Helm, or similar tools.
Build CI/CD pipelines for infrastructure and platform deployments.
Implement immutable infrastructure and GitOps methodologies.
Create automated validation, testing, and deployment frameworks for platform services.
Required Technical Skills
Advanced Linux systems knowledge (kernel, networking, storage)
Experience deploying and operating GPU‑enabled Linux servers
Understanding of CUDA drivers and GPU kernel modules
Performance profiling and tuning of compute‑intensive workloads
Hands‑on OpenStack development and operations experience
Strong experience administering and engineering production Kubernetes clusters
Strong understanding of distributed systems principles:
Consensus
Replication
Fault tolerance
CAP theorem tradeoffs
Experience with:
Python or similar programming languages
Infrastructure as Code (Terraform, Ansible)
Container runtimes (containerd, CRI‑O)
Observability stacks (Prometheus, Grafana, ELK)
Desirable Experience
Experience in low‑latency or high‑performance trading environments
High‑performance networking (DPDK, SR‑IOV, CNI tuning)
Storage systems (Ceph, distributed storage, NVMe optimization)
Contribution to open‑source projects (Kubernetes, OpenStack)
Experience designing multi‑region or hybrid cloud architectures
Experience tuning AI/ML, quantitative, or high‑performance compute workloads on GPUs
Experience with NVIDIA DCGM, MIG (Multi‑Instance GPU), or vGPU configurations
Familiarity with RDMA, GPUDirect, or high‑throughput interconnects
Experience optimizing containerized ML or compute pipelines
Key Attributes
Strong systems thinking and deep technical curiosity
Ability to diagnose complex cross‑layer failures
Passion for building reliable, scalable distributed systems
Comfortable operating in high‑availability, high‑performance production environments
Strong documentation and knowledge‑sharing mindset