My client is building next-generation AI and high-performance computing platforms that power advanced machine learning, data science, and large-scale compute workloads. They operate high-density GPU clusters and are looking for a Principal Linux Engineer to lead the design, optimization, and reliability of their GPU-based infrastructure.
As a Principal Linux Engineer specializing in GPU systems, you will architect, deploy, and operate high-performance Linux environments optimized for GPU workloads, including AI/ML training, inference, simulation, and data processing. You will work closely with ML engineers, platform teams, and DevOps to ensure performance, scalability, and reliability across the compute infrastructure. This is a hands‑on technical leadership role requiring deep Linux expertise and strong experience managing GPU-based systems at scale.
Key Responsibilities
Architect and maintain enterprise‑grade Linux systems (RHEL, Rocky, Ubuntu, or equivalent)
Tune the kernel and optimize performance for HPC and GPU workloads
Develop automation for provisioning and lifecycle management
Troubleshoot complex OS‑level, hardware, and performance issues
GPU Infrastructure & Performance
Deploy and manage NVIDIA GPU infrastructure (A100, H100, or equivalent)
Install, configure, and maintain NVIDIA drivers, CUDA, NCCL, and related libraries
Optimize multi‑GPU and multi‑node performance
Monitor GPU utilization, thermals, and power efficiency
Diagnose PCIe, NVLink, NUMA, and memory bottlenecks
Manage large‑scale compute clusters (on‑prem or cloud)
Integrate GPUs into Kubernetes environments (GPU operator, device plugins)
Automation & Infrastructure as Code
Build infrastructure using Terraform, Ansible, or similar
Develop CI/CD workflows for system configuration
Automate GPU fleet provisioning and configuration management
Reliability & Observability
Establish SLOs and capacity planning models
Lead incident response for infrastructure outages
Conduct root cause analysis and implement preventive measures
Security & Compliance
Harden Linux systems using security best practices
Implement access controls, patch management, and vulnerability remediation
Support SOC 2 / ISO 27001 / FedRAMP initiatives (if applicable)
Required Qualifications
7+ years of Linux systems engineering experience
3+ years managing GPU infrastructure in production environments
Deep knowledge of:
Linux internals (kernel, memory management, networking stack)
NVIDIA driver stack, CUDA, and GPU troubleshooting
High‑performance storage (NVMe, parallel file systems)
Networking (10/25/40/100GbE, InfiniBand preferred)
Experience with:
Kubernetes with GPU workloads
Infrastructure as Code (Terraform, Ansible)
Python or Bash scripting
Strong debugging and performance analysis skills
Experience operating in large‑scale production environments