Senior Infrastructure Engineer (GPU Cloud)

Dublin

Uniting Holding

Infrastructure engineer

Posted: 23 May

Offer description

Dublin, Ireland | Full Time | Senior

Overview

Tensorix is a sovereign AI infrastructure platform headquartered in Dublin. We deploy and operate open-source large language models on EU-sovereign infrastructure across Europe, providing private, zero-retention inference for regulated industries including finance, healthcare and government. Our platform offers drop-in OpenAI-compatible APIs, enabling developers and enterprises alike to adopt AI without compromising on data privacy, compliance or performance.

We are looking for a Senior Infrastructure Engineer (GPU Cloud) to join our growing engineering team. Reporting to the CTO, you will own the physical and virtualisation layer that underpins our GPU fleet, from bare-metal server deployment through to multi-tenant GPU-as-a-Service delivery. You will design, build and operate the compute, storage and network infrastructure that our ML serving layer runs on.

We currently operate Dell PowerEdge XE9780 servers with NVIDIA B300 SXM6 GPUs across our European estate, with an aggressive near-term growth plan spanning multiple sites across the EU. You will be the technical owner of our hardware strategy, cluster architecture and infrastructure automation as we build out a sovereign GPU cloud platform.

This is a deeply hands-on role. You will oversee server deployments, configure firmware, debug PCIe bus issues, design network fabrics, architect storage and build the orchestration layer that ties it all together. We are an AI-native team and tools such as Claude Code and Codex are part of our daily workflow, materially accelerating how we build and operate systems. We value engineers who combine deep systems knowledge with a pragmatic, builder mindset.

This is a high-impact senior individual contributor role spanning bare-metal infrastructure, GPU cluster architecture and multi-site estate planning.

Responsibilities

- Bare-Metal & Firmware Management: Lead the deployment and maintenance of GPU server fleets including BIOS configuration, iDRAC/BMC management, firmware updates (GPU baseboard, FPGA, CPLD, PCIe switches), kernel parameter tuning and driver stack management for NVIDIA B-series GPUs.
- Hypervisor & Virtualisation: Design and operate our virtualisation layer using Proxmox VE or OpenStack, including GPU passthrough (VFIO-PCI), NVSwitch fabric management through host-level services and multi-tenant GPU allocation.
- Network Architecture: Design and maintain the network fabric across multiple racks and sites, spanning management VLANs, storage networks, tenant data planes and GPU interconnect. Work with bonded NICs, jumbo frames, InfiniBand and ConnectX adapters. Plan the network topology for multi-site EU deployments.
- Storage Architecture: Design, deploy and operate shared storage infrastructure across multiple racks and sites, including SAN (Dell ME-series, iSCSI multipath), NFS and local NVMe. Optimise for large model weights (hundreds of GB per model), high-throughput sequential reads and cross-site replication. Own SAN performance tuning, capacity planning and data placement strategy as the estate grows.
- GPU-as-a-Service Platform: Build the infrastructure layer for multi-tenant GPU delivery, including tenant isolation, resource scheduling, capacity planning and usage metering. Design the platform so customers can consume GPU resources via API without touching the underlying hardware.
- Cluster Orchestration & Automation: Automate server provisioning, OS deployment, driver installation and cluster configuration. Build infrastructure-as-code for repeatable, auditable deployments across multiple sites.
- Monitoring & Reliability: Instrument the infrastructure stack with monitoring covering GPU health, NVSwitch fabric status, storage throughput, network utilisation and hardware telemetry (DCGM, iDRAC, IPMI). Own incident response for hardware and infrastructure faults.
- Hardware Strategy & Estate Planning: Work with the CTO to plan GPU procurement cycles, evaluate server platforms, specify network and storage hardware and manage vendor relationships. Design the infrastructure blueprint for new EU datacentre deployments, defining standard rack layouts, power and cooling requirements, network topology and storage architecture that can be replicated across sites with minimal variance.
- Security & Compliance: Ensure infrastructure meets the requirements of regulated industries including data residency, tenant isolation, encryption at rest and in transit and audit logging. Support EU sovereignty requirements across our deployment sites.

Skills & Experience

- 5+ years of professional experience in infrastructure engineering, systems administration or datacentre operations, with a meaningful portion involving GPU or HPC infrastructure
- Hands-on experience with bare-metal Linux server deployment and management, including kernel tuning, driver management, PCI device configuration and UEFI/BIOS configuration
- Strong working knowledge of NVIDIA GPU server platforms, including driver installation, NVLink/NVSwitch fabric, Fabric Manager, DCGM and GPU passthrough via VFIO
- Experience with virtualisation platforms, ideally Proxmox VE or OpenStack, including PCI passthrough for GPU workloads
- Solid understanding of network design including VLANs, bonding/LACP, jumbo frames, InfiniBand and routing in multi-rack environments
- Experience with enterprise storage including SAN (iSCSI, FC), NFS, multipath I/O and performance tuning for large sequential workloads
- Proficiency with Linux (Ubuntu Server and/or Debian), systemd, networking stack (ip, nmcli, netplan) and shell scripting
- Experience with infrastructure-as-code and automation tooling (Ansible, Terraform or similar)
- Comfortable using AI-assisted development tools (e.g. Claude Code, Codex) as part of your daily workflow
- Methodical approach to troubleshooting with the ability to work across firmware, kernel, driver and userspace layers to diagnose complex hardware issues

Nice to Have

- Experience building or operating GPU cloud / GPU-as-a-Service platforms
- Familiarity with Dell PowerEdge server management (iDRAC, Redfish API, racadm, Dell SupportAssist)
- Experience with NVIDIA ConnectX network adapters and OFED/MOFED stack
- Exposure to Kubernetes with GPU scheduling (NVIDIA GPU Operator, device plugins, MIG)
- Experience with MAAS, Ironic or other bare-metal provisioning systems
- Knowledge of InfiniBand fabric management for multi-node GPU training clusters
- Familiarity with European data sovereignty and compliance frameworks (GDPR, DORA, NIS2)
- Contributions to open-source infrastructure projects

Education & Qualifications

- BSc/MSc in Computer Science, Software Engineering, Electrical Engineering, Network Engineering or a related technical discipline OR equivalent practical experience

Remuneration

- Highly competitive package, dependent on experience
- 25 days paid annual leave
- Hybrid working from our centrally located Dublin office, with remote flexibility
- Occasional travel to our EU datacentre sites as the estate grows
- Free inference tokens!
