Tensorix is a sovereign AI infrastructure platform headquartered in Dublin.
We deploy and operate open-source large language models on EU-sovereign infrastructure across Europe, providing private, zero-retention inference for regulated industries including finance, healthcare and government.
Our platform offers drop-in OpenAI-compatible APIs, enabling developers and enterprises alike to adopt AI without compromising on data privacy, compliance or performance.
We are looking for a Senior ML Systems Engineer (Inference) to join our growing engineering team.
Reporting to the CTO, you will act as the technical owner of the model serving layer at the heart of our platform, from selecting and evaluating new open-source models to deploying, tuning and operating them in production on our on-prem GPU fleet.
This is a deeply hands-on role in a fast-moving scaleup where your work will directly shape the performance, cost and reliability of every token we serve.
You will work primarily with modern inference frameworks such as vLLM, SGLang and TensorRT-LLM, running on NVIDIA hardware across our on-prem estate, with supporting workloads on AWS.
You will benchmark frontier open-weight models as they release, quantify performance and cost trade-offs, and lead the technical side of our GPU procurement and capacity planning.
We are an AI-native team: tools such as Claude Code and Codex are part of our daily workflow and materially accelerate how we build and operate systems.
We value engineers who combine deep systems intuition with a pragmatic, research-aware mindset.
This is a high-impact senior individual contributor role spanning model serving, performance engineering and GPU infrastructure strategy.
Responsibilities
Model Deployment & Serving
- Deploy and operate open-source large language models in production using vLLM, SGLang, TensorRT-LLM and other high-performance serving frameworks.
Own the full lifecycle from model selection through to production rollout.
Performance Optimisation
- Profile and tune inference workloads for latency, throughput, memory efficiency and GPU utilisation.
Work across quantisation, batching strategies, KV cache management, tensor and pipeline parallelism and attention kernel selection.
Model Evaluation
- Benchmark new open-weight models as they release, running performance, quality and cost evaluations to inform which models we productise.
Maintain internal benchmarking tooling and an evidence-based view of the model landscape.
Hardware Planning & Procurement
- Lead capacity planning, GPU procurement and future hardware roadmap decisions.
Translate workload requirements into concrete hardware specifications, including GPU, interconnect, networking and storage.
Infrastructure & Operations
- Build and maintain the infrastructure that runs our model fleet, spanning containerised GPU workloads, orchestration, observability and autoscaling.
Operate across on-prem GPU clusters and AWS where appropriate.
Reliability & Observability
- Instrument the serving stack with meaningful metrics covering tokens per second, time-to-first-token, tail latency, GPU utilisation and cost per token.
Drive incident response and post-incident improvements.
Research & Experimentation
- Track developments in inference optimisation, serving architectures and model efficiency.
Prototype new techniques such as speculative decoding, prefix caching, disaggregated prefill/decode and emerging quantisation methods.
Collaboration & Knowledge Sharing
- Partner closely with platform, product and customer-facing teams.
Participate in architectural decisions, share findings and help raise the collective bar across the engineering team.
Skills & Experience
5+ years of professional experience (or equivalent depth of expertise) in ML infrastructure, systems engineering or a closely related discipline, with a meaningful portion focused on production ML workloads
Hands-on experience deploying and tuning large language models with modern high-performance inference frameworks such as vLLM, SGLang and TensorRT-LLM
Strong working knowledge of GPU architecture, CUDA fundamentals and the performance characteristics of modern NVIDIA hardware (H100, H200 and B300-class GPUs)
Practical experience with inference optimisation techniques including quantisation (AWQ, GPTQ, FP8), continuous batching, KV cache strategies and tensor/pipeline parallelism
Proficiency in Python and comfort reading and contributing to systems-level code in the broader inference ecosystem
Solid experience with Linux, containerisation and orchestration of GPU workloads
Familiarity with benchmarking methodology and the ability to design experiments that produce defensible, reproducible results
Comfortable using AI-assisted development tools (e.g. Claude Code, Codex) as part of your daily workflow
A clear and concise communicator who thrives in ambiguity and can articulate technical decisions to both technical and non-technical audiences
Nice to Have
Experience with Kubernetes and GPU scheduling in multi-tenant environments
Exposure to distributed training or fine-tuning workflows, even if your primary focus is inference
Experience with AWS infrastructure and related services (e.g. EC2, ECS, EKS, S3)
Familiarity with alternative accelerators (AMD Instinct, Intel Gaudi) or emerging inference hardware
Contributions to open-source inference projects such as vLLM, SGLang or related tooling
Exposure to Golang or Rust for systems-level work
Education & Qualifications
BSc/MSc in Computer Science, Software Engineering, Electrical Engineering or a related technical discipline, or equivalent practical experience
Remuneration
Highly competitive package, dependent on experience
25 days paid annual leave
Hybrid working from our centrally located Dublin office, with remote flexibility
Free inference tokens!