AI Engineer

Waterford

IBM

Posted: 8 February

Offer description

Introduction
At IBM, work is more than a job — it's a calling to build, design, code, and make things better for people around the world.
IBM Infrastructure is seeking an experienced AI Engineer to help bring Large Language Models (LLMs) to IBM Z (System z), one of the most secure and reliable enterprise computing platforms in the world.
This role is intended for professionals with 3+ years of experience in AI/ML systems, performance engineering, or accelerator-based inference who are interested in working close to the hardware and across multiple layers of the technology stack.
You will help enable generative AI for mission-critical workloads used by banks, healthcare providers, and government agencies worldwide.
Who This Role Is For
This role is ideal for engineers who:
Have delivered or supported production AI or ML systems
Enjoy working across hardware, system software, and applications
Are motivated by solving performance- and reliability-critical problems
Want to help define how enterprise-scale AI runs on mission-critical platforms
Your Role And Responsibilities
As an AI Engineer on the IBM Z team, you will contribute directly to the design, integration, and operation of LLM workloads on enterprise infrastructure.
This role is suited to engineers who enjoy solving complex system-level problems and collaborating across hardware and software domains.
LLM Integration and Deployment
Develop and integrate LLM inference workloads on IBM Z using Spyre hardware accelerator cards.
Implement model loading, runtime integration, memory management, and resource allocation strategies optimized for the IBM Z architecture.
Enable both traditional mainframe applications and modern cloud-native services to access LLM capabilities through well-defined APIs.
Performance Profiling and Optimization
Profile LLM inference workloads to measure latency, throughput, memory usage, and power efficiency.
Analyze performance data to identify bottlenecks and optimization opportunities across hardware utilization, kernels, memory access patterns, and batching strategies.
Document findings and contribute to performance best practices and internal guidance.
Failure Analysis and Debugging
Diagnose and resolve inference errors, performance regressions, and system-level issues across firmware, drivers, runtimes, and applications.
Collaborate with hardware engineers, firmware developers, and system architects to identify root causes and implement durable solutions.
Contribute to automated testing and regression detection to improve system reliability.
Observability and Telemetry
Design and implement monitoring and telemetry for production LLM workloads.
Instrument systems and deploy logging to capture model performance, hardware utilization, error rates, and system health.
Create dashboards and alerts to support operational teams with real-time visibility and historical analysis.
Collaboration and Technical Leadership
Participate in architecture reviews and technical discussions across AI, hardware, firmware, and system software teams.
Produce clear technical documentation and share knowledge across the organization.
Stay current with advances in LLMs, hardware acceleration, and inference optimization, and apply learnings to improve IBM Z AI capabilities.
Education and Experience
Preferred Education
Bachelor's Degree
Required Technical And Professional Expertise
Demonstrated professional experience in AI/ML engineering, ML systems, platform engineering, or performance-focused software development.
Strong programming skills in Python and working experience with C/C++.
Solid understanding of machine learning fundamentals, particularly transformer-based models and inference workflows.
Knowledge of computer architecture, including memory hierarchies, parallel processing, and I/O systems.
Experience working in Linux environments, using command-line tools and scripting.
Hands-on experience with profiling, performance analysis, and debugging of complex systems.
Familiarity with monitoring, logging, and observability concepts.
Strong problem-solving skills and the ability to communicate technical concepts clearly.
Preferred Technical And Professional Experience
Experience with PyTorch, TensorFlow, or Hugging Face Transformers.
Exposure to hardware acceleration technologies such as GPUs or AI accelerators.
Familiarity with model optimization techniques (quantization, pruning, knowledge distillation).
Knowledge of inference frameworks such as ONNX Runtime, TensorRT, or TorchServe.
Experience with observability platforms including Prometheus, Grafana, ELK, or Splunk.
Understanding of distributed tracing (OpenTelemetry, Jaeger).
Working knowledge of Docker, Kubernetes, and CI/CD pipelines.
Exposure to IBM Z, z/OS, or enterprise computing environments (beneficial but not required).
Experience working in environments with high requirements for reliability, security, and performance.