Job Description:
We're seeking an experienced professional to design and implement high-performance storage systems for GPU clusters that handle large checkpoints efficiently and support low-latency context preemption and reloads.
The ideal candidate will have a strong background in HPC systems, C/C++ programming, and Linux kernel and OS internals. Experience with Lustre, parallel filesystems, and container orchestration tools like Kubernetes is highly desirable.
Requirements:
* Master's or PhD in Electrical Engineering (EE) or Computer Science (CS)
* Over 5 years of experience building High-Performance Computing (HPC) systems
* Strong understanding of RDMA and RoCE v2 protocols
* Hands-on experience with GPUs and AI workflows
* Familiarity with AI/ML Python frameworks (TensorFlow, PyTorch)
* C/C++ programming skills for performance-critical components and integration tasks
* Knowledge of parallel filesystems, with a strong preference for experience in Lustre or similar distributed filesystems
* Kubernetes experience for container orchestration and management at scale
What We Offer:
* A dynamic work environment that encourages innovation and collaboration
* Opportunities for professional growth and development
* A competitive compensation package and benefits
Please note that all qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, sex, sexual orientation, gender identity, veteran status, or disability.