ML Network Performance Specialist
The Performance Assured Networking (PAN) organization is responsible for delivering high-performance networks for running ML workloads, using specialized network products and a custom control plane solution to meet the scale, performance, and availability needs of such workloads. This organization owns five interrelated product portfolios:
* The ML network and the network connectivity service it provides to ML servers
* AWS Intent Driven Networking (AIDN), our control plane in which network routing and forwarding behaviors can be programmed across an entire network using highly available APIs
* SIDR (Scalable Intent Driven Routing), our only AIDN actor in production: a fabric routing protocol and network controller system that leverages the prescriptive nature of our networks, allowing topology, prefixes, and policy to be controlled using Intents
* A set of safety systems that ensure changes rolled out to the fabric do not cause customer impact
* AWACS, a set of off-the-box services that enable WCMP-based traffic engineering in existing DC fabrics to increase the effective capacity of the Clos network and provide capacity safety for shared failure domains
This role involves optimizing network performance for ML workloads. The ideal candidate will have expertise in RDMA technologies, such as RoCEv2, EFA, and InfiniBand, as well as a strong understanding of ML training patterns and NCCL internals. They will need to deliver a production-grade telemetry system that provides actionable insights about network performance.
Responsibilities:
1. Design and implement systems that can intelligently measure and baseline performance without direct visibility into customer applications
2. Develop new ways to identify and classify network traffic patterns from ML training
3. Build systems that can automatically tune network configurations based on observed workload characteristics
4. Architect flexible abstractions that allow us to quickly adapt to new ML training patterns while maintaining peak performance for existing workloads
Requirements:
* A Master's degree in Computer Science or Engineering, or equivalent experience
* Excellent IP networking fundamentals and extensive experience in the application of IP protocols
* Expertise with major Internet routing protocols, specifically BGP, OSPF, MPLS, RSVP, and IS-IS
* Expertise with major router platforms, including a deep technical understanding of all internal hardware components and experience with router system design
About Us:
We are committed to creating a diverse and inclusive workplace where everyone can thrive. If you share our passion for innovation and diversity, please visit our website to learn more about our company culture and benefits.