The Performance Assured Networking (PAN) organization delivers high-performance networks for running ML workloads, using specialized network products and a custom control-plane solution to meet the scale, performance, and availability needs of those workloads. The organization owns five inter-related product portfolios. First is the ML network and the network connectivity service it provides to ML servers. Second is AWS Intent Driven Networking (AIDN), our control plane in which network routing and forwarding behaviors, called Intents, can be programmed across an entire network using highly available APIs; AIDN uses closed-loop actors to program network devices and keep the network in sync with the specified Intent. Third is SIDR (Scalable Intent Driven Routing), our only AIDN actor in production: a fabric routing protocol and network controller system that leverages the prescriptive nature of our networks, allowing topology, prefixes, and policy to be controlled using Intents. SIDR harnesses a multi-phase commit (MPC) mechanism with built-in rollback to distribute and atomically enable administrative changes across a single fabric; it also responds rapidly to network events within the fabric, minimizing customer impact. Fourth is a set of safety systems that assures that changes being rolled out to the fabric will not cause customer impact. Fifth is AWACS, a set of off-the-box services that enables WCMP-based traffic engineering in existing DC fabrics to increase the effective capacity of the Clos network and provide capacity safety for shared failure domains. All of these products and services are operational, each at a different stage of expansion and new-capability development.
Key job responsibilities
This Principal Engineer will take ownership of ML network performance dependent on the EC2 interface, a critical capability that directly impacts our customers' ability to train and deploy ML models efficiently. In the immediate term, they'll tackle one of our most pressing challenges: building a comprehensive understanding of network performance for ML workloads in production. This means designing and implementing systems that can intelligently measure and baseline performance without direct visibility into customer applications.
Over the next 12-18 months, they'll need to transform how we approach ML networking. This starts with developing new ways to identify and classify network traffic patterns from ML training, building systems that can automatically tune network configurations based on observed workload characteristics. They'll architect flexible abstractions that allow us to quickly adapt to new ML training patterns while maintaining peak performance for existing workloads.
The role requires someone who can move from theoretical understanding to practical implementation. They'll need to deliver a production-grade telemetry system that provides actionable insights about network performance, develop new approaches to baseline measurements, and demonstrate concrete performance improvements for key ML workloads. Success in this role means not just solving today's performance challenges, but building systems flexible enough to handle tomorrow's ML innovations.
This PE will be the technical authority for ML networking performance at AWS, working across teams to drive adoption of their approaches and establishing best practices that will shape how we build and operate our ML infrastructure for years to come.
· A Master's degree in Computer Science or Engineering, or equivalent experience, is mandatory.
· Excellent IP networking fundamentals and extensive experience in the application of IP protocols.
· Expertise with major internet routing protocols, specifically BGP, OSPF, MPLS, RSVP, and IS-IS.
· Expertise with major router platforms, specifically a deep technical understanding of all internal hardware components and experience with router system design.
· Expert-level network analysis fundamentals and robust troubleshooting skills, specifically in network performance analysis.
· Ability to lead teams of engineers to deliver large scale solutions.
· Excellent written and verbal communication skills, and an ability to interact effectively with peers and customers, are required.
• Deep expertise in RDMA technologies (RoCEv2, EFA, InfiniBand)
• Strong understanding of ML training patterns and NCCL internals
• Experience with large-scale performance measurement systems
• Knowledge of ML frameworks and their distributed training implementations
• Expertise in network protocol design and optimization
Amazon is an equal opportunities employer. We believe passionately that employing a diverse workforce is central to our success. We make recruiting decisions based on your experience and skills. We value your passion to discover, invent, simplify and build. Protecting your privacy and the security of your data is a longstanding top priority for Amazon. Please consult our Privacy Notice (https://www.amazon.jobs/en/privacy_page) to know more about how we collect, use and transfer the personal data of our candidates.
Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.
Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit https://amazon.jobs/content/en/how-we-hire/accommodations for more information. If the country/region you're applying in isn't listed, please contact your Recruiting Partner.