
Intel Works with AI Ecosystem to Accelerate ML Training Using Intelligent Ethernet Switches

Amedeo_Sapio
Employee

Today, AI is everywhere. To reach ever-higher accuracy in real-world applications, machine learning (ML) models are growing larger and are trained on huge datasets, so distributed training has become the norm. As deep learning grows in scale, distributed training is becoming communication-bound. Accelerating ML training with more efficient network communication, such as by using intelligent Ethernet switches, is therefore especially important for reducing training time and cost.

In a recent article, Meta (formerly Facebook) and Amazon Web Services (AWS) demonstrated that simply adding more GPUs to an ML training cluster does not scale throughput linearly. In fact, adding GPUs degrades per-GPU throughput because of increased network communication overhead. In tests run by Meta and AWS, this communication overhead limited the maximum per-GPU throughput in a 128-GPU system to only 51% of the theoretical peak per-GPU performance, and increasing the number of GPUs reduced per-GPU performance further. What’s more, the same article states that with 512 GPUs, communication overhead increased training time by up to 25.7 days and inflated the total training cost by up to USD 1.3 million (43% more in both time and dollars).[1]
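
To make these figures concrete, the short sketch below reruns the arithmetic behind the quoted numbers. The 51% and 43% figures come from the article; the baseline per-GPU throughput is a hypothetical placeholder, not a value from the article.

```python
# Back-of-the-envelope check of the scaling figures quoted above.
# ideal_per_gpu is a hypothetical placeholder; the 51% and 43% figures
# are from the Meta/AWS article.

ideal_per_gpu = 100.0    # hypothetical per-GPU throughput (samples/s) under linear scaling
efficiency_128 = 0.51    # measured: 51% of theoretical peak per-GPU at 128 GPUs

ideal_aggregate = 128 * ideal_per_gpu
actual_aggregate = ideal_aggregate * efficiency_128
print(f"128 GPUs: {actual_aggregate:.0f} of {ideal_aggregate:.0f} ideal samples/s "
      f"({efficiency_128:.0%} scaling efficiency)")

# Training time scales inversely with throughput, so 43% more wall-clock time
# implies the 512-GPU run retained roughly 1 / 1.43, or about 70%, of its
# ideal throughput.
time_overhead = 0.43
retained = 1 / (1 + time_overhead)
print(f"43% longer training implies ~{retained:.0%} of ideal throughput retained")
```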

As this benchmarking work illustrates, making communication more efficient is an important link in the overall AI acceleration chain. Intel is working with Microsoft, the King Abdullah University of Science and Technology (KAUST), and the University of Washington to develop SwitchML, an intelligent network technology that accelerates ML training; it is an open-source project recently launched with P4.org.

Unlike other networking solutions that require specialized and proprietary protocols such as InfiniBand, SwitchML can be used on industry-standard Ethernet infrastructure, which offers several advantages:

  • Interoperability with existing network components
  • Freedom of choice for network component suppliers
  • Ease of adoption

SwitchML works with popular AI frameworks like TensorFlow and PyTorch, with more ecosystem integrations to come.
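
To give a sense of what that looks like from the framework side, below is a minimal, standard PyTorch data-parallel training loop. The integration details here are assumptions, not something this post specifies: the sketch presumes SwitchML sits underneath the collective backend and carries the gradient all-reduce, so the training script itself needs no SwitchML-specific code.

```python
# A minimal PyTorch distributed data-parallel sketch (launch with torchrun).
# Assumption: SwitchML replaces the transport underneath the collective
# backend, so the all-reduce in backward() is accelerated transparently.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="gloo")  # backend choice depends on the deployment

model = DDP(torch.nn.Linear(1024, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for _ in range(10):
    x = torch.randn(32, 1024)
    loss = model(x).sum()
    opt.zero_grad()
    loss.backward()  # gradients are all-reduced across workers here
    opt.step()

dist.destroy_process_group()
```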

SwitchML technology uses an intelligent, P4-programmable, Ethernet-based switching data plane to perform in-network aggregation, reducing the volume of data transferred during the network synchronization phases of training. This, in turn, reduces overall latency for the many-to-many communication patterns of large networks. Offloading aggregation to switches with SwitchML offers the following benefits (a conceptual sketch of the idea follows the list):

  • Reduce the volume of traffic on the data center fabric by up to 2.27x[2] on a 100 Gbps network (see figure below)
  • Reduce the amount of work performed by worker nodes
  • Perform aggregation at sub-round-trip-time (RTT) latency
  • Improve overall training time
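
As a conceptual sketch only (not the SwitchML protocol or its packet format), the toy code below models the core idea of in-network aggregation: each worker sends its gradient chunk to the switch once, the switch sums the chunks element-wise, and every worker receives a single aggregated chunk back, instead of exchanging chunks with every peer as in host-based all-reduce.

```python
# Toy model of in-network aggregation; chunk size and data types are
# illustrative, not the SwitchML wire format.
from typing import List

def switch_aggregate(chunks: List[List[float]]) -> List[float]:
    """Element-wise sum across workers' chunks, as a switch register array would compute."""
    return [sum(vals) for vals in zip(*chunks)]

worker_chunks = [
    [0.1, 0.2, 0.3, 0.4],  # worker 0's gradient chunk
    [0.5, 0.1, 0.0, 0.2],  # worker 1's gradient chunk
    [0.2, 0.2, 0.2, 0.2],  # worker 2's gradient chunk
]

# Each worker sends one chunk up and receives one aggregated chunk down:
# two chunk transfers per worker, independent of the number of peers.
print(switch_aggregate(worker_chunks))  # [0.8, 0.5, 0.5, 0.8]
```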

SwitchML technology provides a speedup in training throughput of up to 2.27x on a 100 Gbps network; the speedup is expected to be even higher with faster GPUs, which reduce the computation-to-communication ratio.

Intel is committed to advancing the world of intelligent Ethernet switches, such as Intel® Tofino™ 2 and 3 Intelligent Fabric Processor (IFP)-based switches, which will provide up to 400 Gbps per port.

To learn more about how SwitchML technology works and how it can benefit ML training, read the white paper, “Accelerating Distributed Machine Learning on Standard Ethernet Infrastructure with SwitchML Technology.”

[1] The cost of the communication overhead is extrapolated from Figures 8 and 9 of the Meta and AWS article by comparing the measured performance with 512 GPUs to the ideal performance under linear scaling (i.e., the same per-GPU throughput as in the 128-GPU case).

[2] For full configuration and testing details, refer to: A. Sapio, M. Canini, C.-Y. Ho, J. Nelson, P. Kalnis, C. Kim, A. Krishnamurthy, M. Moshref, D. R. K. Ports, and P. Richtárik, “Scaling Distributed Machine Learning with In-Network Aggregation,” in 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21), 2021. usenix.org/conference/nsdi21/presentation/sapio