Authors: Hesham Mostafa (Intel Labs), Adam Grabowski (AIA)
Highlights
- Intel Labs and AIA developed a new graph sampling method called “fused sampling” that achieves up to 2x speedup in training Graph Neural Networks (GNNs) on CPUs. The new sampling pipeline is now part of the Deep Graph Library (DGL), one of the most popular libraries for training GNNs.
- Intel Labs developed a new graph partitioning method, “hybrid partitioning,” that significantly speeds up distributed training of GNNs on large graphs, reducing epoch times by up to 30% on popular graph benchmarks.
- The combination of fused sampling and hybrid partitioning has set a new CPU record for training GNNs on the popular ogbn-papers100M benchmark, achieving a total FP32 training time of just 1.6 minutes on 16 2-socket machines, each equipped with two 4th Gen Intel Xeon Scalable Processors (Sapphire Rapids).
Graph Neural Networks (GNNs) have achieved state-of-the-art results in many graph-related tasks, such as link prediction in recommendation graphs, predicting physical properties of molecular graphs, and predicting high-level semantic features of nodes in citation graphs and social graphs. Graphs in many domains can grow to include millions of nodes and billions of edges, and training on the entire graph at once can quickly exhaust available memory. One of the most popular methods for training GNNs on large graphs is sampling-based training: in each training iteration, we randomly sample a small part of the graph (small enough to fit in available memory) and train the GNN on this graph sample, as illustrated in Fig. 1. The time spent sampling the graph in each iteration, however, can quickly overshadow the time spent on the GNN’s forward and backward passes, as shown in Fig. 2.
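To make the sampling-based training loop concrete, here is a minimal sketch using DGL's standard neighbor-sampling DataLoader and a small GraphSAGE model; `g`, `train_nids`, and `num_classes` are placeholders for your graph, training node IDs, and number of label classes, and the exact API may vary slightly across DGL versions.

```python
import torch
import torch.nn.functional as F
import dgl
import dgl.nn as dglnn

# A small 2-layer GraphSAGE model that operates on DGL message-flow-graph "blocks".
class SAGE(torch.nn.Module):
    def __init__(self, in_feats, hidden, n_classes):
        super().__init__()
        self.conv1 = dglnn.SAGEConv(in_feats, hidden, 'mean')
        self.conv2 = dglnn.SAGEConv(hidden, n_classes, 'mean')

    def forward(self, blocks, x):
        h = F.relu(self.conv1(blocks[0], x))
        return self.conv2(blocks[1], h)

# Placeholders: g is a DGLGraph with 'feat'/'label' node data; train_nids holds training node IDs.
sampler = dgl.dataloading.NeighborSampler([10, 10])       # sample up to 10 neighbors per layer
dataloader = dgl.dataloading.DataLoader(
    g, train_nids, sampler, batch_size=1024, shuffle=True, drop_last=False)

model = SAGE(g.ndata['feat'].shape[1], 256, num_classes)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for input_nodes, output_nodes, blocks in dataloader:
    x = g.ndata['feat'][input_nodes]    # features of all sampled input nodes
    y = g.ndata['label'][output_nodes]  # labels of the seed (output) nodes
    loss = F.cross_entropy(model(blocks, x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```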
To accelerate sampling-based training, the graph is often partitioned across several machines, as illustrated in Fig. 3. Each machine is responsible for generating its own graph samples and training the GNN model on them. Since the graph topology is partitioned across the machines, each machine needs to communicate with the other machines to construct a graph sample. This communication cost grows with the size of the graph samples, which in turn typically increases as the GNN model gets more layers.
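For a rough sense of how quickly samples grow with depth: with per-layer fanouts of, say, 15, 10, and 5, the maximum number of sampled nodes per seed grows multiplicatively with the number of layers. The snippet below computes this back-of-the-envelope upper bound (it ignores neighbor overlap and deduplication, so real samples are smaller).

```python
# Upper bound on sampled nodes per seed for given per-layer fanouts (ignoring overlap).
def max_sampled_nodes(fanouts):
    total, frontier = 1, 1          # start from one seed node
    for f in fanouts:
        frontier *= f               # each frontier node samples up to f neighbors
        total += frontier
    return total

print(max_sampled_nodes([15]))          # 16   (1-layer GNN)
print(max_sampled_nodes([15, 10]))      # 166  (2-layer GNN)
print(max_sampled_nodes([15, 10, 5]))   # 916  (3-layer GNN)
```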
Figure 1. Randomly sampling a small subgraph from a large graph in order to train a GNN on the small, sampled graph
In what follows, we describe two complementary approaches that address the large CPU sampling overhead currently incurred by popular machine learning libraries as well as the large communication overhead involved in distributed sampling-based training.
1. Fused Sampling
Graph sampling has to be done during each training iteration, so it is imperative that it be done as fast as possible. The typical sampling pipeline implemented in popular GNN libraries such as DGL involves multiple steps, each of which generates intermediate tensors that have to be written to, and then read back from, memory.
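A rough sketch of such a multi-step pipeline, composed from DGL's Python-level sampling primitives, is shown below. The real DGL internals differ in the details, but the structure is similar: every layer produces a frontier subgraph and a compacted block, each of which is materialized in memory before the next step can consume it.

```python
import dgl

def unfused_sample(g, seeds, fanouts):
    """Illustrative multi-step sampling pipeline (not DGL's actual internals)."""
    blocks = []
    for fanout in reversed(fanouts):
        # Step 1: sample a neighborhood frontier -> an intermediate subgraph in memory.
        frontier = dgl.sampling.sample_neighbors(g, seeds, fanout)
        # Step 2: compact the frontier into a bipartite "block", deduplicating and
        # relabeling node IDs -> more intermediate tensors.
        block = dgl.to_block(frontier, dst_nodes=seeds)
        # Step 3: the newly discovered source nodes become the seeds for the next layer.
        seeds = block.srcdata[dgl.NID]
        blocks.insert(0, block)
    return blocks
```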
Figure 2. Flame graph showing the fraction of time spent on graph sampling, forward pass, and backward pass. We used DGL and trained on a 2-socket 3rd Generation Intel® Xeon® processor.
Figure 3. Partitioning of a toy graph (with nine nodes) to enable distributed training. Traditional distributed GNN training libraries partition both the graph topology and the node features.
We developed a new sampling pipeline (details in this Request for Comments (RFC): https://github.com/dmlc/dgl/issues/5328 and this Pull Request (PR): https://github.com/dmlc/dgl/pull/5924) that fuses the sampling steps into a single optimized kernel. This yields large speedups in the graph sampling step, as shown in Fig. 4, and significantly faster total epoch times (sampling + GNN training) across different sampling parameters and numbers of GNN layers, as illustrated in Fig. 5.
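With fused sampling, those per-layer steps execute inside one optimized kernel instead of materializing intermediate tensors. Assuming a DGL version that ships this feature, enabling it can look like the sketch below; the `fused` flag shown here is the knob discussed in the linked RFC/PR, so check your DGL version's NeighborSampler documentation for the exact argument name and default.

```python
import dgl

# Assumption: the `fused` argument on NeighborSampler is available in your DGL build;
# when enabled on CPU, per-layer sampling, deduplication, and relabeling run in one kernel.
sampler = dgl.dataloading.NeighborSampler([15, 10, 5], fused=True)

dataloader = dgl.dataloading.DataLoader(
    g, train_nids, sampler, batch_size=1024, shuffle=True)
```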
2. Hybrid Partitioning
When a graph is too big to fit in the memory of a single training machine, the graph is often partitioned across multiple machines, and inter-machine communication is used to request and provide the graph data each machine needs to train the GNN model. We have observed that the features associated with the graph nodes typically take up the bulk of the graph representation, often accounting for more than 90% of the memory needed to represent the graph.
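A back-of-the-envelope estimate shows why node features dominate; the numbers below are illustrative (100M nodes, 1B edges, 256-dimensional float32 features), not measurements from any specific benchmark.

```python
# Rough memory footprint of topology vs. node features (illustrative numbers).
num_nodes = 100_000_000
num_edges = 1_000_000_000
feat_dim  = 256                      # float32 features, 4 bytes each

# CSR-style topology: one 8-byte index per edge plus one 8-byte offset per node.
topology_bytes = 8 * num_edges + 8 * num_nodes
feature_bytes  = 4 * feat_dim * num_nodes
total = topology_bytes + feature_bytes

print(f"topology: {topology_bytes / 2**30:.1f} GiB")          # ~8.2 GiB
print(f"features: {feature_bytes / 2**30:.1f} GiB")           # ~95.4 GiB
print(f"feature share: {100 * feature_bytes / total:.0f}%")   # ~92%
```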
Guided by this observation, we designed a new partitioning strategy, hybrid partitioning, that replicates the relatively small graph topology information (the graph’s adjacency matrix) across all training machines and partitions only the graph’s node features, as illustrated in Fig. 6. In distributed sampling-based GNN training, this greatly reduces the number of communication rounds, as the machines only need to exchange node features.
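Conceptually, a training machine under hybrid partitioning samples against its local copy of the topology and then assembles features with a single feature-only communication round. The sketch below illustrates this; it assumes each machine owns the features of a contiguous node-ID range, and `fetch_remote_features` is a hypothetical stand-in for the actual communication layer (e.g., an all-to-all over feature shards), not the implementation behind the linked work.

```python
import torch

def gather_features(sampled_ids, local_feats, id_range, fetch_remote_features):
    """Conceptual sketch: with the topology replicated, sampling is purely local;
    only node features require communication. Assumes this machine owns the feature
    rows for node IDs in id_range = (lo, hi); fetch_remote_features is a hypothetical
    helper standing in for the feature-exchange step between machines."""
    lo, hi = id_range
    is_local = (sampled_ids >= lo) & (sampled_ids < hi)

    feats = torch.empty(sampled_ids.numel(), local_feats.shape[1], dtype=local_feats.dtype)
    # Features owned by this machine: plain local indexing, no communication.
    feats[is_local] = local_feats[sampled_ids[is_local] - lo]
    # Features owned by other machines: one feature-only exchange round.
    feats[~is_local] = fetch_remote_features(sampled_ids[~is_local])
    return feats
```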
The combination of fused sampling and hybrid partitioning led to a large reduction in epoch times for distributed sampling-based GNN training, as illustrated in Fig. 7. Hybrid partitioning alone boosts performance, and combining it with fused sampling improves epoch times by more than 2x. Using hybrid partitioning and fused sampling, we achieve a record-setting total FP32 training time on ogbn-papers100M of just 1.6 minutes on 16 2-socket machines.
Figures 4(a) and 4(b). Acceleration of the graph sampling time for two popular graph benchmarks. The baseline (DGL concurrent hashes) is the optimized sampling implementation in DGL.
Figure 5. Speedups of total epoch time for ogbn-papers100M when using different sampling parameters (sampling fanouts and batch sizes).
Figure 6. Hybrid partitioning of the graph. The graph topology is replicated across machines while node features are partitioned.
Figure 7. Epoch times for distributed sampling-based GNN training on ogbn-papers100M on 8 and 16 machines. We used a 3-layer GraphSage model with hidden layer dimension 256. Each machine is a 2-socket machine equipped with two 4th Gen Intel Xeon Scalable Processors. All machines were connected