A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training
Increasing performance on Deep Learning (DL) workloads is important to the advancement of artificial intelligence. DL has two phases: training and inference, both of which run on a Deep Neural Network (DNN). During training, algorithms build mathematical models based on sample data. After training is complete, the inference phase analyzes new data to make predictions based on the models created. Both training and inference are demanding processes that require significant compute resources, and these demands have grown steadily since the early days of AI. This has led to the use of purpose-built accelerators such as TPUs and GPUs for AI workloads.
The single most commonly used algorithmic function in DL training is called “GEMM”—General Matrix-Matrix Multiplication. Part of the Basic Linear Algebra Subprograms (BLAS) library, GEMMs do what their name says: they multiply one input matrix by a second input matrix to produce a third, output matrix. GEMMs matter to just about anyone working in scientific computing today, and they play a critical role in DL, where the pressure to improve computational performance never lets up.
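For readers who have not met GEMM directly, here is a minimal sketch of the standard BLAS-style operation C = alpha * A * B + beta * C in plain Python/NumPy. The matrix sizes and random data are arbitrary example values, not anything specific to DL training.

```python
import numpy as np

def gemm(alpha: float, A: np.ndarray, B: np.ndarray,
         beta: float, C: np.ndarray) -> np.ndarray:
    """BLAS-style GEMM: C <- alpha * A @ B + beta * C,
    where A is M x K, B is K x N, and C is M x N."""
    return alpha * (A @ B) + beta * C

# Example: multiply two small matrices and accumulate into C.
rng = np.random.default_rng(0)
A, B, C = rng.random((4, 3)), rng.random((3, 5)), rng.random((4, 5))
C = gemm(1.0, A, B, 0.0, C)   # with beta = 0 this is a plain matrix product
```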
But GEMMs are not perfect. In fact, GEMMs used in DL suffer from three problems that leave significant room for improvement:
- The first problem is data irregularity. GEMMs are typically executed on systolic arrays, which are popular because they enable efficient data reuse and are simple to implement. A systolic array has a fixed, usually square, shape, such as 2x2 or 8x8. But the matrices in DL operations are often irregular (for instance, 2 x 10), and such shapes don’t map cleanly onto a rigid array (as shown in Figure 1). During computation, many “empty” cells must still be handled, significantly reducing computational efficiency and increasing compute time; a toy utilization model appears in the sketch after this list.
- The second major problem, tensor sparsity, is an inherent characteristic of modern neural networks and can be exposed via techniques like sparsity-aware training or distillation. As shown in Figure 2, the problem is, again, an inefficient matrix organization: the processor spends many cycles multiplying and accumulating zeros to produce relatively few useful results.
- The third challenge is scalability. There should be an efficient, consistent recipe for building both small and large GEMM engines. This gives architects the flexibility to compose a system from many small cores with small GEMM engines or from a few large cores with large GEMM engines.
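To make the cost of irregularity concrete, here is a back-of-the-envelope sketch that estimates how much of a fixed, square systolic array does useful work when an irregularly shaped operand is tiled onto it. The 8x8 array size and the example shapes are illustrative assumptions, not parameters of any particular accelerator.

```python
import math

def utilization(rows: int, cols: int, array_dim: int = 8) -> float:
    """Fraction of processing elements doing useful work when a
    rows x cols operand is tiled onto a fixed array_dim x array_dim array.
    Each partial tile is padded with 'empty' cells up to the full array size."""
    tiles_r = math.ceil(rows / array_dim)
    tiles_c = math.ceil(cols / array_dim)
    mapped_cells = tiles_r * tiles_c * array_dim * array_dim  # cells occupied, incl. padding
    useful_cells = rows * cols                                 # cells that hold real data
    return useful_cells / mapped_cells

# A square 8x8 operand fills an 8x8 array perfectly ...
print(utilization(8, 8))    # 1.0
# ... but an irregular 2x10 operand leaves most of the array idle.
print(utilization(2, 10))   # ~0.16
```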
Improving GEMM Performance
GEMMs account for about 70% of all compute cycles during training, so anything that can accelerate them will have a positive effect on overall performance. A number of custom accelerators have been proposed for GEMMs. What’s the best acceleration approach?
We looked at ongoing development trends for GEMMs and came up with three goals for improving GEMM engines:
- Flexibility: GEMM engines should be able to efficiently run matrices of a wide range of dimensions. As mentioned earlier, this has not always been the case in the past.
- Sparsity Support: Hardware resources must be used as efficiently as possible, which means minimizing the cycles wasted on cells that contain zeros; a minimal software analogy appears in the sketch after this list.
- Scalability: GEMM performance needs to scale efficiently across a range of different types of accelerators—for example, from small cores in CPUs or GPUs to large cores in future TPUs.
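As a software analogy for what sparsity support means in practice, the sketch below multiplies matrices while skipping the zero entries of one operand, so the work scales with the number of non-zeros rather than with the full matrix dimensions. The function name and the 80% sparsity level are illustrative assumptions, not details of the SIGMA design.

```python
import numpy as np

def sparse_gemm(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Multiply A (M x K) by B (K x N), visiting only the non-zero entries of A.
    Work scales with nnz(A) * N instead of M * K * N."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    rows, cols = np.nonzero(A)          # coordinates of non-zero operands
    for i, k in zip(rows, cols):
        C[i, :] += A[i, k] * B[k, :]    # one row update per non-zero of A
    return C

# Roughly 80% of A is zero, so roughly 80% of the multiply-accumulates are skipped.
rng = np.random.default_rng(0)
A = rng.random((64, 64)) * (rng.random((64, 64)) > 0.8)
B = rng.random((64, 64))
assert np.allclose(sparse_gemm(A, B), A @ B)
```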
SIGMA: A New Engine for Sparse and Irregular GEMM Acceleration
However, state-of-the-art accelerators and GPUs do not adequately meet all of the goals above. We joined together with researchers from the Georgia Institute of Technology to develop a new approach, and our results won the Best Paper Award at the 2020 IEEE International Symposium on High-Performance Computer Architecture. We call it SIGMA (short for Sparse and Irregular GEMM Accelerator), an architecture for DNN training designed specifically to accelerate sparse and irregular GEMMs using flexible interconnects. The fundamental building block of SIGMA’s compute fabric is a processor called the Flexible Dot Product Engine (Flex-DPE). Scalability is achieved by combining several Flex-DPEs on the SIGMA fabric, with GEMMs scheduled across this fabric. The Flex-DPE design enables mapping of both dense and sparse GEMMs, as well as regular and irregular ones.
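As a rough illustration of why composing an engine from more Flex-DPE units can scale, here is a back-of-the-envelope model that treats ideal GEMM runtime as total multiply-accumulate work divided by the number of available multipliers. The unit counts and the perfect-utilization assumption are ours for illustration, not figures from the SIGMA paper.

```python
import math

def ideal_gemm_cycles(M: int, K: int, N: int,
                      num_flex_dpe: int, mults_per_dpe: int = 64) -> int:
    """Ideal cycle count for an M x K times K x N GEMM on an engine built
    from num_flex_dpe units, each with mults_per_dpe multipliers, assuming
    perfect utilization (a deliberately optimistic toy model)."""
    macs = M * K * N                              # total multiply-accumulate operations
    multipliers = num_flex_dpe * mults_per_dpe    # compute resources available
    return math.ceil(macs / multipliers)

# Doubling the number of Flex-DPE units roughly halves the ideal runtime,
# provided the interconnect can keep every multiplier fed.
for units in (1, 2, 4, 8):
    print(units, ideal_gemm_cycles(128, 128, 128, units))
```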
To evaluate SIGMA, we compared it against other state-of-the-art accelerators, and we ran a battery of simulations. Here are the results:
SIGMA delivers flexibility: In keeping with the stated goals, SIGMA maps matrices of any aspect ratio onto its compute fabric, eliminating the requirement for rigid dimensions. In essence, it converts an irregularly shaped workload into a better-organized, regular one. The result: an average of 1.8x faster* processing of irregular GEMMs.
SIGMA delivers better handling of sparsity: It eliminates much of the performance degradation caused by data sparsity by mapping only non-zero values. A “sparsity filter” reorganizes the data for more efficient processing, reducing the number of cycles required for computation; a toy software analogy of this filtering appears after these results. This yields an average of 3x faster* performance on sparse workloads (and up to 5.7x faster* processing for GEMMs that are both sparse and irregular). As with irregular data, the remapped matrices contain a higher percentage of occupied cells, improving efficiency.
SIGMA delivers more scalable performance: Previous GEMM engines have not been scalable, so efficiency dropped off significantly as the amount of data to be processed increased. Thanks to the Flex-DPE architecture, SIGMA enables “small networks to all networks” scalability via a high-bandwidth bus.
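As a loose software analogy for the “sparsity filter” idea, the sketch below compacts a sparse row into a bitmap plus a packed list of non-zero values, so only the occupied values would need to be assigned to compute units. The bitmap representation and function names are our illustration, not SIGMA’s actual hardware mechanism.

```python
import numpy as np

def compact_row(row: np.ndarray):
    """Split a sparse row into (bitmap, packed non-zero values).
    Only the packed values would be streamed to compute units;
    the bitmap records where each value belongs."""
    bitmap = row != 0
    return bitmap, row[bitmap]

def expand_row(bitmap: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Reconstruct the original row from the bitmap plus packed values."""
    row = np.zeros(bitmap.shape, dtype=values.dtype)
    row[bitmap] = values
    return row

row = np.array([0.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 3.0])
bitmap, packed = compact_row(row)
print(packed)                    # [1.5 2. 3.] -> only 3 of 8 slots need compute
assert np.array_equal(expand_row(bitmap, packed), row)
```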
Summing It Up
GEMM is a critically important component of Deep Learning systems. But the workloads run on those systems are often irregular and sparse, which causes performance problems for existing rigid arrays and limits how well they scale. SIGMA helps solve this problem: by providing flexibility and scalability, it keeps processing elements highly utilized regardless of data irregularity or sparseness. SIGMA streamlines processing by efficiently mapping data arrays so there are fewer “empty” matrix cells requiring processing. The result is improved AI performance and higher effective TFLOPs/W (teraflops per watt) for a wide range of AI workloads*.
As promising as SIGMA is, the possibilities for future enhancements are many. We are investigating other GEMM optimizations, including power gating and software stack design. It’s likely that GEMMs will continue to be enhanced to enable further improvement in AI performance well into the future.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
Other names and brands may be claimed as the property of others.
© Intel Corporation