Applying the Roofline Model for Deep Learning Performance Optimizations

MaryT_Intel · ‎09-14-2020

Introduction

When optimizing the PaddlePaddle framework we rely mostly on the oneDNN library, as it provides optimized implementations of Deep Learning algorithms (primitives). During our work we often communicate with the oneDNN team and share our feedback and expectations regarding these primitives. To keep our expectations of the functionality delivered by oneDNN realistic, we developed a tool that draws a roofline plot for the computer in use and places the evaluated Deep Learning primitive on that plot. That way we can see the processor utilization of the evaluated oneDNN primitive in the context of the computational resources available for this kind of algorithm (in terms of arithmetic intensity).

The Roofline Model

The Roofline model is a methodology for visual representation of platforms that can be used to:

• Estimate boundaries for performance gain from introducing new features e.g. multithreading and vectorization
• Estimate limitations for improvement of a kernel’s implementation
• Explain efficiency of an existing kernel
• Compare performance of computing platforms

The Roofline model ties a kernel’s representation to the platform capabilities (represented by the roof), so the maximal performance of an evaluated kernel is bounded by the roof at the kernel's arithmetic intensity.

Simplified example of roofline model:

The Roofline model relates the performance of the computer to the memory traffic between the caches and DRAM. The model uses arithmetic intensity (operations per byte of DRAM traffic), which counts the total bytes transferred to main memory after they have been filtered by the cache hierarchy. This gives the DRAM bandwidth needed by a kernel, which can reveal bottlenecks on the tested machine. The Roofline model is a 2D graph based on floating-point performance, memory performance and arithmetic intensity.

To plot the Roofline model, we needed to gather characteristics of the computing platform and algorithm implementation (referred to as kernel) executed on that device, namely:

• Peak computational efficiency: π
• Peak memory throughput: T
• Amount of Floating point operations of kernel (Work) : W
• Memory traffic of kernel: Q
• Time of execution of kernel (Runtime): R
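The quantities listed above combine into the roofline bound in the standard way: arithmetic intensity I = W / Q, attainable performance min(π, T · I), and achieved performance W / R. The following is a minimal sketch of that arithmetic, not the tool described in this article; the function names and all numbers in the usage part are illustrative assumptions, not measurements from this work.

```python
def arithmetic_intensity(work_flops, traffic_bytes):
    """I = W / Q, in FLOPs per byte of memory traffic."""
    return work_flops / traffic_bytes


def roof(peak_flops, peak_bandwidth, intensity):
    """Attainable performance at a given intensity: min(pi, T * I)."""
    return min(peak_flops, peak_bandwidth * intensity)


def achieved_performance(work_flops, runtime_s):
    """P = W / R, in FLOP/s."""
    return work_flops / runtime_s


def utilization(work_flops, traffic_bytes, runtime_s, peak_flops, peak_bandwidth):
    """Fraction of the roof that the kernel actually reaches."""
    i = arithmetic_intensity(work_flops, traffic_bytes)
    return achieved_performance(work_flops, runtime_s) / roof(peak_flops, peak_bandwidth, i)


# Hypothetical platform and kernel (made-up numbers, for illustration only).
PI = 100e9   # peak compute: 100 GFLOP/s
T = 10e9     # peak memory throughput: 10 GB/s
W = 2e9      # kernel work: 2 GFLOP
Q = 1e8      # kernel memory traffic: 100 MB
R = 0.04     # kernel runtime: 40 ms

I = arithmetic_intensity(W, Q)   # 20 FLOP/byte
bound = roof(PI, T, I)           # T * I exceeds PI, so the kernel is compute bound
print(I, bound, utilization(W, Q, R, PI, T))
```

With these made-up inputs the kernel sits under the flat (compute) part of the roof and reaches about half of it; a kernel with an intensity of 1 FLOP/byte on the same platform would instead be limited by the sloped (memory) part.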

Measurements:

We implemented a program to benchmark the hardware platform in use (processor and memory) and to measure all characteristics listed in the previous section for selected kernels (oneDNN primitives):

• activation (GELU)
• convolution
• inner product
• layer normalization
• pooling (average)

These characteristics were obtained by reading the Performance Monitoring Units (PMU).

A detailed description of this methodology is presented in the full article related to this work.

All measurements were taken for each of the following use cases of the target processor:

1. Single-threaded execution
2. Single socket execution
3. Two sockets execution.

Analysis

We started our analysis with the convolution primitive using only single-threaded execution. This is an applicable use case for the PaddlePaddle deep learning framework, which is optimized for single-threaded execution. We plotted the Roofline model of convolution operations using a fixed size of data to process in three sub-cases (vertical dashed lines from left to right in the Figure):

• Execution of convolution using Winograd*[9] algorithm
• Execution of convolution using NCHW data arrangement
• Execution of convolution using NCHW16C (blocked) data arrangement

First, we had three different convolutional kernels on the Roofline plot.

Apart from the relative utilization of compute capabilities (runtime compute) we also measured relative execution time (ET). NCHW convolution is the slowest, so we denoted its ET as 100%. We can see that the NCHW16C convolutional kernel is implemented considerably more efficiently, as it utilizes 86% of peak compute, as opposed to the NCHW convolutional kernel, which uses only 48% of the available computational resources. This comparison is intuitive: the two kernels are different implementations of conceptually the same algorithm, performing the same mathematical operations with roughly the same number of floating-point operations. Winograd convolution, on the other hand, is a totally different algorithm, which ultimately produces the same results using a different calculation method. Hence, comparing kernels that implement totally different algorithms makes limited sense; the Roofline model tells us more about how well a given kernel utilizes the resources of the computing platform. We can see that the utilization of Winograd convolution is much lower (31%), yet it is the fastest of the three presented. The implementation of Winograd therefore has room for improvement, as its runtime compute is far from the roof. Although Winograd is the fastest, its application is limited to specific sizes of convolutional kernels, so the direct convolution algorithm is of much wider use.

Next we compared two implementations of direct convolution: NCHW versus the cache- and vectorization-friendly NCHW16C. The Intel oneDNN library implements the idea of layout propagation*[4]: the input of a convolutional model is converted from its original data arrangement to a blocked data arrangement (for example NCHW8C or NCHW16C), and all subsequent deep learning operations (convolutions, normalization, non-linearities) then work on this data arrangement. Blocked data arrangements help to ensure that all data used by a vector instruction (AVX, AVX2, AVX512) comes from the same single cache line, thus reducing memory latency and helping to achieve higher computational utilization. We can see that the percentage of total compute utilization is much higher for NCHW16C than for the NCHW data arrangement. The most compute-friendly scenarios, such as convolution executed using the NCHW16C data layout, achieve over 86.0% of the maximal FLOPS available on the processor. Such a high compute utilization rate indicates that further optimization of this implementation (without conceptual redesign or changing the convolutional algorithm) will be difficult. It may be easier to change the algorithm to a more efficient one, if one exists. One option may be to replace direct convolution with Winograd*[9] convolution (if applicable), as discussed at the beginning of this section.
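The cache line argument can be illustrated with plain index arithmetic. The sketch below assumes float32 data, 64-byte cache lines, a buffer aligned to a cache line, and a channel count that is a multiple of 16; the helper names and the tensor shape are hypothetical and are not part of oneDNN. It counts how many cache lines hold 16 consecutive channels of one pixel in each layout.

```python
BLOCK = 16          # channels per block in NCHW16C
FLOAT_BYTES = 4     # float32 (assumption)
CACHELINE = 64      # bytes per cache line (assumption, typical for x86)


def offset_nchw(n, c, h, w, C, H, W):
    """Element offset of (n, c, h, w) in plain NCHW."""
    return ((n * C + c) * H + h) * W + w


def offset_nchw16c(n, c, h, w, C, H, W):
    """Element offset in NCHW16C, laid out as [N][C/16][H][W][16]."""
    assert C % BLOCK == 0, "sketch assumes C is a multiple of 16"
    cb, ci = divmod(c, BLOCK)
    return (((n * (C // BLOCK) + cb) * H + h) * W + w) * BLOCK + ci


def cachelines_touched(offset_fn, C, H, W, h=0, w=0):
    """Set of cache lines holding channels 0..15 of pixel (h, w) in image 0."""
    return {offset_fn(0, c, h, w, C, H, W) * FLOAT_BYTES // CACHELINE
            for c in range(BLOCK)}


C, H, W = 32, 8, 8   # hypothetical tensor shape
print(len(cachelines_touched(offset_nchw, C, H, W)))      # 16
print(len(cachelines_touched(offset_nchw16c, C, H, W)))   # 1
```

In plain NCHW the 16 channel values of a pixel are H · W elements apart, so each lands in a different cache line, while in NCHW16C they are adjacent and fill exactly one 64-byte line.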

Next, we ran experiments on the convolution primitives while allowing them to use all computational and memory resources of a single socket (single node). We can see that the respective compute utilization is slightly lower than in the single-threaded case:

• Winograd convolution: from 31.54% to 29.30%
• Direct NCHW convolution: from 48.73% to 45.68%
• Direct NCHW16C convolution: from 86.72% to 78.01%
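To put the three changes on a common scale, the relative loss of utilization can be computed directly from the percentages listed above. This is only arithmetic on the reported figures; the helper name is our own.

```python
# Utilization figures (percent of peak compute) as reported in the list above.
single_thread = {"Winograd": 31.54, "Direct NCHW": 48.73, "Direct NCHW16C": 86.72}
single_socket = {"Winograd": 29.30, "Direct NCHW": 45.68, "Direct NCHW16C": 78.01}


def relative_drop(before, after):
    """Relative loss of utilization, in percent of the 'before' value."""
    return 100.0 * (before - after) / before


drops = {name: relative_drop(single_thread[name], single_socket[name])
         for name in single_thread}
for name, drop in drops.items():
    print(f"{name}: {drop:.1f}% relative drop")
```

The relative drop comes out at roughly 7% for Winograd, 6% for direct NCHW and 10% for direct NCHW16C, so the most highly utilized kernel loses the most when moving to a full socket.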

We attribute this partially to the handling of multithreading and partially to memory prefetcher and cache limitations. Without deeper analysis it is difficult to conclude more than that it is easier to implement an efficient single-threaded kernel than a multi-threaded one. Another observation drawn from the presented Roofline model is that, as we move execution of the evaluated convolutions from a single thread to one socket or to two sockets, the less efficient implementations start to become memory bound. The explanation is not related to the algorithms: the ridge point of the Roofline model moved further to the right. This is because the memory bandwidth available per thread when all hardware threads are in use is lower than in the case of single-threaded execution.
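The shift can be made concrete with the ridge point, the arithmetic intensity at which the memory roof meets the compute roof (π / T). The sketch below uses hypothetical platform numbers, chosen only so that peak compute grows faster than memory bandwidth when going from one thread to a full socket; they are not the measured values from this work.

```python
def ridge_point(peak_flops, peak_bandwidth):
    """Arithmetic intensity (FLOP/byte) where the memory roof meets the compute roof."""
    return peak_flops / peak_bandwidth


def is_memory_bound(intensity, peak_flops, peak_bandwidth):
    """A kernel left of the ridge point is limited by memory throughput."""
    return intensity < ridge_point(peak_flops, peak_bandwidth)


# Hypothetical numbers: the socket has 20x the compute of one thread
# but only 8x the memory bandwidth.
one_thread = {"peak_flops": 100e9, "peak_bandwidth": 12.5e9}
one_socket = {"peak_flops": 2000e9, "peak_bandwidth": 100e9}

kernel_intensity = 12.0  # FLOP/byte, hypothetical kernel

print(ridge_point(**one_thread))                         # 8.0
print(ridge_point(**one_socket))                         # 20.0
print(is_memory_bound(kernel_intensity, **one_thread))   # False
print(is_memory_bound(kernel_intensity, **one_socket))   # True
```

The kernel itself is unchanged, with the same intensity in both cases, yet it moves from the compute-bound to the memory-bound side of the plot because the ridge point moved to the right.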

Apart from analyzing compute-bound primitives like convolution or inner product, we analyzed memory-bound primitives: activations, layer normalization and average pooling. We attempted to analyze the pooling primitive with the Roofline model using the two most popular pooling algorithms:

• max pooling
• average pooling

The methodology used in this work is not applicable to max pooling, as max pooling consists of data movement and max operations, which are not recognized as FLOPS and are not traced by the relevant FLOPS PMU counters. Therefore the counted work value would not be representative or useful. In this paper, we present only the Roofline plots for average pooling. The following figure shows that the arithmetic intensity for the NCHW and the blocked (NCHW16C) data arrangements with cold caches is almost the same.

The same observation applies to the warmed caches scenario. This is not very surprising in itself, but an interesting observation is that there is a huge difference in the percentage of CPU compute utilization. The implementation using the NCHW data arrangement achieved 0.35% compute utilization, while the NCHW16C implementation utilizes around 14.8%, which is over 42x better utilization. We found this interesting and searched for an explanation. The Intel oneDNN library can work in verbose mode to provide details of internal execution, as presented below:

• NCHW:

dnnl_verbose,exec,cpu,pooling,simple_nchw:any,forward_inference,...

• NCHW16C:

dnnl_verbose,exec,cpu,pooling,jit:avx512_common,forward_inference,...

Based on those outputs we can see that NCHW uses an average pooling implementation named simple_nchw, and the blocked data arrangement uses the jit:avx512_common implementation. The former is a naive C++ implementation and the latter is assembly code generated at runtime, implemented using the Xbyak*[14] project. NCHW pooling requires operations within a SIMD register (as the spatial dimension has stride 1), while NHWC and NCHW16C pooling can operate directly on whole registers. This is the primary reason for NCHW being so low on compute utilization.
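Reading the implementation name off such lines is easy to automate. In oneDNN releases of this period, verbose output is enabled by setting the DNNL_VERBOSE environment variable (for example DNNL_VERBOSE=1) before running the workload. The parser below is a minimal sketch with a hypothetical helper name; it names only the leading fields visible in the excerpts above and leaves the elided remainder of each line unparsed.

```python
def parse_verbose_line(line):
    """Split the leading fields of a dnnl_verbose 'exec' line.

    Only the six fields visible in the excerpts are named; everything after
    them is returned untouched because its layout is not shown here.
    """
    fields = [f.strip() for f in line.split(",")]
    if len(fields) < 6 or fields[0] != "dnnl_verbose":
        raise ValueError("not a dnnl_verbose exec line")
    _marker, event, engine, primitive, impl, prop_kind = fields[:6]
    return {
        "event": event,
        "engine": engine,
        "primitive": primitive,
        "implementation": impl,
        "prop_kind": prop_kind,
        "rest": fields[6:],
    }


# The two excerpts from the article, with the elided tail kept as "...".
lines = [
    "dnnl_verbose,exec,cpu,pooling,simple_nchw:any,forward_inference,...",
    "dnnl_verbose,exec,cpu,pooling,jit:avx512_common,forward_inference,...",
]
for line in lines:
    info = parse_verbose_line(line)
    print(info["primitive"], "->", info["implementation"])
```

Grouping a full verbose log by the implementation field is a quick way to spot primitives that fall back to a reference implementation instead of a JIT kernel.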

Acknowledgment

The authors would like to express their gratitude to Krzysztof Badziak and Mateusz Ozga from Intel Corporation for their advice on optimizations, to Andres Rodriguez, Evarist Fomenko and Emily Hutson for reviewing this article, and to Michal Lukaszewski and Michal Chruscinski for providing and preparing the platform to run experiments on.

More information

More data, a description of the methodology and the configuration used are included in the full article related to this work.

Notices and Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: www.intel.com/benchmarks.