Multithreading performance of MKL DGEMM on Xeon

Peter_Johnson · ‎11-24-2020

Dear All,

When running DGEMM with MKL on Xeon platforms, I noticed that hyperthreading sometimes improves performance and sometimes does not. Please refer to the table below. All the machines have hyperthreading enabled in the BIOS. The thread count is set with export MKL_NUM_THREADS=XX from the terminal.

	W-2255	Silver 4216	Gold 6242
Spec: core/thread	10C/20T, 1 node	16C/32T, 2 nodes	16C/32T, 1 node
threads = physical cores	892 GFLOPS (10T)	784 GFLOPS (32T)	1156 GFLOPS
peak performance	1056 GFLOPS	819 GFLOPS	1280 GFLOPS
threads = logical cores	same as 10T	same as 32T	1938 GFLOPS
peak performance	- [1]	- [2]	2560 GFLOPS [3] ?

I put a dash at [1] and [2] because I am not sure how to calculate the peak performance in those cases. Taking the Intel Xeon Gold 6242 as an example, its turbo boost frequency under AVX-512 with all 16 cores active is 2.5 GHz. Therefore, its 16-core peak performance should be:

FLOP/cycle/core: 512 (bit) / 64 (bit) * 2 (FLOP/FMA) * 2 (FMA units/core) = 32

2.5 GHz * 32 FLOP/cycle/core * 16 cores = 1280 GFLOPS.
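The arithmetic above can be written as a small helper. This is only a sketch of the calculation in this post; the function name `peak_gflops` and its parameter names are made up for illustration, and the frequency has to be supplied by hand.

```python
def peak_gflops(cores, ghz, vector_bits=512, fma_units=2, element_bits=64):
    """Theoretical double-precision peak in GFLOPS (illustrative helper)."""
    lanes = vector_bits // element_bits      # doubles per vector register: 8
    flop_per_cycle = lanes * 2 * fma_units   # each FMA counts as 2 FLOP
    return cores * flop_per_cycle * ghz      # GHz * FLOP/cycle/core * cores

gold_6242 = peak_gflops(cores=16, ghz=2.5)
print(f"{gold_6242:.1f} GFLOPS")  # 1280.0 GFLOPS
```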

However, with 32 threads the DGEMM performance increases further to 1938 GFLOPS, exceeding the theoretical peak calculated from the number of physical cores.

So the first question is: how do we calculate the theoretical peak performance of a platform that supports Intel Hyper-Threading Technology? Do we use the number of physical cores or the number of logical cores?

The second question: why does hyperthreading appear not to benefit MKL DGEMM performance on my W-2255 and Silver 4216?

Thank you very much for your time!

McCalpinJohn · ‎12-01-2020

Are you sure that the Gold 6242 system only has one processor installed?

The peak performance for AVX-512 code on a single Gold 6242 processor will depend on the average frequency that the processor can sustain. The 6242 has a "base" AVX-512 frequency of 1.9 GHz and a maximum all-core AVX-512 frequency of 2.5 GHz, giving 972.8 GFLOPS to 1280 GFLOPS as the expected range of peak performance.

For large enough (square) matrices, DGEMM performance typically asymptotes to about 90% of the peak performance (based on the actual frequency sustained during the run). This leads to sustained performance estimates in the range of 875 GFLOPS to 1152 GFLOPS. In my experience, the frequency of Xeon Scalable Processors running DGEMM is pretty close to the maximum. (Running HPL, on the other hand, usually brings the frequency down closer to the "base" AVX-512 frequency. Both benchmarks spend almost all their time in the DGEMM kernel, but there are differences in the amount of data motion that may account for the difference in power consumption. I still have not gotten around to exploring this one.)
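As a rough check, the expected range for one Gold 6242 can be reproduced and compared with the two measurements from the original table. This is a sketch under the assumptions stated in this reply (32 FLOP/cycle per physical core, 16 cores, roughly 90% DGEMM efficiency); the constant and function names are illustrative only.

```python
FLOP_PER_CYCLE = 32       # per core: 8 doubles * 2 FLOP per FMA * 2 FMA units
CORES = 16                # physical cores in one Gold 6242
DGEMM_EFFICIENCY = 0.90   # "about 90%" rule of thumb from the post

def sustained_gflops(ghz, efficiency=DGEMM_EFFICIENCY):
    """Expected sustained DGEMM rate at a given average frequency."""
    return CORES * FLOP_PER_CYCLE * ghz * efficiency

low = sustained_gflops(1.9)    # at the "base" AVX-512 frequency
high = sustained_gflops(2.5)   # at the maximum all-core AVX-512 frequency
print(f"expected DGEMM range: {low:.1f} to {high:.1f} GFLOPS")

single_socket_peak = CORES * FLOP_PER_CYCLE * 2.5
for measured in (1156, 1938):  # GFLOPS values reported in the table
    ratio = measured / single_socket_peak
    print(f"{measured} GFLOPS is {ratio:.0%} of the single-socket peak")
```

The 1156 GFLOPS result comes out at about 90% of the single-socket peak, while 1938 GFLOPS comes out at about 151%, which is why it is worth checking how many processors are installed.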

For the Silver 4216, the peak performance is computed differently because there is only one AVX-512 FMA unit per core. In AVX-512 mode, the peak for a single socket is 16 cores * 16 FLOPS/cycle * 1.6 GHz = 409.6 GFLOPS (based on the maximum all-core AVX-512 frequency of 1.6 GHz), or as low as 281.6 GFLOPS at the "base" AVX-512 frequency of 1.1 GHz. (With only one AVX-512 FMA unit, the frequency is unlikely to drop much below 1.6 GHz.) You may actually get better performance on the "single-AVX-512-FMA" processors by running 256-bit AVX2 code. The peak for 256-bit code is also 16 FLOPS/cycle (2 256-bit FMAs vs 1 AVX-512 FMA), and the maximum all-core frequency is higher -- 2.3 GHz on the Silver 4216.
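The Silver 4216 numbers can be tabulated the same way. This is a sketch that uses only the frequencies quoted above; the 256-bit peak figure is not stated in the post and simply follows from 16 FLOPS/cycle at 2.3 GHz. The function and constant names are illustrative.

```python
CORES_PER_SOCKET = 16
FLOP_PER_CYCLE = 16   # 1 AVX-512 FMA unit or 2 256-bit FMA units: same count

def peak_gflops(ghz, sockets=1):
    """Theoretical peak for the Silver 4216 at a given all-core frequency."""
    return sockets * CORES_PER_SOCKET * FLOP_PER_CYCLE * ghz

avx512_max = peak_gflops(1.6)    # maximum all-core AVX-512 frequency
avx512_base = peak_gflops(1.1)   # "base" AVX-512 frequency
avx2_max = peak_gflops(2.3)      # maximum all-core frequency for 256-bit code

print(f"AVX-512 max:  {avx512_max:.1f} GFLOPS per socket")
print(f"AVX-512 base: {avx512_base:.1f} GFLOPS per socket")
print(f"256-bit max:  {avx2_max:.1f} GFLOPS per socket")
print(f"AVX-512 max, two sockets: {peak_gflops(1.6, sockets=2):.1f} GFLOPS")
```

With two sockets the AVX-512 figure is 819.2 GFLOPS, which matches the 819 GFLOPS peak listed for the Silver 4216 system in the original table.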

There is a magic environment variable that can be used to force MKL to run a different code path. I have not tried this recently, but MKL_DEBUG_CPU_TYPE=5 forced execution using AVX2/256-bit instructions the last time I tested it.

jimdempseyatthecove · ‎01-06-2021

What are your environment variable settings (those pertinent to OpenMP and MKL)?

You list MKL_NUM_THREADS but do not list:

KMP_AFFINITY
KMP_HW_SUBSET
OMP_NUM_THREADS
OMP_PLACES
OMP_PROC_BIND
(any others if applicable)

On the W-2255, try (for 1t/core):

MKL_NUM_THREADS=
OMP_NUM_THREADS=
OMP_PLACES=
OMP_PROC_BIND=
KMP_AFFINITY=scatter
KMP_HW_SUBSET=10c1t

and for 2t/core:

MKL_NUM_THREADS=
OMP_NUM_THREADS=
OMP_PLACES=
OMP_PROC_BIND=
KMP_AFFINITY=scatter
KMP_HW_SUBSET=10c2t

For Silver, change the "10c" to "16c" for both tests.

At times, the environment variables will conflict with one another. In other words, use only one method for pinning.
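One way to keep the two test configurations consistent is to build the environment programmatically. The sketch below is an interpretation, not a tested recipe: it treats the empty assignments above as "remove the variable", the helper name `pinning_env` is hypothetical, and `./dgemm_bench` is a placeholder for whatever benchmark binary is being run. The KMP_* variables apply to the Intel OpenMP runtime.

```python
import os

# Variables the lists above leave empty; this sketch removes them so they
# cannot conflict with KMP_HW_SUBSET (an assumption about the intent).
CLEARED = ("MKL_NUM_THREADS", "OMP_NUM_THREADS", "OMP_PLACES", "OMP_PROC_BIND")

def pinning_env(cores, threads_per_core, base=None):
    """Return a copy of the environment with a single pinning method set."""
    env = dict(os.environ if base is None else base)
    for name in CLEARED:
        env.pop(name, None)
    env["KMP_AFFINITY"] = "scatter"
    env["KMP_HW_SUBSET"] = f"{cores}c{threads_per_core}t"
    return env

w2255_1t = pinning_env(10, 1)   # one thread per core
w2255_2t = pinning_env(10, 2)   # two threads per core
print(w2255_1t["KMP_HW_SUBSET"], w2255_2t["KMP_HW_SUBSET"])  # 10c1t 10c2t

# Hypothetical launch; the binary name is a placeholder:
# import subprocess
# subprocess.run(["./dgemm_bench"], env=w2255_2t, check=True)
```

For the Silver system, the same helper would be called with 16 instead of 10.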

Jim Dempsey

Khang_N_Intel · ‎05-24-2021

Closing the ticket! The user never replied to Jim's request for more information!