When running DGEMM with MKL on Xeon platforms, I noticed that hyperthreading sometimes improves performance and sometimes does not; please refer to the table below. All machines have hyperthreading enabled in the BIOS. Thread counts are set with `export MKL_NUM_THREADS=XX` in the terminal.
| | W-2255 | Silver 4216 | Gold 6242 |
| --- | --- | --- | --- |
| Spec: cores/threads | 10C/20T, 1 node | 16C/32T, 2 nodes | 16C/32T, 1 node |
| threads = physical cores | 892 GFLOPS (10T) | 784 GFLOPS (32T) | 1156 GFLOPS |
| peak performance | 1056 GFLOPS | 819 GFLOPS | 1280 GFLOPS |
| threads = logical cores | same as 10T | same as 32T | 1938 GFLOPS |
| peak performance | - | - | 2560 GFLOPS ? |
I put a dash in the last "peak performance" row because it is confusing to me right now how the peak performance should be calculated. Using the Intel Xeon Gold 6242 as an example, its turbo-boost frequency with AVX-512 active on all 16 cores is 2.5 GHz. Therefore, its 16-core peak performance should be:
FLOP/cycle/core: 512 (bit) / 64 (bit/double) * 2 (FLOP/FMA) * 2 (FMA units/core) = 32
2.5 GHz * 32 FLOP/cycle/core * 16 cores = 1280 GFLOPS.
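As a sanity check, the arithmetic can be scripted; the 2-FMA-units/core figure below is specific to the dual-FMA Xeon parts such as the Gold 6242:

```shell
# Peak DP FLOP/s for an AVX-512 part with 2 FMA units per core (e.g. Gold 6242).
# 512/64 = 8 doubles per vector; each FMA counts as 2 FLOPs (multiply + add).
flops_per_cycle=$(( 512 / 64 * 2 * 2 ))                        # = 32 FLOP/cycle/core
awk -v f="$flops_per_cycle" 'BEGIN{printf "%.0f GFLOPS\n", 2.5*f*16}'   # 1280 GFLOPS
```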
However, when using 32 threads, the DGEMM performance increases further to 1938 GFLOPS, exceeding the theoretical peak calculated from the number of physical cores.
So the first question is: how do we count the theoretical peak performance of a platform supporting Intel hyperthreading? Do we use the # of physical cores or the # of logical cores?
That leads to the second question: why does hyperthreading (apparently) not benefit MKL DGEMM performance on my W-2255 and Silver 4216?
Thank you very much for your time!
Are you sure that the Gold 6242 system only has one processor installed?
The peak performance for AVX-512 code on a single Gold 6242 processor will depend on the average frequency that the processor can sustain. The 6242 has a "base" AVX-512 frequency of 1.9 GHz and a maximum all-core AVX-512 frequency of 2.5 GHz, giving 972.8 GFLOPS to 1280 GFLOPS as the expected range of peak performance.
For large enough (square) matrices, DGEMM performance typically asymptotes to about 90% of the peak performance (based on the actual frequency sustained during the run). This leads to sustained performance estimates in the range of 875 GFLOPS to 1152 GFLOPS. In my experience, the frequency of Xeon Scalable Processors running DGEMM is pretty close to the maximum. (Running HPL on the other hand, usually brings the frequency down closer to the "base" AVX-512 frequency. Both benchmarks spend almost all their time in the DGEMM kernel, but there are differences in the amount of data motion that may account for the difference in power consumption. Still have not gotten around to exploring this one....)
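The ~90%-of-peak estimate over the Gold 6242's AVX-512 frequency range works out as (sketch, using the 32 FLOP/cycle/core and 16 cores from above):

```shell
# 90% of peak at the base (1.9 GHz) and max all-core (2.5 GHz) AVX-512 frequencies
awk 'BEGIN{printf "%.1f to %.1f GFLOPS\n", 0.9*1.9*32*16, 0.9*2.5*32*16}'
# -> 875.5 to 1152.0 GFLOPS
```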
For the Silver 4216, the peak performance is computed differently because there is only one AVX-512 FMA unit. For AVX-512 mode, the peak for a single socket is 16 cores * 16 FLOPS/cycle * 1.6 GHz = 409.6 GFLOPS (based on the maximum all-core AVX512 frequency of 1.6 GHz), or as low as 281.6 GFLOPS at the "base" AVX512 frequency of 1.1 GHz. (With only one AVX512 FMA unit, frequency is unlikely to drop much below 1.6 GHz). You may actually get better performance on the "single-AVX512-FMA" processors by running AVX256 code. The peak for AVX256 code is also 16 FLOPS/cycle (2 AVX256 FMAs vs 1 AVX512 FMA), and the maximum all-core frequency is higher -- 2.3 GHz on the Silver 4216.
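For comparison, a quick sketch of the Silver 4216 single-socket numbers; with one AVX-512 FMA unit per core, both the AVX-512 path and the dual-FMA AVX2 path give 16 DP FLOP/cycle/core, so only the sustainable frequency differs:

```shell
# Silver 4216: 16 cores, 16 DP FLOP/cycle/core on either code path
awk 'BEGIN{printf "AVX-512 @ 1.6 GHz: %.1f GFLOPS\n", 16*16*1.6
           printf "AVX2    @ 2.3 GHz: %.1f GFLOPS\n", 16*16*2.3}'
```

The AVX2 figure (588.8 GFLOPS) is derived here from the quoted 2.3 GHz all-core frequency; it illustrates why the 256-bit path can win on single-FMA parts.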
There is a magic environment variable that can be used to force MKL to run a different code path. I have not tried this recently, but MKL_DEBUG_CPU_TYPE=5 forced execution using AVX2/256-bit instructions the last time I tested this.....
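A sketch of how that would be set; note the variable is undocumented and was removed from newer MKL releases, so treat it as best-effort, and the benchmark binary name here is hypothetical:

```shell
export MKL_NUM_THREADS=16
export MKL_DEBUG_CPU_TYPE=5    # historically forced the AVX2/256-bit code path in MKL
# ./dgemm_benchmark            # hypothetical benchmark binary
```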
What are the environment variables (pertinent to OpenMP and MKL)?
You list MKL_NUM_THREADS but do not list, e.g.:
OMP_NUM_THREADS
KMP_AFFINITY / KMP_HW_SUBSET
OMP_PROC_BIND / OMP_PLACES
(any others if applicable)
On W-2255 try (for 1t/core)
`KMP_HW_SUBSET=10c,1t`
and for 2t/core
`KMP_HW_SUBSET=10c,2t`
For Silver, change the "10c" to "16c" for both tests.
At times, the environment variables will conflict with one another. IOW use only one method for pinning.
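A minimal sketch of such a pinning experiment, assuming the Intel OpenMP runtime (KMP_HW_SUBSET selects cores and threads-per-core; values below assume the 10-core W-2255, and the benchmark binary name is hypothetical):

```shell
# Run 1: one OpenMP thread per physical core
export MKL_NUM_THREADS=10
export KMP_HW_SUBSET=10c,1t    # 10 cores, 1 thread each
# ./dgemm_benchmark

# Run 2: two threads per core (use both hyperthreads)
export MKL_NUM_THREADS=20
export KMP_HW_SUBSET=10c,2t    # 10 cores, 2 threads each
# ./dgemm_benchmark
```

Use only KMP_HW_SUBSET (not KMP_AFFINITY or OMP_PLACES at the same time) so the pinning methods do not conflict, per the note above.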