dgemm performance regression with 2020 Update 2 multi-threaded application on 32 core Intel Xeon

AdiKwatra · ‎10-26-2020

Starting with MKL 2020 Update 2, we are seeing a significant dgemm performance regression in a multi-threaded application on a 32-core Intel Xeon processor. MKL is running in sequential mode (-mkl=sequential) and the Intel libraries are statically linked (-static-intel). The regression is more pronounced when the processor is heavily loaded.

MKL 2020 Update 1 doesn't have this issue.

Attached is sample code that reproduces the issue.

PrasanthD_intel · ‎10-26-2020

Hi Aditya,

We have a dedicated forum for MKL (https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/bd-p/oneapi-math-kernel-library). We are redirecting this query to that forum.

Regards

Prasanth

Gennady_F_Intel · ‎10-27-2020

Did you check the latest version, 2020 u4, which was released last Friday?

What do you mean by significant perf regression?

Are there specific CPU types on which you see this regression?

AdiKwatra · ‎10-27-2020

I can reproduce the performance regression in MKL 2020 Update 4. The last working version was MKL 2020 Update 1.

The attached code runs 10 threads, each making dgemm calls in a loop, and prints the time spent in those calls. Based on that output, the results are as follows:

1. Intel Xeon 32 core: MKL 2020 Update 4 is about 4 times slower than MKL 2020 Update 1

2. Intel Xeon 18 core: MKL 2020 Update 4 is about 1.8 times slower than MKL 2020 Update 1

3. Intel Xeon 4 core: MKL 2020 Update 4 is about 1.2 times slower than MKL 2020 Update 1
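For readers without the attachment, a test of this kind might look roughly like the following. This is a minimal sketch, not the attached code: `small_dgemm` is a hypothetical stand-in for the MKL dgemm call (so the sketch compiles without MKL), and the matrix size and helper names are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define N 8             /* matrix dimension: an assumption, not from the attachment */
#define MAX_THREADS 64  /* upper bound on worker threads in this sketch */

/* Stand-in for the BLAS call: C = A * B for N x N row-major matrices.
 * In a real test this would be MKL's dgemm (e.g. cblas_dgemm from mkl.h). */
static void small_dgemm(const double *a, const double *b, double *c)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i * N + k] * b[k * N + j];
            c[i * N + j] = sum;
        }
}

typedef struct {
    int calls;        /* number of dgemm calls this thread makes */
    double seconds;   /* wall-clock time spent in those calls */
    double checksum;  /* uses the results so the work is not optimized away */
} worker_t;

/* Thread body: times `calls` back-to-back dgemm calls on private matrices. */
static void *worker(void *arg)
{
    worker_t *w = arg;
    double a[N * N], b[N * N], c[N * N];
    for (int i = 0; i < N * N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < w->calls; i++) {
        small_dgemm(a, b, c);
        w->checksum += c[0];
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    w->seconds = (double)(t1.tv_sec - t0.tv_sec)
               + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return NULL;
}

/* Runs `nthreads` identical workers and fills out[0..nthreads-1].
 * Returns 0 on success, -1 on a bad argument or thread creation failure. */
int run_threads(int nthreads, int calls, worker_t *out)
{
    pthread_t tid[MAX_THREADS];
    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;
    for (int t = 0; t < nthreads; t++) {
        out[t].calls = calls;
        out[t].seconds = 0.0;
        out[t].checksum = 0.0;
        if (pthread_create(&tid[t], NULL, worker, &out[t]) != 0) {
            for (int j = 0; j < t; j++)   /* wait for threads already started */
                pthread_join(tid[j], NULL);
            return -1;
        }
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    return 0;
}
```

Calling `run_threads(10, 1000000, results)` from `main` and printing `results[t].seconds` for each thread gives per-thread timings. The comparison only becomes meaningful once `small_dgemm` is replaced by the real dgemm call and the same program is built against each MKL version.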

Let me know if you have any other questions. Thanks!

AdiKwatra · ‎11-06-2020

Hello,

I am following up on this issue. Were you able to reproduce the performance regression, and when can we expect a fix?

Do let me know if you need any assistance from me.

Thank you.

AdiKwatra · ‎11-30-2020

This issue can also be reproduced on an AMD EPYC 64-core processor.

Run the code attached to the original post using MKL 2020 Update 1 and MKL 2020 Update 4 in parallel. The printf statements print the time taken by 1 million dgemm calls by 10 identical threads.

The output shows that MKL 2020 Update 4 is about 9 times slower than MKL 2020 Update 1 on the AMD EPYC 64-core processor.

Do let me know if you have any questions trying to reproduce this issue.

Gennady_F_Intel · ‎12-18-2020

I suggest you submit an official ticket to the Intel Online Service Center.

AdiKwatra · ‎01-07-2021

This problem can also be reproduced with the dgels and dgemm JIT APIs. I have attached updated code, with the Makefile included (QuantTest.tar.gz).

It runs dgemm, dgemm JIT and dgels calls in a loop in 20 identical pthreads. Intel libraries are statically linked (-static-intel) and MKL is used in sequential mode (-mkl=sequential). The code prints the average time taken in 100 of these individual dgemm, dgemm JIT and dgels calls.

There seems to be massive thread contention starting with MKL 2020 Update 2. Based on the times printed as output, the attached code (with 20 threads) runs about 8x slower with MKL 2020 Update 2 than with MKL 2020 Update 1. MKL 2020 Update 4 also has this issue.

Hardware: Lightly loaded (no other user processes) Intel Xeon 32 core machine

Let me know if you have any suggestions for working around this. I will also submit an official ticket to the Intel Online Service Center.

jimdempseyatthecove · ‎01-10-2021

FWIW, in a different thread I posted some observations on using threaded MKL from sequential and OpenMP-parallel processes (one or more processes using CAF (Coarray Fortran) and MPI). The reported case was CAF with a sequential main and threaded MKL.

https://community.intel.com/t5/Intel-Fortran-Compiler/COARRAY-process-pinning-bug/td-p/1244239/jump-to/first-unread-message

My observation in summary:

Although that particular test concerned threaded MKL, I suspect from your reports that the sequential MKL implementation is using the scheme for threaded MKL (with 1 thread) and pinning its single/sequential thread.

It appears that MKL, when pinning threads, selects from the system or process affinities rather than from the calling thread's affinities. This was on a Windows platform; you are on Linux.

I noticed this when I pinned each calling thread to an exclusive subset of the system/process affinities, whether by thread (OpenMP main), by rank (MPI), or by image (CAF).

Because MKL did not use the affinities of the thread calling it, each instance of MKL landed on the same OS procs.

A non-optimal workaround was to .NOT. pin the threads, which resulted in better performance.

IMHO, this can be considered a bug in MKL, or at least a missed optimization opportunity.

Jim Dempsey

AdiKwatra · ‎01-11-2021

My expectation is that MKL sequential should not be spawning new threads (not even one); it should be running in a non-threaded mode. Please correct me if I am wrong.

jimdempseyatthecove · ‎01-11-2021

>>MKL sequential should not be spawning new threads (not even one).

The point I am making w/r/t MKL sequential is that the sequential and threaded versions share the same source code. The supposition is that if KMP_HW_SUBSET, or KMP_AFFINITY=..., or OMP_PLACES, or any of the other environment variables for threading are set, then at least a hardware survey is made .AND. the side effect is that your only thread (of each rank) gets pinned, and the ranks have their main (only) thread pinned to the same OS proc.

Now this is purely a supposition.
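A quick sanity check for the first half of this supposition is to see whether any of those variables is actually set in the environment the test runs in. A minimal sketch in C follows; the helper name is made up for illustration, and the list holds only the three variables named above, so it is not exhaustive.

```c
#include <stdio.h>
#include <stdlib.h>

/* Prints each of the threading-related variables named above that is set
 * in the current environment and returns how many were found. */
int count_threading_env_vars(void)
{
    static const char *const names[] = {
        "KMP_HW_SUBSET", "KMP_AFFINITY", "OMP_PLACES"
    };
    int count = 0;
    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        const char *value = getenv(names[i]);
        if (value != NULL) {
            printf("%s=%s\n", names[i], value);
            count++;
        }
    }
    return count;
}
```

A return value of 0 would not rule out the supposition, since a library may apply default affinity behavior without any variable being set; it only rules out this one trigger.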

IMHO, there is something broken in the affinity code in MKL (and in OpenMP as used by threaded MKL). The supposition is that the threaded code (not conditionalized out) is impacting affinities.

For example, on a system with 4 NUMA nodes and 64 HW threads per node (on Windows, one processor group per node on this system):

KMP_HW_SUBSET=2N@2,2L2
Using 5 threads, this places the threads in a screwy manner:

Processor Group 2, pin 0     expected
Processor Group 2, pin 6     *** right group, wrong pin, should be pin 8
Processor Group 2, pin 13    *** right group, wrong pin, should be pin 16
Processor Group 3, pin 3     *** right group, wrong pin, should be pin 0
Processor Group 3, pin 10    *** right group, wrong pin, should be pin 8

Again, your MKL is not threaded, but it contains process pinning code. All I am saying is that something may be amiss with this section of code (or with that which calls it).
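On Linux, one way to test this supposition would be to print each calling thread's affinity mask just before and just after its first MKL call. If the mask shrinks, or every thread ends up allowed on the same single CPU, something inside the library changed the pinning. A minimal sketch, assuming glibc's `pthread_getaffinity_np` is available (the helper names are hypothetical):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Returns the number of CPUs the calling thread is allowed to run on,
 * or -1 if the affinity mask cannot be read. */
int allowed_cpu_count(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}

/* Prints the calling thread's affinity mask as a list of CPU ids.
 * `label` is free text, e.g. "before dgemm" or "after dgemm". */
void print_affinity(const char *label)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0) {
        fprintf(stderr, "%s: could not read affinity mask\n", label);
        return;
    }
    printf("%s: %d CPU(s):", label, CPU_COUNT(&set));
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            printf(" %d", cpu);
    printf("\n");
}
```

Calling `print_affinity("before")` and `print_affinity("after")` around the first dgemm call in each worker thread would show whether the mask changes. An unchanged mask would point away from pinning as the cause of the slowdown.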

Jim Dempsey
