MKL Performance Improvement Suggestion

jimdempseyatthecove · ‎01-02-2021

On a Windows system with multiple NUMA nodes and a large number of cores, it is not unusual for an MKL function to examine the argument dimensions and then select a reduced set of logical processors, with the intent of getting optimal performance for that call.

My observations seem to indicate that when MKL chooses a subset of the available affinities (those of the process, or the calling thread's restricted set), it takes the first N logical processors from the available affinities rather than picking N according to the hardware topology of the available threads.

For example, take a KNL 7210 configured with 4 NUMA nodes (each one a processor group on Windows), HT enabled at 4 t/c. Each NUMA node has 64 HW threads, 16 cores and 8 L2s. The optimal pick order within a node (assuming the calling thread is pinned to all HW threads of that node) would be:

0,8,16,24,32,40,48,56 (1st thread of 1st core of each L2) then
4,12,20,28,36,44,52,60 (1st thread 2nd core of each L2) then
1,9,17,25,33,41,49,57 (2nd thread of 1st core of each L2) then
...
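
The pick order above can be generated mechanically. The following is a minimal Python sketch, not anything taken from MKL. It assumes the numbering implied by the list (HW thread id = core * 4 + thread within core, two adjacent cores per L2), and it assumes the pattern continues the same way past the three rows shown. The function name and parameters are hypothetical.

```python
def topology_pick_order(cores=16, threads_per_core=4, cores_per_l2=2):
    """Return HW thread ids of one node in a topology-aware pick order.

    Assumed numbering: hw_id = core * threads_per_core + thread,
    with cores_per_l2 adjacent cores sharing one L2.
    """
    l2_count = cores // cores_per_l2
    order = []
    for thread in range(threads_per_core):        # 1st HT of every core before 2nd HT
        for core_in_l2 in range(cores_per_l2):    # 1st core of every L2 before 2nd core
            for l2 in range(l2_count):            # spread across all L2s first
                core = l2 * cores_per_l2 + core_in_l2
                order.append(core * threads_per_core + thread)
    return order


if __name__ == "__main__":
    order = topology_pick_order()
    # Print the order in rows of 8 (one row per pass over the 8 L2s).
    for row in range(0, len(order), 8):
        print(",".join(str(hw) for hw in order[row:row + 8]))
```

The first three rows printed match the three rows listed above; the remaining rows are only what this assumed pattern produces.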

This observation was made on a system with a KNL 7210:

OMP: Info #155: KMP_AFFINITY: Initial OS proc set not respected: 64-127
OMP: Info #214: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #156: KMP_AFFINITY: 256 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #285: KMP_AFFINITY: topology layer "LL cache" is equivalent to "core".
OMP: Info #285: KMP_AFFINITY: topology layer "L1 cache" is equivalent to "core".
OMP: Info #191: KMP_AFFINITY: 1 socket x 64 cores/socket x 4 threads/core (64 total cores)

The test made repeated calls to the MKL function HEEVR using a complex double array dimensioned (500,500), with 1 calling thread pinned to the 64 HW threads of a node.

MKL is likely choosing fewer than 64 threads to perform these computations.

My suspicion is that a subset of the 64 logical processors is taken from the first N available processors of the calling thread.

Note, the test program reads OMP_PLACES and then pins the respective OpenMP thread to its place (only 1 place in the data below). The environment variables are listed first, followed by the run-time affinities within the processor group:

SET OMP_NUM_THREADS=
SET KMP_HW_SUBSET=
SET OMP_PPROC_BIND=
SET KMP_AFFINITY=
SET OMP_PLACES={0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60}
OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
1tPlaceNode(0), 8.3100

SET OMP_NUM_THREADS=
SET KMP_HW_SUBSET=
SET OMP_PPROC_BIND=
SET KMP_AFFINITY=
SET OMP_PLACES={0:2,4:2,8:2,12:2,16:2,20:2,24:2,28:2,32:2,36:2,40:2,44:2,48:2,52:2,56:2,60:2}
OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 1 4 5 8 9 12 13 16 17 20 21 24 25 28 29 32 33 36 37 40 41 44 45 48 49 52 53 56 57 60 61
2tPlaceNode(0), 11.8300

SET OMP_NUM_THREADS=
SET KMP_HW_SUBSET=
SET OMP_PPROC_BIND=
SET KMP_AFFINITY=
SET OMP_PLACES={0:3,4:3,8:3,12:3,16:3,20:3,24:3,28:3,32:3,36:3,40:3,44:3,48:3,52:3,56:3,60:3}
OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 1 2 4 5 6 8 9 10 12 13 14 16 17 18 20 21 22 24 25 26 28 29 30 32 33 34 36 37 38 40 41 42 44 45 46 48 49 50 52 53 54 56 57 58 60 61 62
3tPlaceNode(0), 16.1700

SET OMP_NUM_THREADS=
SET KMP_HW_SUBSET=
SET OMP_PPROC_BIND=
SET KMP_AFFINITY=
SET OMP_PLACES={0:64}
OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
4tPlaceNode(0), 19.6800
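
To make the suspected effect concrete, here is a small Python model. It is only a sketch under stated assumptions: the subset size N = 16 is hypothetical (the thread count MKL actually picks is not known), the HW thread numbering is the same as above (4 HW threads per core, 2 cores per L2), and neither function reflects MKL's real code. It compares how many distinct cores and L2s a "first N" pick touches against a topology-aware pick from the same available set.

```python
THREADS_PER_CORE = 4   # KNL 7210, 4 t/c
CORES_PER_L2 = 2       # two cores share one L2


def first_n(available, n):
    """Suspected behavior: take the first n ids of the available affinities."""
    return sorted(available)[:n]


def topology_aware(available, n):
    """Alternative: spread over L2s, then cores, before adding HTs per core."""
    def key(hw):
        core, thread = divmod(hw, THREADS_PER_CORE)
        l2, core_in_l2 = divmod(core, CORES_PER_L2)
        return (thread, core_in_l2, l2)
    return sorted(available, key=key)[:n]


def footprint(selection):
    """Return (distinct cores, distinct L2s) touched by a selection."""
    cores = {hw // THREADS_PER_CORE for hw in selection}
    l2s = {hw // (THREADS_PER_CORE * CORES_PER_L2) for hw in selection}
    return len(cores), len(l2s)


if __name__ == "__main__":
    available = range(64)   # the {0:64} place from the last run
    n = 16                  # hypothetical subset size
    print("first N:       ", footprint(first_n(available, n)))
    print("topology aware:", footprint(topology_aware(available, n)))
```

With the {0:64} place and N = 16, the first-N pick lands on 4 cores behind 2 L2s, while the topology-aware pick lands on 16 cores behind 8 L2s. With the 1 t/c place ({0,4,...,60}) both picks are identical, since that place already contains only one HW thread per core.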

NOTE

The slowdown is NOT a case of HyperThreading being slower than no HyperThreading; rather, it is a case of poor thread selection in MKL when it subsets the threads for the function (given the size of the problem). To wit:

The bottom labels are the number of cores used; cores are spread across NUMA nodes first, then within a node. In nearly all cases using 2, 3 or 4 HT/core was faster. It is undetermined why 4 t/c, and to a lesser extent 3 t/c, is so jagged. Lack of knowledge of what is going on inside MKL hinders further investigation.

Jim Dempsey

jimdempseyatthecove · ‎01-02-2021

Here is a chart of percentage improvement across cores for 1, 2, 3 & 4 t/c

In this test, using 2 HTs/core gives a 20% to 30% boost in performance. YMMV

Jim Dempsey

RahulV_intel · ‎01-06-2021

Hi,

Thanks for providing your suggestions. We are forwarding this query to the concerned team.

Regards,

Rahul

jimdempseyatthecove · ‎01-07-2021

There is a better diagnosis of the issue here.

It seems that, on Windows, MKL is using the calling process affinity mask instead of the calling thread affinity mask.
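
For anyone wanting to check the two masks on their own system, the following Python/ctypes sketch queries both through the Win32 calls GetProcessAffinityMask and GetThreadGroupAffinity. It has not been run against the KNL system discussed here, the helper names are hypothetical, and it says nothing about what MKL does internally; it only shows how the process mask and the calling thread's mask can differ. On non-Windows platforms the query function returns None.

```python
import ctypes
import sys


def mask_to_cpus(mask):
    """Expand an affinity bitmask into a list of logical processor indices."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


class GROUP_AFFINITY(ctypes.Structure):
    # Mirrors the Win32 GROUP_AFFINITY layout (KAFFINITY, WORD, WORD[3]).
    _fields_ = [("Mask", ctypes.c_size_t),
                ("Group", ctypes.c_ushort),
                ("Reserved", ctypes.c_ushort * 3)]


def query_windows_masks():
    """Return process and calling-thread affinity info, or None off Windows."""
    if sys.platform != "win32":
        return None
    k32 = ctypes.WinDLL("kernel32", use_last_error=True)
    k32.GetCurrentProcess.restype = ctypes.c_void_p
    k32.GetCurrentThread.restype = ctypes.c_void_p
    k32.GetProcessAffinityMask.argtypes = [
        ctypes.c_void_p,
        ctypes.POINTER(ctypes.c_size_t),
        ctypes.POINTER(ctypes.c_size_t)]
    k32.GetProcessAffinityMask.restype = ctypes.c_int
    k32.GetThreadGroupAffinity.argtypes = [
        ctypes.c_void_p, ctypes.POINTER(GROUP_AFFINITY)]
    k32.GetThreadGroupAffinity.restype = ctypes.c_int

    process_mask = ctypes.c_size_t()
    system_mask = ctypes.c_size_t()
    # Note: this call may report zero masks when the process has threads
    # in more than one processor group.
    if not k32.GetProcessAffinityMask(k32.GetCurrentProcess(),
                                      ctypes.byref(process_mask),
                                      ctypes.byref(system_mask)):
        raise ctypes.WinError(ctypes.get_last_error())

    thread_affinity = GROUP_AFFINITY()
    if not k32.GetThreadGroupAffinity(k32.GetCurrentThread(),
                                      ctypes.byref(thread_affinity)):
        raise ctypes.WinError(ctypes.get_last_error())

    return {"process_cpus": mask_to_cpus(process_mask.value),
            "thread_group": thread_affinity.Group,
            "thread_cpus": mask_to_cpus(thread_affinity.Mask)}


if __name__ == "__main__":
    print(query_windows_masks())
```

If the calling thread has been pinned to a subset (as the test program does from OMP_PLACES), thread_cpus should be narrower than process_cpus, and a library that consults only the process mask would not see that restriction.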

Jim Dempsey

Khang_N_Intel · ‎05-24-2021

Hi Jim,

We no longer have access to KNL systems to validate your finding.

I apologize for the inconvenience.

Best,

Khang

MRajesh_intel · ‎06-24-2021

Hi,

Can you please let us know the oneAPI version used?

MRajesh_intel · ‎06-28-2021

Hi,

We are closing this thread as we no longer support KNL machines. Please visit the system requirements for further information. If you require any additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Link: https://software.intel.com/content/www/us/en/develop/articles/oneapi-math-kernel-library-system-requirements.html

Have a good day.

Regards

Rajesh