MKL Performance Improvement Suggestion

jimdempseyatthecove — Sat, 02 Jan 2021 20:12:48 GMT

On a Windows system with multiple NUMA nodes and a large number of cores, it is not unusual for an MKL function to examine its argument dimensions and then select a reduced set of logical processors, with the intent of getting optimal performance for that call.

My observations seem to indicate that when MKL chooses a subset of the available affinities (those of the process, or the calling thread's constricted set), it takes the first N logical processors from the available affinities, as opposed to choosing N according to the hardware topology of the available threads.

For example, take a KNL 7210 configured with 4 NUMA nodes (each one a processor group on Windows), HT enabled, 4 t/c. Each NUMA node has 64 HW threads, 16 cores, and 8 L2s. The optimal pick order within such a node (assuming the calling thread is pinned to all of the node's HW threads) would be:

0,8,16,24,32,40,48,56 (1st thread of 1st core of each L2) then
4,12,20,28,36,44,52,60 (1st thread of 2nd core of each L2) then
1,9,17,25,33,41,49,57 (2nd thread of 1st core of each L2) then
...
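
The ordering above can be generated mechanically. The following is only a sketch: it assumes logical processors are numbered as (L2 index, core within L2, thread within core) with 4 threads per core and 2 cores per L2, which matches the three rows listed. The continuation past those three rows follows the same pattern and is an assumption, and `pick_order` is a hypothetical helper name, not an MKL or OpenMP API.

```python
def pick_order(l2_count=8, cores_per_l2=2, threads_per_core=4):
    """Return logical processors of one node in a topology-aware pick order.

    Assumed numbering: proc = (l2 * cores_per_l2 + core) * threads_per_core + thread.
    Order: spread across L2s first, then across cores within each L2,
    and only then add further HW threads per core.
    """
    order = []
    for thread in range(threads_per_core):
        for core in range(cores_per_l2):
            for l2 in range(l2_count):
                order.append((l2 * cores_per_l2 + core) * threads_per_core + thread)
    return order


order = pick_order()
# Print the order in rows of 8, one row per (thread, core-within-L2) pair.
for i in range(0, len(order), 8):
    print(",".join(str(p) for p in order[i:i + 8]))
```

Taking the first N entries of this list, rather than the first N entries of the affinity mask, would give each selected thread its own L2 for N up to 8 and its own core for N up to 16.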

This observation was made on a system with a KNL 7210:

OMP: Info #155: KMP_AFFINITY: Initial OS proc set not respected: 64-127
OMP: Info #214: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #156: KMP_AFFINITY: 256 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #285: KMP_AFFINITY: topology layer "LL cache" is equivalent to "core".
OMP: Info #285: KMP_AFFINITY: topology layer "L1 cache" is equivalent to "core".
OMP: Info #191: KMP_AFFINITY: 1 socket x 64 cores/socket x 4 threads/core (64 total cores)

The test made repeated calls to the MKL function HEEVR on a complex double array dimensioned (500,500), with 1 calling thread pinned to the 64 HW threads of a node.

MKL is likely choosing fewer than 64 threads to perform these computations. My suspicion is that the subset of the 64 logical processors is taken from the first N available processors of the calling thread.

Note: the test program reads OMP_PLACES and then pins each OpenMP thread to its place (only 1 place in the data below). For each run, the environment variables are listed, followed by the run-time affinities within the processor group and the result:

SET OMP_NUM_THREADS=
SET KMP_HW_SUBSET=
SET OMP_PPROC_BIND=
SET KMP_AFFINITY=
SET OMP_PLACES={0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60}
OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
1tPlaceNode(0), 8.3100

SET OMP_NUM_THREADS=
SET KMP_HW_SUBSET=
SET OMP_PPROC_BIND=
SET KMP_AFFINITY=
SET OMP_PLACES={0:2,4:2,8:2,12:2,16:2,20:2,24:2,28:2,32:2,36:2,40:2,44:2,48:2,52:2,56:2,60:2}
OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 1 4 5 8 9 12 13 16 17 20 21 24 25 28 29 32 33 36 37 40 41 44 45 48 49 52 53 56 57 60 61
2tPlaceNode(0), 11.8300

SET OMP_NUM_THREADS=
SET KMP_HW_SUBSET=
SET OMP_PPROC_BIND=
SET KMP_AFFINITY=
SET OMP_PLACES={0:3,4:3,8:3,12:3,16:3,20:3,24:3,28:3,32:3,36:3,40:3,44:3,48:3,52:3,56:3,60:3}
OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 1 2 4 5 6 8 9 10 12 13 14 16 17 18 20 21 22 24 25 26 28 29 30 32 33 34 36 37 38 40 41 42 44 45 46 48 49 50 52 53 54 56 57 58 60 61 62
3tPlaceNode(0), 16.1700

SET OMP_NUM_THREADS=
SET KMP_HW_SUBSET=
SET OMP_PPROC_BIND=
SET KMP_AFFINITY=
SET OMP_PLACES={0:64}
OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
4tPlaceNode(0), 19.6800
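
To illustrate the suspicion, the sketch below expands the four places used above and counts how many distinct cores the first N pinned processors cover. It assumes 4 HW threads per core numbered consecutively (core = proc // 4), as in the pinned lists above. N = 16 is a hypothetical value chosen for illustration; the thread count MKL actually selects is not known. `expand_place` and `cores_in_first_n` are hypothetical helper names.

```python
def expand_place(place):
    """Expand one OMP_PLACES place, e.g. '{0:2,4:2}', into OS proc numbers.

    Each item is either a single number or start:length[:stride].
    """
    procs = []
    for item in place.strip().strip("{}").split(","):
        parts = [int(p) for p in item.split(":")]
        start = parts[0]
        length = parts[1] if len(parts) > 1 else 1
        stride = parts[2] if len(parts) > 2 else 1
        procs.extend(start + i * stride for i in range(length))
    return procs


def cores_in_first_n(procs, n, threads_per_core=4):
    """Count distinct cores among the first n processors of a pinned list."""
    return len({p // threads_per_core for p in procs[:n]})


first_threads = range(0, 64, 4)  # first HW thread of each of the 16 cores
places = {
    "1t": "{" + ",".join(str(p) for p in first_threads) + "}",
    "2t": "{" + ",".join(f"{p}:2" for p in first_threads) + "}",
    "3t": "{" + ",".join(f"{p}:3" for p in first_threads) + "}",
    "4t": "{0:64}",
}

N = 16  # hypothetical subset size, not a value reported by MKL
for label, place in places.items():
    procs = expand_place(place)
    print(label, len(procs), "pinned procs; first", N, "cover",
          cores_in_first_n(procs, N), "cores")
```

With this hypothetical N = 16, the first 16 processors cover 16, 8, 6, and 4 distinct cores for the 1t, 2t, 3t, and 4t places respectively. A first-N selection would therefore crowd the same work onto fewer cores as the mask widens, which is consistent with the trend in the figures above.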

NOTE

The slowdown is NOT a case of HyperThreading being slower than no HyperThreading; rather, it is a case of poor thread selection in MKL when it subsets the threads for the function (given the size of the problem). To wit:

In the chart, the bottom labels are the number of cores used; cores are spread across NUMA nodes first, then within a node. In nearly all cases, using 2, 3, or 4 HT/core was faster. It is undetermined why 4 t/c, and to a lesser extent 3 t/c, is so jagged. Lack of knowledge of what is going on inside MKL hinders further investigation.

Jim Dempsey

Re: MKL Performance Improvement Suggestion

jimdempseyatthecove — Sat, 02 Jan 2021 20:33:56 GMT

Here is a chart of percentage improvement across cores for 1, 2, 3, and 4 t/c.

In this test, using 2 HTs/core adds a 20% to 30% boost in performance. YMMV

Jim Dempsey

Re:MKL Performance Improvement Suggestion

RahulV_intel — Wed, 06 Jan 2021 10:23:15 GMT

Hi,

Thanks for providing your suggestions. We are forwarding this query to the concerned team.

Regards,

Rahul

Re: Re:MKL Performance Improvement Suggestion

jimdempseyatthecove — Thu, 07 Jan 2021 20:41:29 GMT

There is a better diagnosis of the issue here.

On Windows, it seems that MKL is using the calling process's affinity mask instead of the calling thread's affinity mask.

Jim Dempsey

Re:MKL Performance Improvement Suggestion

Khang_N_Intel — Tue, 25 May 2021 05:06:32 GMT

Hi Jim,

We no longer have access to KNL systems to validate your finding.

I apologize for the inconvenience.

Best,

Khang

Re:MKL Performance Improvement Suggestion

MRajesh_intel — Thu, 24 Jun 2021 08:33:54 GMT

Hi,

Can you please let us know which oneAPI version you used?

Re:MKL Performance Improvement Suggestion

MRajesh_intel — Tue, 29 Jun 2021 05:48:13 GMT

Hi,

We are closing this thread as we no longer support KNL machines. Please see the system requirements page for further information. If you require any additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Link: https://software.intel.com/content/www/us/en/develop/articles/oneapi-math-kernel-library-system-requirements.html

Have a good day.

Regards

Rajesh

topic Re: MKL Performance Improvement Suggestion in Intel® oneAPI Math Kernel Library