Re: Window+OpenMP+MKL+gt 64 logical processors

jimdempseyatthecove · 12-27-2020

I have created a test suite for testing OpenMP with MKL.

On a Windows system with more than 64 logical processors, Windows divides the available processors into Processor Groups, each containing at most 64 logical processors. The test system is a KNL with 64 cores, 4 hardware threads per core, 4 NUMA nodes, and 4 Windows Processor Groups.
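To make the group arithmetic concrete, here is a minimal sketch of how a logical-processor count divides into groups of at most 64. It assumes an even fill; Windows actually builds groups from the NUMA topology, so real group sizes can differ. `split_into_groups` is a hypothetical helper for illustration, not a Windows API.

```python
MAX_GROUP_SIZE = 64  # Windows limit on logical processors per processor group


def split_into_groups(total_logical: int, max_size: int = MAX_GROUP_SIZE) -> list[int]:
    """Return the size of each group, assuming processors are spread evenly."""
    if total_logical <= 0:
        raise ValueError("total_logical must be positive")
    n_groups = -(-total_logical // max_size)  # ceiling division
    base, extra = divmod(total_logical, n_groups)
    # The first `extra` groups get one additional processor.
    return [base + (1 if i < extra else 0) for i in range(n_groups)]


# The KNL in this post: 64 cores x 4 hardware threads = 256 logical processors.
knl_groups = split_into_groups(64 * 4)
print(knl_groups)
```

For the 256 logical processors here this yields four groups of 64, matching the 4 processor groups reported above.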

OpenMP provides OMP_PLACES, where each place contains one or more logical processors (all of which must be from the same processor group).

Each OpenMP thread is assigned to a place (in other words, it is pinned to one or more logical processors within the group). This works.
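For readers who want to reproduce a similar layout, the sketch below builds an OMP_PLACES string in the standard `{lower:length:stride}` interval form, with one place per processor group. The actual OMP_PLACES value used in the test is not shown in this post, so the numbers (16 processors per place, stride 4, 64 processors per group) are inferred from the pinning log, and whether a given OpenMP runtime on Windows accepts system-wide processor numbers across groups is an assumption. `places_for_groups` is a hypothetical helper.

```python
def places_for_groups(n_groups: int, procs_per_place: int = 16,
                      stride: int = 4, group_size: int = 64) -> str:
    """Build an OMP_PLACES string with one strided place per processor group."""
    places = []
    for g in range(n_groups):
        first = g * group_size  # first system-wide processor number in group g
        places.append(f"{{{first}:{procs_per_place}:{stride}}}")
    return ",".join(places)


two_places = places_for_groups(2)
print(two_places)
```

Each place stays inside a single group because `first + (procs_per_place - 1) * stride` never reaches the next multiple of `group_size` with these values.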

MKL is documented as having mkl_set_num_threads_local, which can be used to set the number of MKL threads, taken from the calling thread's affinity, to be used within MKL.

This feature works when there is 1 Processor Group.

However, when there are multiple processor groups, the MKL threads (appear to) get assigned to the logical processor numbers of group(0) as opposed to those relative to the calling thread's group number.

OnePlacePerNodeByNodes
OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
Array(5000,5000) OpenMP 1 MKL 16
 Building test data
 Starting timed section
Thread    0 calc.time =    13.3000[sec]
Total calc.time =    13.6000[sec] Time/Nthreads =   13.6000[sec]

OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
OpenMP Thread 1 ProcessorGroup 1 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
Array(5000,5000) OpenMP 2 MKL 16
 Building test data
 Starting timed section
Thread    1 calc.time =    49.6900[sec]
Thread    0 calc.time =    49.9500[sec]
Total calc.time =    50.3000[sec] Time/Nthreads =   25.1500[sec]

OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
OpenMP Thread 1 ProcessorGroup 1 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
OpenMP Thread 2 ProcessorGroup 2 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
Array(5000,5000) OpenMP 3 MKL 16
 Building test data
 Starting timed section
Thread    2 calc.time =    49.7300[sec]
Thread    1 calc.time =    49.8600[sec]
Thread    0 calc.time =    50.0000[sec]
Total calc.time =    50.4000[sec] Time/Nthreads =   16.8000[sec]

OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
OpenMP Thread 1 ProcessorGroup 1 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
OpenMP Thread 2 ProcessorGroup 2 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
OpenMP Thread 3 ProcessorGroup 3 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
Array(5000,5000) OpenMP 4 MKL 16
 Building test data
 Starting timed section
Thread    2 calc.time =    49.8500[sec]
Thread    1 calc.time =    49.9800[sec]
Thread    0 calc.time =    50.0700[sec]
Thread    3 calc.time =    50.1900[sec]
Total calc.time =    50.6400[sec] Time/Nthreads =   12.6600[sec]

In the output above, the first section uses 1 processor group (1 MKL instance with 16 threads):

group(0) relative processors: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
group(0) system processors: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60

The second section uses 2 processor groups (2 MKL instances, each with 16 threads):

group(0) relative processors: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
group(0) system processors: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60

group(1) relative processors: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
group(1) system processors: 64 68 72 76 80 84 88 92 96 100 104 108 112 116 120 124
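The relation between the two numberings listed above can be sketched as follows. This assumes every group holds 64 logical processors, so that a system processor number is `group * 64 + relative`; `to_system_processors` is a hypothetical helper, not part of OpenMP, MKL, or the Windows API.

```python
GROUP_SIZE = 64  # assumption: every processor group is full


def to_system_processors(group: int, relative: list[int],
                         group_size: int = GROUP_SIZE) -> list[int]:
    """Convert group-relative processor numbers to system-wide numbers."""
    return [group * group_size + r for r in relative]


# One hardware thread per core in a group: 0, 4, 8, ..., 60
relative = list(range(0, 64, 4))

group0 = to_system_processors(0, relative)
group1 = to_system_processors(1, relative)
print(group0)
print(group1)
```

If a library applied the relative numbers without the group index (the behavior suspected in this post, not confirmed), every instance would resolve to the `group0` list regardless of which group the caller is in.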

My assumption is:

The MKL thread assignment is using ProcessorGroup 0 as opposed to the Processor Group of the calling thread.

If you need a reproducer, I will provide one; however, this problem should be relatively easy to locate.

Jim Dempsey

jimdempseyatthecove · 12-27-2020

Additional comment:

When using 1 OpenMP thread, and setting:

save_nt = mkl_set_num_threads_local(OMP_GET_PLACE_NUM_PROCS(omp_get_thread_num()))

which specifies 16 threads (the 16 processors in the calling thread's place), MKL appears to ignore this and uses all the threads of the processor group of the calling thread (all 64 hardware threads of the 16 cores).

However, when using more than one OpenMP thread (and more than one processor group), it uses the specified number of threads, but unfortunately they are all located in the same processor group.

Jim Dempsey

Steve_Lionel · 12-27-2020

It would be better to take this discussion to the MKL forum, unless you think it is something in Intel's OpenMP implementation.

jimdempseyatthecove · 12-28-2020

Will do.

Jim Dempsey