I have created a test suite for testing OpenMP with MKL.
On a Windows system with more than 64 logical processors, Windows divides the available processors into Processor Groups, each containing at most 64 logical processors. The test system is a KNL with 64 cores, 4 hardware threads per core, 4 NUMA nodes, and 4 Windows Processor Groups.
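As a quick check of the arithmetic behind the four groups, here is a minimal sketch in C. The helper names are hypothetical, and it assumes groups are simply filled up to the 64-processor limit; Windows may partition differently depending on the NUMA layout.

```c
#include <assert.h>

/* Windows limits a processor group to 64 logical processors. */
enum { MAX_PROCS_PER_GROUP = 64 };

/* Hypothetical helper: logical processors = cores * hardware threads per core. */
static int logical_processors(int cores, int threads_per_core)
{
    assert(cores > 0 && threads_per_core > 0);
    return cores * threads_per_core;
}

/* Hypothetical helper: minimum number of processor groups needed
   (ceiling division). Assumption: groups are filled to the limit;
   the real split may follow NUMA node boundaries instead. */
static int min_processor_groups(int logical_procs)
{
    assert(logical_procs > 0);
    return (logical_procs + MAX_PROCS_PER_GROUP - 1) / MAX_PROCS_PER_GROUP;
}
```

For the KNL described here, `min_processor_groups(logical_processors(64, 4))` gives 4, matching the four Processor Groups Windows reports.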
OpenMP has OMP_PLACES, where each place can contain one or more logical processors (all of which must be from the same processor group).
Each OpenMP thread is assigned to a place. In other words, it is pinned to one or more logical processors within the group. This works.
MKL documents mkl_set_num_threads_local, which can be used to set the number of MKL threads, within the calling thread's affinity, to be used inside MKL.
This feature works when there is 1 Processor Group.
However, when there are multiple processor groups, the MKL threads (appear to) get assigned to the logical processor numbers of group(0) as opposed to those relative to the calling thread's group number.
OnePlacePerNodeByNodes
OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
Array(5000,5000) OpenMP 1 MKL 16
Building test data
Starting timed section
Thread 0 calc.time = 13.3000[sec]
Total calc.time = 13.6000[sec] Time/Nthreads = 13.6000[sec]
OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
OpenMP Thread 1 ProcessorGroup 1 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
Array(5000,5000) OpenMP 2 MKL 16
Building test data
Starting timed section
Thread 1 calc.time = 49.6900[sec]
Thread 0 calc.time = 49.9500[sec]
Total calc.time = 50.3000[sec] Time/Nthreads = 25.1500[sec]
OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
OpenMP Thread 1 ProcessorGroup 1 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
OpenMP Thread 2 ProcessorGroup 2 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
Array(5000,5000) OpenMP 3 MKL 16
Building test data
Starting timed section
Thread 2 calc.time = 49.7300[sec]
Thread 1 calc.time = 49.8600[sec]
Thread 0 calc.time = 50.0000[sec]
Total calc.time = 50.4000[sec] Time/Nthreads = 16.8000[sec]
OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
OpenMP Thread 1 ProcessorGroup 1 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
OpenMP Thread 2 ProcessorGroup 2 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
OpenMP Thread 3 ProcessorGroup 3 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
Array(5000,5000) OpenMP 4 MKL 16
Building test data
Starting timed section
Thread 2 calc.time = 49.8500[sec]
Thread 1 calc.time = 49.9800[sec]
Thread 0 calc.time = 50.0700[sec]
Thread 3 calc.time = 50.1900[sec]
Total calc.time = 50.6400[sec] Time/Nthreads = 12.6600[sec]
In the above, the first section uses 1 processor group (1 MKL instance with 16 threads):
group(0) relative processors: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
group(0) system processors: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
The second section uses 2 processor groups (2 MKL instances, each with 16 threads):
group(0) relative processors: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
group(0) system processors: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
group(1) relative processors: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
group(1) system processors: 64 68 72 76 80 84 88 92 96 100 104 108 112 116 120 124
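The relationship between the group-relative and system-wide numbers listed above can be sketched in C as follows. This is only an illustration: the helper names are hypothetical, it assumes every group holds exactly 64 processors, and the flat "system processor" number (group * 64 + relative) is the convention used in this post rather than something Windows reports directly.

```c
#include <assert.h>

/* Assumption: every processor group is full (64 logical processors). */
enum { PROCS_PER_GROUP = 64 };

/* Hypothetical helper: flat system-wide number of a group-relative processor. */
static int system_proc(int group, int relative)
{
    assert(group >= 0 && relative >= 0 && relative < PROCS_PER_GROUP);
    return group * PROCS_PER_GROUP + relative;
}

/* Fill out[0..count-1] with the system numbers of every `stride`-th
   processor in `group`, starting at group-relative processor 0. */
static void place_system_procs(int group, int stride, int count, int *out)
{
    for (int i = 0; i < count; ++i)
        out[i] = system_proc(group, i * stride);
}
```

With `group = 1`, `stride = 4` and `count = 16`, the list runs from 64 to 124, which is where the second MKL instance's threads would be expected to run if the calling thread's group were honored.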
My assumption is:
The MKL thread assignment is using ProcessorGroup 0 as opposed to the Processor Group of the calling thread.
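One rough way to quantify the degradation in the logs above is to compare each multi-instance per-thread time with the single-instance time, and to recompute the Time/Nthreads column. The sketch below is plain arithmetic on the reported numbers; the function names are hypothetical, and the expectation that a correctly placed instance should stay near the single-instance time is my assumption.

```c
#include <assert.h>

/* Hypothetical helper: how many times slower a thread ran compared with
   the single-instance baseline (1.0 means no slowdown). */
static double slowdown(double thread_time_sec, double baseline_time_sec)
{
    assert(baseline_time_sec > 0.0);
    return thread_time_sec / baseline_time_sec;
}

/* Hypothetical helper: the Time/Nthreads figure printed in the log,
   i.e. total wall time divided by the number of OpenMP threads. */
static double time_per_thread(double total_time_sec, int nthreads)
{
    assert(nthreads > 0);
    return total_time_sec / nthreads;
}
```

For example, `slowdown(49.95, 13.3)` comes out at roughly 3.7 to 3.8, while `time_per_thread(50.3, 2)` reproduces the 25.15 s shown for the two-thread run.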
If you need a reproducer, I will provide one; however, this problem should be relatively easy to locate.
Jim Dempsey
Additional comment:
When using 1 OpenMP thread, and setting:
save_nt = mkl_set_num_threads_local(OMP_GET_PLACE_NUM_PROCS(omp_get_thread_num()))
which specifies 16 threads (the 16 processors of the calling thread's place), MKL appears to ignore this and uses all the logical processors of the calling thread's processor group (all 64 hardware threads of its 16 cores).
However, when using more than 1 OpenMP thread (and more than one processor group), it uses the specified number of threads, but unfortunately they are all located in the same processor group.
Jim Dempsey
It would be better to take this discussion to the MKL forum, unless you think it is something in Intel's OpenMP implementation.