On a Windows system with multiple NUMA nodes and a large number of cores, it is not unusual for an MKL function to examine the argument dimensions and then select a reduced set of logical processors for (intended) optimal performance of the function.
My observations seem to indicate that when MKL chooses a subset of the available affinities (as constrained to the process/calling thread), it selects the first N of the available affinities, as opposed to selecting based on the hardware topology of the available threads.
For example, consider a KNL 7210 configured with 4 NUMA nodes (each one a processor group on Windows), HT enabled, 4 threads/core; each NUMA node has 64 HW threads, 16 cores, and 8 L2 caches. The optimal pick order within one such node (assuming all of its HW threads are available to the calling thread) would be:
0,8,16,24,32,40,48,56 (1st thread of 1st core of each L2) then
4,12,20,28,36,44,52,60 (1st thread of 2nd core of each L2) then
1,9,17,25,33,41,49,57 (2nd thread of 1st core of each L2) then
...
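A minimal sketch that generates this pick order for the node layout above (8 L2s x 2 cores/L2 x 4 HW threads/core, with OS proc = L2*8 + core*4 + thread); the layout constants are taken from this example, not queried from the system:

```c
#include <stdio.h>

/* Assumed layout of one KNL 7210 NUMA node, as described above:
   OS proc = l2*(CORES_L2*THR_CORE) + core*THR_CORE + thread. */
#define N_L2     8   /* L2 caches per node  */
#define CORES_L2 2   /* cores per L2        */
#define THR_CORE 4   /* HW threads per core */

int main(void)
{
    /* Spread order: consume one HW thread per core across all L2s and
       cores before touching the second HW thread of any core. */
    for (int thread = 0; thread < THR_CORE; ++thread)
        for (int core = 0; core < CORES_L2; ++core) {
            for (int l2 = 0; l2 < N_L2; ++l2)
                printf("%d ", l2 * CORES_L2 * THR_CORE + core * THR_CORE + thread);
            printf("\n");  /* one line per (thread, core) pass, matching the list above */
        }
    return 0;
}
```

The first three lines it prints are exactly the three sequences listed above.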
This observation was made on a system with a KNL 7210:
OMP: Info #155: KMP_AFFINITY: Initial OS proc set not respected: 64-127
OMP: Info #214: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #156: KMP_AFFINITY: 256 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #285: KMP_AFFINITY: topology layer "LL cache" is equivalent to "core".
OMP: Info #285: KMP_AFFINITY: topology layer "L1 cache" is equivalent to "core".
OMP: Info #191: KMP_AFFINITY: 1 socket x 64 cores/socket x 4 threads/core (64 total cores)
The test made repeated calls to the MKL function HEEVR with a complex double array dimensioned (500,500), with one calling thread pinned to the 64 HW threads of a node.
MKL is likely choosing fewer than 64 threads to perform this computation.
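For concreteness, here is a minimal sketch of the kind of call the test repeats, via MKL's LAPACKE interface to ZHEEVR; the jobz/range/uplo choices and the default abstol are assumptions, since the exact arguments of the test are not shown here:

```c
#include <stdlib.h>
#include <mkl.h>   /* MKL's LAPACKE interface and MKL_Complex16 */

#define N 500

int main(void)
{
    /* 500x500 complex double Hermitian matrix; contents omitted here
       (a zero matrix is trivially Hermitian, enough to run the call). */
    MKL_Complex16 *a = calloc((size_t)N * N, sizeof *a);
    MKL_Complex16 *z = calloc((size_t)N * N, sizeof *z);
    double        *w = calloc(N, sizeof *w);
    MKL_INT  *isuppz = calloc(2 * (size_t)N, sizeof *isuppz);
    MKL_INT   m, info;

    /* ZHEEVR: eigenvalues (and here eigenvectors, jobz='V') of a
       Hermitian matrix.  For this problem size, MKL decides internally
       how many worker threads to use. */
    info = LAPACKE_zheevr(LAPACK_COL_MAJOR, 'V', 'A', 'U',
                          N, a, N,
                          0.0, 0.0, 0, 0,  /* vl,vu,il,iu: ignored for range='A' */
                          0.0,             /* abstol: use default tolerance */
                          &m, w, z, N, isuppz);

    free(a); free(z); free(w); free(isuppz);
    return (int)info;
}
```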
Note: the test program reads OMP_PLACES and then affinitizes the respective OpenMP thread to its place (only one place in the data below); a sketch of this pinning step is shown after the run data. My suspicion is that the subset of the 64 logical processors is taken from the first N available processors of the calling thread. For each run, the environment variables are listed, followed by the run-time affinities within the processor group:
SET OMP_NUM_THREADS=
SET KMP_HW_SUBSET=
SET OMP_PROC_BIND=
SET KMP_AFFINITY=
SET OMP_PLACES={0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60}
OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60
1tPlaceNode(0), 8.3100
SET OMP_NUM_THREADS=
SET KMP_HW_SUBSET=
SET OMP_PROC_BIND=
SET KMP_AFFINITY=
SET OMP_PLACES={0:2,4:2,8:2,12:2,16:2,20:2,24:2,28:2,32:2,36:2,40:2,44:2,48:2,52:2,56:2,60:2}
OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 1 4 5 8 9 12 13 16 17 20 21 24 25 28 29 32 33 36 37 40 41 44 45 48 49 52 53 56 57 60 61
2tPlaceNode(0), 11.8300
SET OMP_NUM_THREADS=
SET KMP_HW_SUBSET=
SET OMP_PROC_BIND=
SET KMP_AFFINITY=
SET OMP_PLACES={0:3,4:3,8:3,12:3,16:3,20:3,24:3,28:3,32:3,36:3,40:3,44:3,48:3,52:3,56:3,60:3}
OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 1 2 4 5 6 8 9 10 12 13 14 16 17 18 20 21 22 24 25 26 28 29 30 32 33 34 36 37 38 40 41 42 44 45 46 48 49 50 52 53 54 56 57 58 60 61 62
3tPlaceNode(0), 16.1700
SET OMP_NUM_THREADS=
SET KMP_HW_SUBSET=
SET OMP_PROC_BIND=
SET KMP_AFFINITY=
SET OMP_PLACES={0:64}
OpenMP Thread 0 ProcessorGroup 0 Pinned: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
4tPlaceNode(0), 19.6800
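As referenced above, a minimal sketch of the pinning step, assuming a single processor group and a place list already parsed into an array of logical processor numbers (the helper name pin_to_place is hypothetical; the actual test program's parsing of OMP_PLACES is not shown):

```c
#include <windows.h>
#include <stdio.h>

/* Pin the calling thread to a set of logical processors within one
   Windows processor group, e.g. the OMP_PLACES place it was assigned. */
static BOOL pin_to_place(WORD group, const int *procs, int nprocs)
{
    GROUP_AFFINITY ga = { 0 };
    ga.Group = group;
    for (int i = 0; i < nprocs; ++i)
        ga.Mask |= (KAFFINITY)1 << procs[i];   /* one bit per logical proc */
    return SetThreadGroupAffinity(GetCurrentThread(), &ga, NULL);
}

int main(void)
{
    /* The 16-processor place from the first run above:
       the 1st HW thread of each core of NUMA node 0. */
    int place[] = { 0, 4, 8, 12, 16, 20, 24, 28,
                    32, 36, 40, 44, 48, 52, 56, 60 };
    if (!pin_to_place(0, place, 16))
        fprintf(stderr, "SetThreadGroupAffinity failed: %lu\n", GetLastError());
    /* ... repeated MKL HEEVR calls from this thread follow ... */
    return 0;
}
```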
NOTE
The slowdown is NOT a case of HyperThreading being slower than no HyperThreading; rather, it is a case of poor thread selection in MKL when it subsets the threads for the function (given the size of the problem). To wit:
The bottom labels are the number of cores used; cores are spread across NUMA nodes first, then within a node. In nearly all cases, using 2, 3, or 4 HW threads/core was faster. It is undetermined why the 4 threads/core results, and to a lesser extent the 3 threads/core results, are so jagged. Lack of knowledge of what is going on inside MKL hinders further investigation.
Jim Dempsey
Here is a chart of the percentage improvement across core counts for 1, 2, 3, and 4 threads/core.
In this test, using 2 HW threads/core adds between a 20% and 30% boost in performance. YMMV.
Jim Dempsey
Hi,
Thanks for providing your suggestions. We are forwarding this query to the concerned team.
Regards,
Rahul
There is a better diagnosis of the issue here.
It seems that MKL is using the calling process's affinity mask instead of the calling thread's affinity mask (on Windows).
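A minimal sketch of how the two masks can be compared at run time to check this (GetProcessAffinityMask only reports a plain mask, so a single processor group is assumed here):

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD_PTR procMask, sysMask;
    GROUP_AFFINITY ga;

    /* Process-wide affinity mask (meaningful only while the process
       occupies a single processor group). */
    if (GetProcessAffinityMask(GetCurrentProcess(), &procMask, &sysMask))
        printf("process mask: 0x%llx\n", (unsigned long long)procMask);

    /* The calling thread's group affinity, which is what MKL arguably
       should honor when it subsets worker threads. */
    if (GetThreadGroupAffinity(GetCurrentThread(), &ga))
        printf("thread group %hu mask: 0x%llx\n",
               ga.Group, (unsigned long long)ga.Mask);

    /* If MKL's worker placement tracks the process mask rather than
       the thread mask, that matches the behavior reported above. */
    return 0;
}
```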
Jim Dempsey
Hi Jim,
We no longer have access to KNL systems to validate your finding.
I apologize for the inconvenience.
Best,
Khang
Hi,
Can you please let us know the oneAPI version used?
Hi,
We are closing this thread as we no longer support KNL machines. Please visit the system requirements for further information. If you require any additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.
Have a good day.
Regards,
Rajesh