Thread pinning with OpenMP

velvia · ‎01-13-2017

Hi,

I need to make scaling graphs for an OpenMP application.

My machine is a Dual-Xeon (14 cores per Xeon), with hyper-threading. I would like to place threads using the OpenMP 4 standard, so using OMP_PLACES, OMP_PROC_BIND, OMP_NUM_THREADS.

One of the benchmarks is the following: use 4 threads, where the first two threads are bound to the first core of the first socket and the other two threads are bound to the first core of the second socket. For that, I use:

export OMP_PLACES='{0}, {14}'
export OMP_PROC_BIND=close
export OMP_NUM_THREADS=4

but I am not sure that it does the right job. Bear in mind that I don't want the first and the third thread to be on core 0. I want the first and the second threads to be on this core, because I want to use the first-touch policy and limit the number of array chunks allocated in different NUMA domains.
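For reference, here is a small sketch of how the close policy is described in the OpenMP 4.0 specification, as far as I read it: when there are more threads than places, consecutive thread numbers are grouped onto the same place. The function name `close_binding` is made up for this illustration, the master thread is assumed to start on the first place, and the way leftover threads are spread when the counts do not divide evenly is implementation defined (the sketch simply gives them to the earliest places).

```python
def close_binding(num_threads, num_places, master_place=0):
    """Return the place index of each thread under a 'close'-style policy.

    Assumptions (not taken from the thread): the master thread sits on
    master_place, and uneven remainders go to the earliest places.
    """
    if num_threads <= num_places:
        # One thread per place, starting at the master's place.
        return [(master_place + t) % num_places for t in range(num_threads)]
    base, extra = divmod(num_threads, num_places)
    assignment = []
    for k in range(num_places):
        size = base + (1 if k < extra else 0)
        assignment += [(master_place + k) % num_places] * size
    return assignment


# 4 threads on the 2 places of OMP_PLACES='{0}, {14}'
print(close_binding(4, 2))
# 4 threads on 4 single-proc places
print(close_binding(4, 4))
```

Under these assumptions, threads 0 and 1 share the first place and threads 2 and 3 share the second, which is the grouping asked for. Whether a place number designates a core or a hardware thread is a separate question.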

Could you also confirm that the numbers in OMP_PLACES refer to core numbers, and that different NUMA domains (including on KNL, with the 2 cores on a tile and the quadrants) are grouped in ascending order?

Thanks for your help,

Francois

TimP · ‎01-13-2017

The Intel-specific setting KMP_AFFINITY=verbose should work with the OpenMP standard directives so as to report the interpretation of your affinity settings.

I would think that numeric values in OMP_PLACES would refer to logical processor numbers, and that you would need to give a full list to accomplish your odd placement. I don't know whether your BIOS would number the 2 logical processors on core 0 as 0,1 or as 0,28 (which I think are the most likely possibilities). 0,14 would seem likely for a single CPU platform. There would be various ways of finding out, including using KMP_AFFINITY=verbose along with OMP_PLACES=cores.

velvia · ‎01-13-2017

Thanks for your help. I did not know about KMP_AFFINITY=verbose.

It turns out that:

0 -> Package 0, Core 0, Thread 0
1 -> Package 0, Core 1, Thread 0
...
13 -> Package 0, Core 14, Thread 0
14 -> Package 1, Core 0, Thread 0
...
27 -> Package 1, Core 14, Thread 0
28 -> Package 0, Core 0, Thread 1
29 -> Package 0, Core 1, Thread 1
...

What is strange is that my 14 cores are numbered 0, 1, 2, ..., 5, 6, 8, 9, ..., 14. So there is a skip at core 7 !

So, as far as I understand, there is no convention on the numbering scheme, so there is no portable way of doing things.

Ok,
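One way to avoid hard-coding the numbers is to derive the place list from a topology map. The sketch below assumes a dictionary from OS proc number to (package, core, thread), filled here with a few entries of the kind reported by KMP_AFFINITY=verbose on this machine; the function name `places_for_first_cores` is hypothetical, and how the dictionary gets populated on another machine is left open.

```python
from collections import defaultdict


def places_for_first_cores(topology):
    """Build an OMP_PLACES string with one single-proc place per hardware
    thread of the lowest-numbered core of each package.

    topology: {os_proc: (package, core, thread)}
    """
    by_core = defaultdict(list)
    for proc, (pkg, core, thr) in topology.items():
        by_core[(pkg, core)].append((thr, proc))

    # Lowest core number seen in each package.
    first_core = {}
    for pkg, core in by_core:
        if pkg not in first_core or core < first_core[pkg]:
            first_core[pkg] = core

    places = []
    for pkg in sorted(first_core):
        for _, proc in sorted(by_core[(pkg, first_core[pkg])]):
            places.append("{%d}" % proc)
    return ",".join(places)


# A few entries only; a real map would list every OS proc.
TOPOLOGY = {
    0: (0, 0, 0), 28: (0, 0, 1),
    1: (0, 1, 0), 29: (0, 1, 1),
    14: (1, 0, 0), 42: (1, 0, 1),
}
print(places_for_first_cores(TOPOLOGY))
```

The resulting string can be exported as OMP_PLACES. It only stays portable as long as the topology map is rebuilt on each machine.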

jimdempseyatthecove · ‎01-13-2017

Can you post the complete KMP_AFFINITY=verbose listing?

While the APIC IDs may have a gap (power of 2 grouping), the host logical processor numbers should not.

Jim Dempsey

velvia · ‎01-13-2017

Hi Jim,

Funny that you came by, because I am currently working on one of your codes (3D diffusion from Pearls volume 1).

The configuration is: Dual Xeon E5-2660 v4, CentOS 7.3, Intel Parallel Studio XE 2017

[fayard@gribouille diffusion_3d]$ export OMP_PLACES='{0},{28},{14},{42}'
[fayard@gribouille diffusion_3d]$ export OMP_PROC_BIND=close
[fayard@gribouille diffusion_3d]$ export OMP_NUM_THREADS=4
[fayard@gribouille diffusion_3d]$ KMP_AFFINITY=verbose ./main 
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #156: KMP_AFFINITY: 56 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 14 cores/pkg x 2 threads/core (28 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 28 maps to package 0 core 0 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 29 maps to package 0 core 1 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 30 maps to package 0 core 2 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 31 maps to package 0 core 3 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 32 maps to package 0 core 4 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 5 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 33 maps to package 0 core 5 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 6 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 34 maps to package 0 core 6 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 8 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 35 maps to package 0 core 8 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 9 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 36 maps to package 0 core 9 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 0 core 10 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 37 maps to package 0 core 10 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 11 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 38 maps to package 0 core 11 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 0 core 12 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 39 maps to package 0 core 12 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 0 core 13 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 40 maps to package 0 core 13 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 0 core 14 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 41 maps to package 0 core 14 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 1 core 0 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 42 maps to package 1 core 0 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 1 core 1 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 43 maps to package 1 core 1 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 1 core 2 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 44 maps to package 1 core 2 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 1 core 3 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 45 maps to package 1 core 3 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 1 core 4 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 46 maps to package 1 core 4 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 1 core 5 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 47 maps to package 1 core 5 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 20 maps to package 1 core 6 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 48 maps to package 1 core 6 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 21 maps to package 1 core 8 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 49 maps to package 1 core 8 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 22 maps to package 1 core 9 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 50 maps to package 1 core 9 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 23 maps to package 1 core 10 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 51 maps to package 1 core 10 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 24 maps to package 1 core 11 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 52 maps to package 1 core 11 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 25 maps to package 1 core 12 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 53 maps to package 1 core 12 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 26 maps to package 1 core 13 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 54 maps to package 1 core 13 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 27 maps to package 1 core 14 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 55 maps to package 1 core 14 thread 1 
OMP: Info #242: KMP_AFFINITY: pid 3436 thread 0 bound to OS proc set {0}
OMP: Info #242: KMP_AFFINITY: pid 3436 thread 1 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #242: KMP_AFFINITY: pid 3436 thread 2 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #242: KMP_AFFINITY: pid 3436 thread 3 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #242: OMP_PROC_BIND: pid 3436 thread 1 bound to OS proc set {28}
OMP: Info #242: OMP_PROC_BIND: pid 3436 thread 2 bound to OS proc set {14}
OMP: Info #242: OMP_PROC_BIND: pid 3436 thread 3 bound to OS proc set {42}

Could you please confirm that I am using 2 threads on the first core of the first socket and 2 threads on the first core of the second socket?

By the way, I don't really understand what the following information means.

OMP: Info #242: KMP_AFFINITY: pid 3436 thread 1 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #242: KMP_AFFINITY: pid 3436 thread 2 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #242: KMP_AFFINITY: pid 3436 thread 3 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}

TimP · ‎01-13-2017

This appears to confirm you have successfully pinned threads to core 0 logical 0 and 28.
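The same check can be done mechanically. The sketch below parses an excerpt of the verbose output posted above and reports which (package, core) each OpenMP thread ends up on. It assumes that the last binding line reported for a thread is the effective one; the function name `thread_cores` is made up for this example.

```python
import re

# Excerpt of the KMP_AFFINITY=verbose output posted in this thread.
LOG = """\
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 28 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 1 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 42 maps to package 1 core 0 thread 1
OMP: Info #242: KMP_AFFINITY: pid 3436 thread 0 bound to OS proc set {0}
OMP: Info #242: OMP_PROC_BIND: pid 3436 thread 1 bound to OS proc set {28}
OMP: Info #242: OMP_PROC_BIND: pid 3436 thread 2 bound to OS proc set {14}
OMP: Info #242: OMP_PROC_BIND: pid 3436 thread 3 bound to OS proc set {42}
"""

MAP_RE = re.compile(r"OS proc (\d+) maps to package (\d+) core (\d+) thread (\d+)")
BIND_RE = re.compile(r"pid \d+ thread (\d+) bound to OS proc set \{([\d,]+)\}")


def thread_cores(log):
    """Return {omp_thread: sorted list of (package, core) it may run on}."""
    topo = {}   # OS proc -> (package, core)
    bound = {}  # OpenMP thread -> list of OS procs (last report wins)
    for line in log.splitlines():
        m = MAP_RE.search(line)
        if m:
            proc, pkg, core, _ = map(int, m.groups())
            topo[proc] = (pkg, core)
            continue
        m = BIND_RE.search(line)
        if m:
            bound[int(m.group(1))] = [int(p) for p in m.group(2).split(",")]
    return {t: sorted({topo[p] for p in procs}) for t, procs in sorted(bound.items())}


print(thread_cores(LOG))
```

On this excerpt, threads 0 and 1 land on package 0 core 0, and threads 2 and 3 on package 1 core 0.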

McCalpinJohn · ‎01-13-2017

The core numbers used by the "verbose" option of KMP_AFFINITY are not the same as any processor number directly visible to the operating system, and these core numbers do have gaps on many Intel processors (even in cases where all of the cores are active, such as the 12-core Xeon E5 v3 (Haswell EP)). You can see these gaps in the APIC IDs, which appear to be the source of KMP_AFFINITY's "core number".

Unfortunately the story is probably worse than this....

Volume 2 of the Xeon E5 v3/v4 Datasheet (documents 330784 and 333810, respectively) describes the "CAPID6" register in PCI configuration space that provides a bit map of the enabled cores. (Similarly, the CAPID5 register provides a bit map of the enabled L3 slices.) From Intel's published documents (e.g., the Xeon E5 v4 uncore performance monitoring guide, document 334291, Figure 1-2), it looks like the medium-sized Xeon E5 v4 products are based on a 15-core die. (The three columns of 4 cores from the 12-core Xeon E5 v3 were expanded to 3 columns of 5 cores each in the Xeon E5 v4.) I looked at 22 Xeon E5-2660 v4 14-core parts to see if they have the same patterns of APIC numbers and bit masks in the CAPID5 and CAPID6 registers.

Results:

All of the parts are missing "core 7" in the KMP_AFFINITY output.
This corresponds to APIC IDs of [14,15] and [46,47].
- I get the APIC ID from the EDX field returned by CPUID with an input of 0x0b in EAX.
  - The input value in ECX does not matter in this case -- the EDX field always returns the x2APIC ID.
- Socket 0 numbers have no offset while socket 1 numbers have 32 added.
  - The lowest bit is the thread context.
  - So these four (missing) values would map to:
    - APIC ID 14: Socket 0, Core 7, thread 0
    - APIC ID 15: Socket 0, Core 7, thread 1
    - APIC ID 46: Socket 1, Core 7, thread 0
    - APIC ID 47: Socket 1, Core 7, thread 1
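The decoding described in these bullets can be sketched as follows. The field widths (1 bit for the thread, 4 bits for the core, so that socket 1 starts at 32) are inferred from the numbers in this post, not read from the hardware; a robust tool would take the shift widths from CPUID leaf 0x0B instead. The function name `decode_x2apic` is hypothetical.

```python
def decode_x2apic(apic_id, thread_bits=1, core_bits=4):
    """Split an x2APIC ID into (socket, core, thread).

    The default widths are an assumption that matches this 2-socket
    Xeon E5-2660 v4 system; they are not universal.
    """
    thread = apic_id & ((1 << thread_bits) - 1)
    core = (apic_id >> thread_bits) & ((1 << core_bits) - 1)
    socket = apic_id >> (thread_bits + core_bits)
    return socket, core, thread


# The four APIC IDs that never show up on these parts.
for apic_id in (14, 15, 46, 47):
    print(apic_id, decode_x2apic(apic_id))
```

All four IDs decode to core 7, on socket 0 for 14 and 15 and on socket 1 for 46 and 47.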
Unfortunately the bit mask of enabled cores in CAPID6 shows 7 different patterns across these 22 chips.
- Bit 0: enabled in 21 of 22 processors
- Bits 1-2: enabled in all 22 processors
- Bit 3: enabled in 20 of 22 processors
- Bit 4: disabled in all 22 processors
- Bit 5: enabled in 19 of 22 processors
- Bit 6: enabled in 21 of 22 processors
- Bit 7: disabled in all 22 processors
- Bit 8: enabled in 21 of 22 processors
- Bit 9: enabled in 9 of 22 processors
- Bit 10: enabled in 21 of 22 processors
- Bits 11-15: enabled in all 22 processors
- Bit 16: disabled in all 22 processors
- Bit 17: enabled in all 22 processors
- Bits 18-23: disabled in all 22 processors
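A quick arithmetic check on the counts above: they sum to exactly 14 enabled bits per chip on average, which is at least consistent with every part having exactly one of the seven varying bits cleared (and would fit the 7 observed patterns). That reading is an inference from the totals, not something the per-bit counts prove. The mask built at the end is a hypothetical example, not a value read from a real CAPID6 register.

```python
CHIPS = 22
# Bits enabled on every one of the 22 parts.
ALWAYS_ON = {1, 2, 11, 12, 13, 14, 15, 17}
# Bits that vary: bit -> number of parts (out of 22) with the bit enabled.
ENABLED_COUNT = {0: 21, 3: 20, 5: 19, 6: 21, 8: 21, 9: 9, 10: 21}


def enabled_cores(mask, width=24):
    """List the bit positions that are set in a core-enable mask."""
    return [bit for bit in range(width) if (mask >> bit) & 1]


total_enabled = CHIPS * len(ALWAYS_ON) + sum(ENABLED_COUNT.values())
print(total_enabled / CHIPS)  # average number of enabled bits per chip

# Hypothetical mask: all varying bits set except bit 9.
mask = sum(1 << b for b in ALWAYS_ON | (set(ENABLED_COUNT) - {9}))
print(hex(mask), enabled_cores(mask))
```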

I have found no guidance from Intel on how to interpret the bit masks in CAPID5 and CAPID6 with respect to the layout of the cores and L3 slices on the die (and don't expect to get any), but this exercise suggests that either (1) the x2APIC ID is remapped so that it does not show the position of the physical core(s) that are disabled, or (2) that the CAPID5 and CAPID6 registers are not what they appear to be.

It is also clear (from other exercises) that the L3 slice numbering is remapped to hide the "holes". For most Xeon E5 v3/v4 processors, the MSRs used to access the CBo's use CBo numbers from 0 to N-1 on an N-core system, even if the die has more L3 slices that are disabled. (E.g., 10-core Xeon E5-2660 v3 is built on a 12-core die and exposes CBo's 0 through 9, no matter which CBo's are actually disabled on the chip.)

There is nothing wrong with any of this, except that it makes it extraordinarily difficult to use the RING_*_USED events available in all of the units of the uncore in Xeon E5 processors. I can measure the ring traffic at a box, but I have no idea where that box is located relative to other boxes, and have to use extensive microbenchmarks to try to determine the relative locations of the boxes -- and this "reverse engineering" has to be done independently on every processor.

SergeyKostrov · ‎01-19-2017

KMP_AFFINITY is very simple to use. Here is an example:

KMP_AFFINITY=granularity=fine,proclist=[0,2,4,6],explicit,verbose

However, it is not a portable solution because it depends on Intel's OpenMP runtime libraries. Note that run-time OpenMP thread binding without KMP_AFFINITY is also possible, since you can always get the native thread ID from inside an OpenMP thread. I use both approaches; the second one also allows boosting the priority of OpenMP threads from NORMAL to ABOVE_NORMAL, HIGH or REALTIME, and can be used with any (!) version of the OpenMP API.
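The second approach relies on an OS call rather than on the OpenMP runtime. As a rough illustration only (in Python rather than inside an OpenMP region), the Linux affinity call looks like the sketch below. `pin_calling_thread` is a hypothetical helper name, and it returns None on platforms where `os.sched_setaffinity` is not available. In C on Linux, the corresponding calls would be sched_setaffinity or pthread_setaffinity_np.

```python
import os


def pin_calling_thread(cpu):
    """Pin the caller to a single OS proc and return the new affinity set.

    Returns None where the Linux-style affinity calls are unavailable.
    On Linux, a pid of 0 refers to the calling thread.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


if hasattr(os, "sched_getaffinity"):
    original = os.sched_getaffinity(0)   # remember the starting mask
    target = min(original)               # any proc we are allowed to use
    print("pinned to", pin_calling_thread(target))
    os.sched_setaffinity(0, original)    # restore the starting mask
```

The OS proc numbers passed here are the same logical processor numbers that appear in the KMP_AFFINITY=verbose listing.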