Software Archive
Read-only legacy content

OpenMP 4 nested parallelism thread placement

Michael_B_17
Beginner

I have a question about the environment variables needed to get thread placement working properly on the Phi with nested parallelism. I've looked around the Intel website and various other sources, and talked to some people at Intel, but I can't quite get it working.

Essentially, I want to run my code using only 3 threads per core (assuming 60 cores), because testing has shown that this gives the best performance. Without nested parallelism I could get 3 threads per core with these variables:

export KMP_AFFINITY=compact
export KMP_PLACE_THREADS=60c,3t
export OMP_NUM_THREADS=180

Since then I've made changes to my code to improve data locality, and I want to try nested parallelism to help further. I still want 3 threads per core at the 'lowest' level, and to spread across the cores at the 'top' level. I can successfully do this on a dual-socket, 8-cores-per-socket CPU setup using OpenMP 4.0 like this:

export OMP_NUM_THREADS=2,8
export KMP_HOT_TEAMS_MAX_LEVEL=2
export OMP_PROC_BIND=spread,close
export OMP_NESTED=TRUE
export OMP_PLACES=cores (or threads)

which places 2 threads on cores 0 and 8 at the top level (outer parallel region), then fills in cores 1-7 and 9-15 at the second level (inner parallel region). Somebody advised me about the KMP_HOT_TEAMS_MAX_LEVEL=2 setting, which seems to improve performance a lot (KMP_HOT_TEAMS_MODE didn't make a difference). This gives me the same or slightly better performance than before.
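For reference, the structure those settings are driving in my code is essentially just two nested parallel regions, something like this sketch (the work is hypothetical):

! Sketch: outer region spread across the machine, inner region packed
! close to its parent; thread counts come from OMP_NUM_THREADS=2,8.
program nested_sketch
  use omp_lib
  implicit none
  integer :: outer, inner
  !$omp parallel private(outer, inner)      ! outer level: 2 threads, 'spread'
  outer = omp_get_thread_num()
  !$omp parallel private(inner)             ! inner level: 8 threads, 'close'
  inner = omp_get_thread_num()
  ! ... work on the block owned by (outer, inner) ...
  !$omp end parallel
  !$omp end parallel
end program nested_sketch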

If I try to use something similar on the Xeon Phi, the performance is abysmal:

export OMP_NUM_THREADS=60,3
export KMP_HOT_TEAMS_MAX_LEVEL=2
export OMP_PROC_BIND=spread,close
export OMP_NESTED=TRUE
export OMP_PLACES=threads\(240\)

(About the OMP_PLACES setting: if I just use OMP_PLACES=threads then it uses the OS core as well and performance is obviously even worse. I looked at https://software.intel.com/en-us/node/510366 and it suggested using 'threads(x)' to limit the number of hardware threads used, but parentheses are special characters in bash and this is inside a script, so I need to escape them or quote the value. Why isn't this specified as 'threads,240' like the other OpenMP variables...?)
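For the record, either of these forms gets the literal parentheses past bash:

export OMP_PLACES='threads(240)'
export OMP_PLACES=threads\(240\)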

I've tried messing around with the 'cores' placements as well, but I can't figure out the optimal settings. I've tried running it through VTune, but nested parallelism on the Xeon Phi seems to generate an enormous number of profiling events: it fills up the maximum data limit of 500 MB in about 2 seconds and provides no information about the actual computation in my code.

Looking at the output of KMP_AFFINITY=verbose is also confusing, because it mixes the KMP_AFFINITY messages with the OMP_PROC_BIND messages, which sometimes contradict each other about which OS procs the OpenMP threads are bound to. From a quick analysis of this output it does seem to be binding only one OpenMP thread per hardware thread, but I don't know whether it's binding them in the right places.

Ideally I would also like to be able to set OMP_NUM_THREADS to '30,3' and have it use only half the device, but when I do that it doesn't seem to pay attention to the 'close' thread placement: the threads at the lowest level of parallelism 'spill out' onto other cores, again making performance worse and destroying any data locality.
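One thing I'm considering is fencing off half the card with an explicit place list instead (assuming the OS core is logical proc 0 and the first 30 cores map to logical procs 1-120, which may not hold on every card):

export OMP_PLACES='{1}:120'
export OMP_NUM_THREADS=30,3

but I don't know whether that's the intended way to do it.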

The only examples I can find of people using this are 'toy' examples on a CPU. Has anyone got anything like this working properly on a Xeon Phi?

jimdempseyatthecove
Honored Contributor III

Enter the word Phalanx into the search box near the top of this page. You should find a series of blog articles titled: "The Chronicles of Phi - part n - The Hyper-Thread Phalanx ..."

This illustrates an example of core teaming. That technique did not use nested OpenMP directives. Rather, it created a single parallel region, then divided the work by core, then again by HT within the core. This can be done either implicitly, via coordination of environment variables and code, or explicitly, by using CPUID to obtain the core and thread associations (leaving you free to use "compact", "scatter", etc. to best suit the remainder of the program).
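In rough outline, the implicit variant looks something like this (a sketch, assuming KMP_AFFINITY=compact with 3 threads per core so that consecutive OpenMP thread numbers share a core; the names are made up):

! Sketch: with compact affinity and nHTsPerCore threads per core,
! consecutive OpenMP thread numbers land on the same core, so the
! core/HT association can be derived arithmetically.
subroutine team_work(nHTsPerCore)
  use omp_lib
  implicit none
  integer, intent(in) :: nHTsPerCore
  integer :: myCore, myHT
  !$omp parallel private(myCore, myHT)
  myCore = omp_get_thread_num() / nHTsPerCore      ! which core team I belong to
  myHT   = mod(omp_get_thread_num(), nHTsPerCore)  ! which HT sibling within that team
  ! ... divide the work first by myCore, then by myHT within the core ...
  !$omp end parallel
end subroutine team_work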

Jim Dempsey

Michael_B_17
Beginner

I actually have the 'High Performance Parallelism Pearls' book with me here, and I just gave that chapter another read. That behaviour is what I'm trying to emulate with the nested parallelism: a group of threads on each core can work 'independently' of the other cores, which in particular removes the need for a global synchronisation across the whole device at the implicit barrier at the end of a parallel region. As mentioned in the book, this has the advantages of better data locality and a better cache hit rate, which I have also complemented with some tiling.

There are a few problems on my end, though, mainly that my code is in Fortran, so the parts of the plesiochronous barrier code that rely on things like _mm_pause() are unavailable to me. The structure of the code is also very different: with the tiling and some MPI communication, I can't stay inside the 'inner' parallel regions at all times.

OpenMP 4.0 nested parallelism should make this doable through environment variables instead, and the core placement does seem to be correct when I look at it with KMP_AFFINITY=verbose; I just don't know why the performance is so much worse than expected.

For example, at the outer parallel region:

OMP: Info #242: KMP_AFFINITY: pid 87137 thread 0 bound to OS proc set {1}
OMP: Info #242: OMP_PROC_BIND: pid 87137 thread 1 bound to OS proc set {5}
OMP: Info #242: OMP_PROC_BIND: pid 87137 thread 2 bound to OS proc set {9}
OMP: Info #242: OMP_PROC_BIND: pid 87137 thread 3 bound to OS proc set {13}
OMP: Info #242: OMP_PROC_BIND: pid 87137 thread 4 bound to OS proc set {17}
OMP: Info #242: OMP_PROC_BIND: pid 87137 thread 5 bound to OS proc set {21}
...

Then at the inner level:

...
OMP: Info #242: OMP_PROC_BIND: pid 87137 thread 74 bound to OS proc set {6}
OMP: Info #242: OMP_PROC_BIND: pid 87137 thread 75 bound to OS proc set {7}
OMP: Info #242: OMP_PROC_BIND: pid 87137 thread 76 bound to OS proc set {46}
OMP: Info #242: OMP_PROC_BIND: pid 87137 thread 77 bound to OS proc set {47}
...

(The threads aren't in order, because I think the OpenMP spec says that the assignment of threads to places is implementation defined? If I manually specify the places using OMP_PLACES='{0:1}:240:1' it behaves the same anyway.)

Each core gets assigned 3 threads (e.g., threads 1, 74, and 75 are on the second core), which should all be working on the same tile and therefore on data that is adjacent in memory. Yet actually using 3 (or 2, or 4) threads per core makes it go a lot slower. That would suggest it's maybe not placing threads on the correct cores, but checking with omp_get_thread_num() and omp_get_ancestor_thread_num(), all the threads in a given inner parallel region are definitely working on the same tile.
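A more direct check would be to ask the OS which logical CPU each inner thread is actually running on, e.g. via sched_getcpu() from glibc (a sketch; the interface binding here is my own, and it relies on the same OMP_NUM_THREADS=60,3 environment as above):

! Sketch: report each inner-region thread's outer/inner IDs and the logical
! CPU it is currently running on. sched_getcpu() comes from glibc and is
! bound via ISO_C_BINDING.
program where_am_i
  use omp_lib
  use iso_c_binding
  implicit none
  interface
    function sched_getcpu() bind(c, name='sched_getcpu')
      import :: c_int
      integer(c_int) :: sched_getcpu
    end function sched_getcpu
  end interface
  integer :: outer_id, inner_id
  !$omp parallel private(outer_id, inner_id)
  outer_id = omp_get_thread_num()
  !$omp parallel private(inner_id)
  inner_id = omp_get_thread_num()
  !$omp critical
  print '(a,i3,a,i2,a,i4)', 'outer ', outer_id, ' inner ', inner_id, &
        ' on logical CPU ', sched_getcpu()
  !$omp end critical
  !$omp end parallel
  !$omp end parallel
end program where_am_i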

As this approach works fine on the CPU, I'm wondering if this is a bug in the OpenMP runtime or if I just need to set my environment variables differently to get the desired behaviour.

jimdempseyatthecove
Honored Contributor III

The _mm_pause and KNC _mm_delay_32/_mm_delay_64 intrinsics can be wrapped in C and called from Fortran. That is not an issue.

The CPUID and CPUIDEX C code can also be called. Therefore the C code that uses CPUID to obtain the core number and HT sibling numbers, and stores them into (Fortran) thread-local storage, can be constructed as well. Thus you only need to affinity-pin 1, 2, 3, or 4 HTs per core, then let the code figure out the associations.

You should be able to rework the HyperThreadPhalanx.c (sources are referenced in the book) into a HyperThreadPhalanx.f90 routine that calls C primitives to perform the things you cannot do directly from Fortran (__cpuid and __cpuidEX, and your new function calling WAIT_A_BIT).
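On the Fortran side the glue is just a handful of ISO_C_BINDING interfaces, something like this sketch (the wrapper names c_delay32 and c_get_core_ht are placeholders for small C functions you would write around _mm_delay_32 and the CPUID lookup, not part of any library):

module phalanx_c_bindings
  use iso_c_binding
  implicit none
  interface
    ! wraps the KNC _mm_delay_32 intrinsic in C
    subroutine c_delay32(clocks) bind(c, name='c_delay32')
      import :: c_int32_t
      integer(c_int32_t), value :: clocks
    end subroutine c_delay32
    ! wraps the CPUID-based core/HT lookup in C
    subroutine c_get_core_ht(core, ht) bind(c, name='c_get_core_ht')
      import :: c_int
      integer(c_int), intent(out) :: core, ht
    end subroutine c_get_core_ht
  end interface
end module phalanx_c_bindings

Each thread would call c_get_core_ht() once at start-up and stash the result in threadprivate variables; that takes the place of the thread-local storage in HyperThreadPhalanx.c.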

This then reduces your problem to:

!$OMP PARALLEL
! All threads start here
CALL YourDoTileWork(YourTLScoreNumber, YourTLSHTNumber, YourNcores, YourNhtsPerCore)
!$OMP END PARALLEL

Then you would partition the workspace into tiles, the number of which depends on the number of cores and/or the per-core cache size and/or the LLC cache size. Then sub-divide each core's tile based on the number of HTs per core.
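In outline, the tile selection inside YourDoTileWork might look something like this (a sketch with made-up names and a hypothetical problem size; the real split depends on your data layout):

! Sketch: pick this thread's slice of the iteration space, first by core,
! then by HT within the core.
subroutine YourDoTileWork(myCore, myHT, nCores, nHTsPerCore)
  implicit none
  integer, intent(in) :: myCore, myHT, nCores, nHTsPerCore
  integer, parameter  :: nRows = 4800                  ! hypothetical problem size
  integer :: rowsPerCore, rowsPerHT, coreEnd, rowBegin, rowEnd, i

  rowsPerCore = (nRows + nCores - 1) / nCores          ! rows owned by my core team
  rowsPerHT   = (rowsPerCore + nHTsPerCore - 1) / nHTsPerCore
  coreEnd     = min((myCore + 1)*rowsPerCore, nRows)   ! last row owned by my core team
  rowBegin    = myCore*rowsPerCore + myHT*rowsPerHT + 1
  rowEnd      = min(rowBegin + rowsPerHT - 1, coreEnd)

  do i = rowBegin, rowEnd
    ! ... work on row i; the HTs of one core team share that core's L1/L2 ...
  end do
end subroutine YourDoTileWork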

You should be able to call MPI from within the parallel region, with HT 0 of the core team making the call while the other threads of the team wait using SLEEPQQ or OpenMP lock routines. Keep in mind that you can tree-structure your locks (HTs within a core, cores within a socket, ...) so that if (when) you want only the master thread of the parallel region to make the MPI call, you can coordinate the barrier yourself (assuming !$OMP BARRIER is inadequate).
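For the simpler variant where only the master thread talks to MPI, the shape of it is roughly this (a sketch, assuming MPI was initialized with at least MPI_THREAD_FUNNELED and that this is called from inside the single-level parallel region; buffer, neighbour, and routine names are placeholders):

! Sketch: the master thread does the exchange while the rest of the team
! waits at the barriers; called from inside the parallel region.
subroutine exchange_halo(sendbuf, recvbuf, n, neighbour, comm)
  use mpi
  implicit none
  integer, intent(in)  :: n, neighbour, comm
  real(8), intent(in)  :: sendbuf(n)
  real(8), intent(out) :: recvbuf(n)
  integer :: ierr, stat(MPI_STATUS_SIZE)

  !$omp barrier          ! every thread's tile data is written before we send
  !$omp master
  call MPI_Sendrecv(sendbuf, n, MPI_DOUBLE_PRECISION, neighbour, 0, &
                    recvbuf, n, MPI_DOUBLE_PRECISION, neighbour, 0, &
                    comm, stat, ierr)
  !$omp end master
  !$omp barrier          ! no thread reads the halo until it has arrived
end subroutine exchange_halo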

 
