Solved: Number of Processes under OpenMP

Vishnu · ‎08-14-2016

I have the following environment variable set in my machine with an i7-4790K (4 cores, 8 threads):

export OMP_NUM_THREADS=4

When I run the program compiled using ifort, a total of 5 processes are spawned, one 'master' & four 'slaves' with master at 400% CPU, 3 slaves at 100% CPU each and one of the slaves at 0% (CPU & time both 0). This is as seen below.

On the other hand, when I use gfortran, it only spawns 4 processes: 1 'master' and 3 'slaves', with each slave at 100% CPU and the master at 400%.

Does anyone know why?

Also, a slightly related question; I've heard that for well written (properly parallelized) code, utilizing hyper-threads is pointless and sometimes detrimental. This is why I set it to spawn 4 processes. But does it matter which thread of a core is used? i.e., is one thread of a core the main one and the other auxiliary?

So should I set the KMP_AFFINITY or GOMP_CPU_AFFINITY environment variables?

TimP · ‎08-14-2016

With current Intel compilers

export OMP_PLACES=cores

export OMP_NUM_THREADS=4

will place 1 thread on each of your 4 cores, in case your scheduler isn't smart enough to do that automatically. Perhaps an up to date gfortran/libgomp for linux also will observe OMP_PLACES=cores. Each thread is allowed to shift between the 2 hyperthreads of a core; normally the overhead for shifting is negligible, and, if you are so unlucky as to have some other task occasionally taking 1 hyperthread, the linux scheduler is expected to give you the other thread context.

I assume you are seeing a difference according to which OpenMP library you have linked (-liomp5 or -lgomp implied by your compiler choice), but I don't know why they would appear different. I guess you are running a more modern version of top than I am familiar with for linux.

You can link a gfortran build with -liomp5 (with both gfortran and ifort paths set) in case you wish to separate the influence of compiler and OpenMP library.

If, like most Fortran applications, your performance is determined by full usage of floating point resources, you can expect 1 thread per core to give peak performance, aside from situations such as heavy use of divide/sqrt which may stall a thread, or possible frequent cache misses. There should be some documentation of why the MKL library chooses to run 1 thread per core unless you over-ride its defaults.

ifort will observe GOMP_CPU_AFFINITY if KMP_AFFINITY et al. aren't set, so you can experiment with choices of which exact hyperthreads you use. I wouldn't expect this to be worth much effort for any CPU since the Intel Westmere.
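
As a minimal sketch of the settings above, the variables can be combined with `OMP_DISPLAY_ENV`, which asks an OpenMP 4.0 runtime to print the values it actually uses at startup. `OMP_PROC_BIND=close` is an extra assumption added here, not part of the advice above, and `./a.out` is only the program name used in this thread.

```shell
# Sketch: one OpenMP thread per physical core, with the runtime reporting its settings.
export OMP_NUM_THREADS=4      # one thread per core on a 4-core CPU
export OMP_PLACES=cores       # each place is a whole core (both hyperthreads)
export OMP_PROC_BIND=close    # assumption: bind threads to consecutive places
export OMP_DISPLAY_ENV=true   # OpenMP 4.0: runtime prints its settings at startup

# Run the program from this thread only if it exists in the current directory.
if [ -x ./a.out ]; then
    ./a.out
fi
```

Whether a given gfortran/libgomp honours these depends on its version, so the printed settings are worth checking rather than assuming.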

jimdempseyatthecove · ‎08-14-2016

The effect of Hyper-Threading is application dependent. Some applications benefit, others do not.

You have a terminology issue. Your multi-threaded application is a single process (note, this word is not to be confused with processor), which uses OpenMP, which (in your case) instantiates a thread pool of 4 threads (the original main thread + 3 additional threads). OpenMP also creates a non-thread pool monitoring thread that consumes little time.

The first a.out line in the screenshot is the process, showing the time accumulated over all of its threads. The 4 indented a.out lines are the 4 additional threads of the process (3 of which are the non-main thread pool threads). The busy additional threads are using 100%, 99%, and 99% of a logical processor each. The main thread's figure includes its own logical processor time plus the logical processor time of the other thread pool threads.
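
One way to see the process/thread distinction on Linux, sketched under the assumption that `/proc` is available: each thread of a process appears as a directory under `/proc/<pid>/task`, while the process keeps a single PID. The helper name `count_threads` is made up for this illustration.

```shell
# Count the threads of one process on Linux: each thread has a directory
# under /proc/<pid>/task, while the process itself has a single PID.
count_threads() {
    ls "/proc/$1/task" | wc -l
}

# Demonstrated on the current shell; substitute the PID of the OpenMP program.
echo "PID $$ has $(count_threads $$) thread(s)"

# For a running program, per-thread CPU usage can be listed with procps:
#   ps -L -o pid,lwp,pcpu,comm -p <pid>
```

For the ifort build described above, this count would be expected to include the monitoring thread in addition to the thread pool.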

KMP_AFFINITY can help, it is easy enough to experiment. The chart appears to indicate that 1 thread per core was scheduled. It is likely that in this case (for this program) KMP_AFFINITY would do neither good nor harm.

You will have to test your application (wall clock time) when other programs are in use. In those cases KMP_AFFINITY may be more beneficial.

Jim Dempsey

Vishnu · ‎08-14-2016

jimdempseyatthecove wrote:

OpenMP also creates a non-thread pool monitoring thread that consumes little time.

I see, I didn't know that. So I guess gfortran doesn't do that. Thanks!

jimdempseyatthecove wrote:

You will have to test your application (wall clock time) when other programs are in use. In those cases KMP_AFFINITY may be more beneficial.

Oh, so you're saying that, say, if I were using the machine that I am running this program on for other things, then it may end up putting two processes on the same core, thereby leading to sub-optimal run-times for the program. Is that it? Hmmm... this is my usage case, and so I suppose setting that value will help. Thanks again!

Tim P. wrote:

export OMP_PLACES=cores

Ah! Yes, I suppose this is what I want. This seems to be a part of the openmp 4.0 spec, and gcc 4.9 supports it. Thanks!

Tim P. wrote:

I assume you are seeing a difference according to which OpenMP library you have linked (-liomp5 or -lgomp implied by your compiler choice), but I don't know why they would appear different. I guess you are running a more modern version of top than I am familiar with for linux.

You can link a gfortran build with -liomp5 (with both gfortran and ifort paths set) in case you wish to separate the influence of compiler and OpenMP library.

I think you may be right. I will try it out with libraries swapped. I am using htop, not top. It is easier to read off of. Try it.

Tim P. wrote:

If, like most Fortran applications, your performance is determined by full usage of floating point resources, you can expect 1 thread per core to give peak performance, aside from situations such as heavy use of divide/sqrt which may stall a thread, or possible frequent cache misses. There should be some documentation of why the MKL library chooses to run 1 thread per core unless you over-ride its defaults.

Then I suppose using one thread per core is appropriate for me. There is very little division and square-rooting (three instances of division and one or none of SQRT). I will experiment when I have some more time. Thanks a lot!

jimdempseyatthecove · ‎08-18-2016

>>Oh, so you're saying that, say, if I were using the machine that I am running this program on for other things, then it may end up putting two processes on the same core, thereby leading to sub-optimal run-times for the program. Is that it?

No. With affinity pinning, the O/S is still free to schedule another program's thread on the same hardware thread as (one of) your thread(s), or on a different hardware thread of the same core, or on any other core/HT. However, a well behaved thread scheduler will tend to place other threads elsewhere. Note, in the case of your application using all hardware threads there is no "elsewhere".

The purpose of affinity pinning is that the O/S will not relocate your thread to another core. A relocation leaves the data the thread had loaded behind in the old core's L1/L2, so it is no longer in your thread's cache and takes longer to refetch (from L3 or RAM). If a different process (program), or even a different thread of your process, preempts the pinned thread, then the pinned thread is suspended, and most likely the L1 and L2 will get completely reused by the preempting thread.
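
A rough way to check whether pinning took effect, assuming Linux with `/proc`: every thread's `status` file has a `Cpus_allowed_list` line naming the logical CPUs it may run on. The helper name `thread_affinity` is made up for this sketch.

```shell
# Print "<thread id> <allowed CPU list>" for every thread of a process.
thread_affinity() {
    for t in /proc/"$1"/task/*; do
        printf '%s %s\n' "${t##*/}" \
            "$(awk '/^Cpus_allowed_list/ {print $2}' "$t/status")"
    done
}

# Demonstrated on the current shell; substitute the PID of the OpenMP program.
thread_affinity $$
```

An unpinned thread typically lists every logical CPU, while a thread bound to a core or hardware thread lists only those it was bound to.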

Jim Dempsey

Vishnu · ‎08-18-2016

jimdempseyatthecove wrote:

The purpose of affinity pinning is that the O/S will not relocate your thread to another core. A relocation leaves the data the thread had loaded behind in the old core's L1/L2, so it is no longer in your thread's cache and takes longer to refetch (from L3 or RAM). If a different process (program), or even a different thread of your process, preempts the pinned thread, then the pinned thread is suspended, and most likely the L1 and L2 will get completely reused by the preempting thread.

Okay, so L1/L2 caches are local to a core, and the only ways for another process to reuse info on it would be to preempt a running thread, or access the other thread in the core. Is that right? In my application, each instance needs a different data set, so I suppose it will have to fetch from L3 or RAM anyway.

jimdempseyatthecove · ‎08-19-2016

>>and the only ways for another process to reuse info on it would be to preempt a running thread, or access the other thread in the core. Is that right?

An additional way is for some other thread within your application to perform a write to the same location (thus evicting your copy from your core's L1/L2).

.OR.

If your thread, a sibling thread, or another thread within your process (or possibly in another process) writes to a cache line whose address hash matches the address hash of a line in your L1/L2, it can force an eviction from your L1/L2 cache. Note, this is an unnecessary eviction from your viewpoint, since the addresses differ, but it is necessary for the hardware: the cache is set associative and selects a set by a hash of the address, and the hashes collide. This is as opposed to a content addressable memory, where the key is the original physical address.
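
The collision can be illustrated with a little arithmetic. This is a sketch under stated assumptions, not a description of the actual hardware hash: it assumes a 32 KiB, 8-way L1 data cache with 64-byte lines (so 64 sets) and the simplest possible mapping, where the set is the line number modulo the number of sets.

```shell
LINE_BYTES=64   # assumed cache-line size in bytes
NUM_SETS=64     # assumed: 32 KiB / (8 ways * 64 bytes per line)

# Set index under the assumed mapping: line number modulo number of sets.
set_index() {
    echo $(( ($1 / $LINE_BYTES) % $NUM_SETS ))
}

set_index 0x1000   # prints 0
set_index 0x2000   # prints 0: different address, same set as 0x1000
set_index 0x1040   # prints 1: the neighbouring line lands in the next set
```

Under these assumptions, addresses 4096 bytes apart compete for the same set, and once more than 8 such lines are in use one of them has to be evicted even though the rest of the cache may have room.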

Jim Dempsey