OpenMP slows down when *repeatedly* switching num_threads

FabioL_ · ‎11-11-2019

Consider (pseudocode):

for (t = 0; t < nsteps; t++)
{
  #pragma omp parallel for num_threads(N0)
  for (i = 0; i < L0; i++)
    ...

  #pragma omp parallel for num_threads(N1)
  for (i = 0; i < L1; i++)
    ...

}

Assume:

Affinity is set with the usual OpenMP standard env variables (OMP_PROC_BIND, OMP_PLACES)
KMP_HOT_TEAMS_MODE=1
KMP_BLOCKTIME sufficiently high
(... also played with other env vars, but w/o luck)

Now, if:

N0 == N1, then nothing weird
N0 != N1, then I get horrible performance degradation (and variations across runs too)

I'm using the intel suite 19.0.0.117 , so a fairly recent one

Reproducer: https://github.com/FabioLuporini/hpc-utils/tree/master/omp-mixed-numthreads -- I'm running this on a KNL

Context (why do I need this). Basically, in my app I have several loop nests, and some of them benefit from nested parallelism (at least on KNL). Other loops instead do not (cannot) use nested parallelism, but I'd want a large thread count for them -- at least larger than the outer OpenMP level in loop nests with nested parallelism

Any idea of what could go wrong here?

Thanks!

-- Fabio

jimdempseyatthecove · ‎11-11-2019

Maybe a basic understanding of what is going on with nested parallel regions will help.

On first occurrence of a parallel region, a thread pool of the specified number of threads is created.

Upon exit of this region, then entry of another (or same) parallel region (at outer most level), should the number of threads specified remain the same or be less than current pool size (for outer most/main thread) level then nothing happens with regards to thread pool of main thread. However, should this parallel region require additional thread(s) then an O/S call will be made to add these threads to the main thread's outer level thread pool. IOW a thread's pool, for a given level, can grow if necessary, but won't shrink until program exit.

Now then, should an outer most level parallel region be entered, which then contains a nested level parallel region, then each thread instantiating a nested parallel region will instantiate or reuse or expand its own thread pool for this nest level in the same manner as described above for the main threads outer level. Should an additional nesting of parallel region occur, then each thread doing so will instantiate or reuse or expand its own thread pool in the same manner as described above for the main threads outer level, ... Note, as an example, if there is one nest level, then the main thread has two thread pools, one for the outer level, and one for this nest level. The additional threads in the outer pool of the main thread each have one thread pool for this first nest level. As each nest level is entered, new pools are created.

Note, each thread pool has a different collection of threads other than for the thread that instantiated the parallel region (that thread is a member of the next higher nest level, or main thread if that be the case).

>>KMP_BLOCKTIME sufficiently high

Should the product of all the worst case nest level threads be less than or equal to the total number of available hardware threads then high block time may be OK. However when the product of all the worst case nest level threads be less than or equal to the total number of available hardware threads then you have a situation of oversubscription of threads (more software threads than available hardware threads), then you would (may) want a KMP_BLOCKTIME=0. The KMP_BLOCKTIME is the approximate amount of time the thread will Spinwait before it suspends itself (should the wait condition not be satisfied).

Jim Dempsey

jimdempseyatthecove · ‎11-11-2019

Disregard my post #2 as your code is not using nested parallel regions.

>>Reproducer...

Your test code is unrealistic. Each thread of each parallel region in your test is writing to a shared array, a single cell of the array (different int location for each thread) but where 16 cells of the array are within a same cache line. IOW each thread in each parallel region is performing virtually no work. You would not reasonably use parallel code to update an array of 32 or 64 integers. The cost of entering and exiting the parallel regions is not insignificant.

Jim Dempsey

jimdempseyatthecove · ‎11-11-2019

If your really want to test an unrealistic program, consider testing:

  gettimeofday(&start, NULL);
  #pragma omp parallel
  for (int n = 0; n < nsteps; n++)
  {
    // no implicit barrier on this loop
    #pragma omp for nowait num_threads(num_threads_fst)
    for (int i = 0; i < num_threads_fst; i++)
    {
      array_fst = num_threads_fst*num_threads_snd;
    }
    // implicit barrier on this loop
    #pragma omp for num_threads(num_threads_snd)
    for (int i = 0; i < num_threads_snd; i++)
    {
      array_snd = num_threads_fst*num_threads_snd;
    }
  }
  gettimeofday(&end, NULL);

Jim Dempsey

FabioL_ · ‎11-12-2019

Hi Jim,

Thanks for the prompt reply.

First of all, I updated the reproducer introducing array padding, such that different threads definitely work on different cache lines.

https://github.com/FabioLuporini/hpc-utils/blob/master/omp-mixed-numthreads/mixed_numthreads.c#L28

Actually, this is what I already had locally -- I had forgotten to push it. Sorry!

Now, about your comment:

IOW each thread in each parallel region is performing virtually no work. You would not reasonably use parallel code to update an array of 32 or 64 integers. The cost of entering and exiting the parallel regions is not insignificant.

Yes, this toy program is merely to quantify the alleged overhead due to switching between different num_threads across consecutive parallel loops. So, on purpose, the compute cost is essentially 0. The key point is: I'm interested in the relative performance of these two runs:

# First run
OMP_PROB_BIND=spread OMP_PLACES=threads KMP_HOT_TEAMS_MODE=1 ./mixed_numthreads 4 4
Now computing 100000 steps
Computed <512> in 0.691266 secs

# Second run
OMP_PROB_BIND=spread OMP_PLACES=threads KMP_HOT_TEAMS_MODE=1 ./mixed_numthreads 4 2
Now computing 100000 steps
Computed <128> in 9.289160 secs

If you look at the command lines, you'll see that the only difference between these two runs is that the second one executes the second parallel for with 2 threads, instead of 4. (any another number other than 4, less or grater than 4, would work as a reproducer).

The performance is an order of magnitude different, so something horribly wrong is going on here.

IOW, while I appreciate that the cost of entering a parallel region isn't zero, what I find difficult to explain here is the performance difference of these two runs.

Finally, your proposed changes to my toy program won't work, as `num_threads` goes with the `pragma omp parallel`, not with the `omp for`

error: num_threads clause is incompatible with directive
        #pragma omp for nowait num_threads(num_threads_fst)

But regardless of this, I'm not sure how would that help with the issue I'm trying to explain here?

Many thanks again for all your explanations.

jimdempseyatthecove · ‎11-12-2019

>>The performance is an order of magnitude different, so something horribly wrong is going on here.

Do you have VTune? If so, it should tell you the cause (with some digging around if you are inexperienced with VTune).

I suspect that cause may be "false sharing" of cache line.

... OR ...

If I were to guess, one possibility is the non-participating threads in the reduced thread count loop yielded (went to sleep), then needed to be awoken upon start of larger thread count loop.

As an experiment, remove the num_threads(...) clause on both loops. In your test case with dynamic scheduling (or, you may want to use static with chunk size of 1), each thread will iterate once, and the thread count for each loop is min(nThreads, loop count). You will have the same number of worker threads. The static schedule with chunk size of 1 will assure the same threads access the same location in the output array.

Jim Dempsey

jimdempseyatthecove · ‎11-12-2019

Also, you said in #1 you used KMP_BLOCKTIME sufficiently high. In post #2 you state

4 4 = 0.691266 seconds
4 2 = 9.28916 seconds

Is your KMP_BLOCKTIME larger than 700?

Jim Dempsey

FabioL_ · ‎11-12-2019

>> I suspect that cause may be "false sharing" of cache line.

I can check with vtune, but there can't be false sharing of cache lines here. The two loops write to different array, and the scheduling of each loop is static. Also, the innermost dimension of the written array is merely there for padding, and it's big enough such that the written locations end up in different cache lines (though the array should fit in a page)

>> If I were to guess, one possibility is the non-participating threads in the reduced thread count loop yielded (went to sleep), then needed to be awoken upon start of larger thread count loop.

That was my guess too initially

>> As an experiment, remove the num_threads(...) clause on both loops. In your test case with dynamic scheduling (or, you may want to use static with chunk size of 1), each thread will iterate once, and the thread count for each loop is min(nThreads, loop count). You will have the same number of worker threads. The static schedule with chunk size of 1 will assure the same threads access the same location in the output array.

I'm not using dynamic scheduling anywhere...default one is static, and it's exactly like static,1 as the size of the loop is equals to the number of threads used for that loop

>> Is your KMP_BLOCKTIME larger than 700?

those runs were with KMP_BLOCKTIME unset, hence I suspect 200 ?

jimdempseyatthecove · ‎11-12-2019

>> but there can't be false sharing of cache lines here. The two loops write to different array

Cache line identifiers within the cache do not hold the full address (/64) of the cache line. Instead, a simplified hash of the address is used. Depending on CPU design and page size selected, false sharing can occur between A and B when:

(A&(-64))%4096 == (B&(-64))%4096

IOW when data is separated by a multiple of 4096 bytes. The 4096 modulus value may change depending on cache design.

See: https://tech.labs.oliverwyman.com/blog/2013/10/08/cpu-cache-collisions-in-the-context-of-performance/

Jim Dempsey

Tamer_Assad · ‎02-06-2020

Hi FabioL,

You were basically measuring OpenMP thread creation and parallel region time.

The number iterations in the inner loops “num_threads_fst” and ” num_threads_snd” assumed to be small, Intel compiler auto vectorization will result in 1 or 2 iterations per loop.

If you want to validate this, try larger counts in the inner loops... large enough to hide the thread creation.

Best regards,

Tamer