Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

OMP Join Wait issue



I have a problem with threads waiting on the "OMP Join Barrier" and I can't find an answer anywhere. My code spends most of its time waiting on that barrier. The simplified code is:

[cpp]#pragma omp parallel default(none) shared(mcomp, bestVecFinal, bestSumFinal, cerr)
{
  // initialize structures

#pragma omp for collapse(2) schedule(dynamic)
  for (uint32_t ifirst = 0; ifirst < getHeight(); ifirst++) {
    for (uint32_t isecond = 0; isecond < getHeight(); isecond++) {
      if (ifirst > isecond) continue; // hack to allow collapse, because normally isecond should start at ifirst
      // ... do the work ...
    } // end for isecond
  } // end for ifirst

#pragma omp critical
  {
    // select best result from the ones computed by each thread
  } // end critical
} // end parallel

VTune Amplifier shows that most of the time (~75%) is spent waiting on the OMP Join Barrier at the end of the execution. I tried the loops with and without collapse and with different schedule clauses, and the result is the same. Any clues? How do I identify the bottleneck?

I have this issue on MTL, where it spawns 80 threads. Here is a screenshot from VTune showing the issue (unfortunately I have no "Locks and Waits" screenshot right now). There is a period of intensive computation, followed by idle threads joining.


I noticed that even if I comment out all the code and leave only #pragma omp parallel {/* empty, no loops */} (just create threads and exit), I get large wait times:

Average Concurrency: 1.195
Elapsed Time: 1.849
CPU Time: 10.000
Wait Time: 68.586

If the problem is slow joining of the threads, is there any way to overcome this?

2 Replies
Black Belt
Try something along the lines of
[cpp]uint32_t h = getHeight();
uint32_t nWorkItems = (h * h + h) / 2; // triangular count of (ifirst, isecond) pairs
#pragma omp for
for (uint32_t i = 0; i < nWorkItems; ++i) {
   // recover (ifirst, isecond) from the linear index i;
   // row ifirst contains h - ifirst work items
   uint32_t ifirst = 0;
   uint32_t isecond = i;
   for (uint32_t j = h; j > 0; --j) {
      if (isecond < j) break;
      isecond -= j;
      ++ifirst;
   }
   isecond += ifirst;
   // ...your process code here
}

I will leave it as an exercise for you to replace the (ifirst, isecond) computation loop with a closed-form expression; a table look-up might be fastest.
Please report your Average Concurrency, Elapsed, CPU and Wait times.

Jim Dempsey
Black Belt
I should mention that the source of your "problem" was that your iteration space [0:h*h) was nearly half occupied by null work items. The parallel for (with or without collapse and/or dynamic scheduling) assumes each iteration does approximately the same work. The revised code in my prior post has the characteristic that each iteration does approximately the same work (except for a small amount of work in the index-recovery loop that grows with i, which can be eliminated with a table look-up or a formula that may be faster).

Jim Dempsey