Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

OMP Join Wait issue

jmikians
Beginner

Hello,

I have a problem with time spent waiting on the "OMP Join Barrier", and I can't find an answer anywhere. My code spends most of its time waiting on the barrier. The simplified code is:

[cpp]#pragma omp parallel default(none) shared(mcomp, bestVecFinal, bestSumFinal, cerr)
{
  // initialize structures
  //...

#pragma omp for collapse(2) schedule(dynamic)
  for(uint32_t ifirst = 0; ifirst < getHeight(); ifirst++) {
    for(uint32_t isecond = 0; isecond < getHeight(); isecond++) {
      if (ifirst > isecond) continue; // hack to allow collapse, because normally isecond should start with ifirst value
      // .. do the work...
    } // end for isecond
  } // end for ifirst
#pragma omp critical
  {
  // select best result from the ones computed by each thread
  } // end critical
} // end parallel
[/cpp]

VTune Amplifier shows that most of the time (~75%) is spent waiting on the OMP Lock Barrier at the end of execution. I tried the loops with and without collapse and with different schedules, and the result is still the same. Any clues? How can I identify the bottleneck?

I see this issue on MTL, where the program spawns 80 threads. Here you can see a screenshot from VTune showing the issue (unfortunately, I have no "Lock and Wait" screenshot right now). There is a period of intensive computing, followed by a period in which idle threads are joining.

Update:

I noticed that even if I comment out all the code and leave only #pragma omp parallel {/*empty, no loops*/} (just create threads and exit), I still get large wait times:

Average Concurrency: 1.195
Elapsed Time: 1.849
CPU Time: 10.000
Wait Time: 68.586
Executing actions 100 % done

If the problem is slow joining of the threads, is there any way to overcome this?
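
One way to check whether thread creation and joining alone accounts for the wait time is to time an empty parallel region in isolation, outside the profiler. The following is only a minimal sketch, assuming a compiler with OpenMP enabled; the helper name `timeEmptyRegions` and the repetition count are hypothetical and not part of the original code.

```cpp
#include <chrono>

// Hypothetical helper: average wall-clock seconds spent per empty
// parallel region, i.e. roughly the fork/join cost on this machine.
double timeEmptyRegions(int repetitions)
{
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    for (int r = 0; r < repetitions; ++r) {
        #pragma omp parallel
        {
            // intentionally empty: only thread start-up and join are measured
        }
    }
    const std::chrono::duration<double> elapsed = Clock::now() - start;
    return elapsed.count() / repetitions;
}
```

Calling it once with a single repetition and then again with many repetitions may help separate the one-time cost of creating the thread team from the cost of each later fork and join.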

2 Replies
jimdempseyatthecove
Honored Contributor III
Try something along the lines of:
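[cpp]uint32_t h = getHeight();
uint32_t nWorkItems = (h * h + h) / 2;   // number of pairs with ifirst <= isecond
#pragma omp for
for(uint32_t i = 0; i < nWorkItems; ++i)
{
   // Map the flat index i onto the upper triangle:
   // row ifirst holds (h - ifirst) work items.
   uint32_t ifirst = 0;
   uint32_t rem = i;                      // offset left after skipping full rows
   for(uint32_t rowLen = h; rem >= rowLen; --rowLen)
   {
      rem -= rowLen;
      ++ifirst;
   }
   uint32_t isecond = ifirst + rem;
   ...your process code here
}[/cpp]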


I will leave it as an exercise for you to remove the ifirst, isecond computation loop.
A table look-up might be fastest.
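
A table look-up along those lines could be sketched as follows. This is only an illustration under assumptions: the names `IndexPair`, `buildPairTable` and `processAllPairs` are hypothetical, and the loop body merely counts pairs as a stand-in for the real work.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using IndexPair = std::pair<uint32_t, uint32_t>;

// Hypothetical helper: lists every (ifirst, isecond) with
// ifirst <= isecond < h, in the order the flat index i visits them.
std::vector<IndexPair> buildPairTable(uint32_t h)
{
    std::vector<IndexPair> table;
    table.reserve((static_cast<std::size_t>(h) * h + h) / 2);
    for (uint32_t ifirst = 0; ifirst < h; ++ifirst)
        for (uint32_t isecond = ifirst; isecond < h; ++isecond)
            table.emplace_back(ifirst, isecond);
    return table;
}

// Stand-in for the real work: counts the visited pairs.
std::size_t processAllPairs(uint32_t h)
{
    const std::vector<IndexPair> table = buildPairTable(h);
    const long long n = static_cast<long long>(table.size());
    std::size_t visited = 0;
    #pragma omp parallel for reduction(+ : visited)
    for (long long i = 0; i < n; ++i) {
        const uint32_t ifirst = table[i].first;    // one look-up, no inner search loop
        const uint32_t isecond = table[i].second;
        if (ifirst <= isecond) ++visited;          // ...your process code here
    }
    return visited;
}
```

The table costs eight bytes per work item, so for a large height it trades memory for the removed index computation.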
Please report your Average Concurrency, Elapsed, CPU and Wait times.

Jim Dempsey
jimdempseyatthecove
Honored Contributor III
I should mention that your "problem" was that your iteration space [0:h*h) was nearly half occupied by null work items. The parallel for (with or without collapse and/or dynamic scheduling) assumes each iteration has approximately the same work. The revised code in my prior post has the characteristic that each iteration does approximately the same work (except for a small amount of additional work as i grows, which can be eliminated with a table look-up or a formula that may be faster).
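
For the formula route, one possible closed-form mapping counts the flat index from the end of the triangle, where the rows have lengths 1, 2, ..., h, and inverts the triangular number with a square root. This is a sketch under assumptions rather than the method intended in the post: the name `pairFromIndex` is hypothetical, and the two correction loops exist only to guard against floating-point rounding.

```cpp
#include <cmath>
#include <cstdint>
#include <utility>

// Hypothetical helper: maps a flat index i in [0, h*(h+1)/2) to the pair
// (ifirst, isecond) with ifirst <= isecond < h, without a search loop.
std::pair<uint32_t, uint32_t> pairFromIndex(uint64_t i, uint32_t h)
{
    const uint64_t n = (static_cast<uint64_t>(h) * h + h) / 2;
    const uint64_t k = n - 1 - i;   // index counted from the end

    // Largest t with t*(t+1)/2 <= k, first estimated in floating point.
    uint64_t t = static_cast<uint64_t>(
        (std::sqrt(8.0 * static_cast<double>(k) + 1.0) - 1.0) / 2.0);
    while (t * (t + 1) / 2 > k) --t;          // estimate was too high
    while ((t + 1) * (t + 2) / 2 <= k) ++t;   // estimate was too low

    const uint64_t p = k - t * (t + 1) / 2;   // position inside the reversed row
    return { static_cast<uint32_t>(h - 1 - t),
             static_cast<uint32_t>(h - 1 - p) };
}
```

For h = 3 this visits (0,0), (0,1), (0,2), (1,1), (1,2), (2,2) as i runs from 0 to 5, and every iteration does the same amount of index arithmetic regardless of i.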

Jim Dempsey