<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic OMP Join Wait issue in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OMP-Join-Wait-issue/m-p/796931#M474</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I have a problem with waiting on "OMP Join Barrier" and I can't find an answer anywhere. My code spends most of its time waiting on the barrier. The simplified code is:&lt;/P&gt;&lt;P&gt;&lt;PRE&gt;[bash]#pragma omp parallel default(none) shared(mcomp, bestVecFinal, bestSumFinal, cerr)
{
  // initialize structures
  //...

#pragma omp for collapse(2) schedule(dynamic)
  for(uint32_t ifirst = 0; ifirst &amp;lt; getHeight(); ifirst++) {
    for(uint32_t isecond = 0; isecond &amp;lt; getHeight(); isecond++) {
      if (ifirst &amp;gt; isecond) continue; // hack to allow collapse, because normally isecond should start with ifirst value
      // .. do the work...  
    } // end for isecond
  } // end for ifirst
#pragma omp critical
  {
  // select best result from the ones computed by each thread  
  } // end critical
} // end parallel
[/bash]&lt;/PRE&gt; &lt;/P&gt;&lt;P&gt;VTune Amplifier shows that most of the time (~75%) is spent waiting on the OMP Lock Barrier at the end of the execution. I tried the loops with and without collapse and with different schedules, and the result is still the same. Any clues? How can I identify the bottleneck?&lt;/P&gt;&lt;P&gt;I see this issue on MTL, where it spawns 80 threads. &lt;A href="http://imageshack.us/photo/my-images/526/screenshotes.png/"&gt;Here&lt;/A&gt; you can see a screenshot from VTune with the issue (I have no "Lock and Wait" screenshot right now, unfortunately). There is a period of intensive computing, followed by idle thread joining.&lt;/P&gt;&lt;P&gt;Update:&lt;/P&gt;&lt;P&gt;I noticed that even if I comment out all the code and leave only #pragma omp parallel {/*empty, no loops*/} (just create threads and exit), I get large wait times:&lt;/P&gt;&lt;P&gt; Average Concurrency:  1.195&lt;BR /&gt; Elapsed Time:         1.849&lt;BR /&gt; CPU Time:             10.000&lt;BR /&gt; Wait Time:            68.586&lt;BR /&gt; Executing actions 100 % done&lt;/P&gt;&lt;P&gt;If the problem is slow joining of the threads, is there any way to overcome this?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 08 Nov 2011 09:07:50 GMT</pubDate>
    <dc:creator>jmikians</dc:creator>
    <dc:date>2011-11-08T09:07:50Z</dc:date>
    <item>
      <title>OMP Join Wait issue</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OMP-Join-Wait-issue/m-p/796931#M474</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I have a problem with waiting on "OMP Join Barrier" and I can't find an answer anywhere. My code spends most of its time waiting on the barrier. The simplified code is:&lt;/P&gt;&lt;P&gt;&lt;PRE&gt;[bash]#pragma omp parallel default(none) shared(mcomp, bestVecFinal, bestSumFinal, cerr)
{
  // initialize structures
  //...

#pragma omp for collapse(2) schedule(dynamic)
  for(uint32_t ifirst = 0; ifirst &amp;lt; getHeight(); ifirst++) {
    for(uint32_t isecond = 0; isecond &amp;lt; getHeight(); isecond++) {
      if (ifirst &amp;gt; isecond) continue; // hack to allow collapse, because normally isecond should start with ifirst value
      // .. do the work...  
    } // end for isecond
  } // end for ifirst
#pragma omp critical
  {
  // select best result from the ones computed by each thread  
  } // end critical
} // end parallel
[/bash]&lt;/PRE&gt; &lt;/P&gt;&lt;P&gt;VTune Amplifier shows that most of the time (~75%) is spent waiting on the OMP Lock Barrier at the end of the execution. I tried the loops with and without collapse and with different schedules, and the result is still the same. Any clues? How can I identify the bottleneck?&lt;/P&gt;&lt;P&gt;I see this issue on MTL, where it spawns 80 threads. &lt;A href="http://imageshack.us/photo/my-images/526/screenshotes.png/"&gt;Here&lt;/A&gt; you can see a screenshot from VTune with the issue (I have no "Lock and Wait" screenshot right now, unfortunately). There is a period of intensive computing, followed by idle thread joining.&lt;/P&gt;&lt;P&gt;Update:&lt;/P&gt;&lt;P&gt;I noticed that even if I comment out all the code and leave only #pragma omp parallel {/*empty, no loops*/} (just create threads and exit), I get large wait times:&lt;/P&gt;&lt;P&gt; Average Concurrency:  1.195&lt;BR /&gt; Elapsed Time:         1.849&lt;BR /&gt; CPU Time:             10.000&lt;BR /&gt; Wait Time:            68.586&lt;BR /&gt; Executing actions 100 % done&lt;/P&gt;&lt;P&gt;If the problem is slow joining of the threads, is there any way to overcome this?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 08 Nov 2011 09:07:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OMP-Join-Wait-issue/m-p/796931#M474</guid>
      <dc:creator>jmikians</dc:creator>
      <dc:date>2011-11-08T09:07:50Z</dc:date>
    </item>
    <item>
      <title>OMP Join Wait issue</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OMP-Join-Wait-issue/m-p/796932#M475</link>
      <description>Try something along the line of&lt;BR /&gt;&lt;PRE&gt;[cpp]uint32_t h = getHeight();
uint32_t nWorkItems = (h * h + h) / 2;
#pragma omp for
for(uint32_t i = 0; i &amp;lt; nWorkItems; ++ i)
{
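   // Map the linear index i to the pair (ifirst, isecond), ifirst &amp;lt;= isecond.
   uint32_t ifirst = 0;
   uint32_t rem = i; // offset remaining after skipping whole rows
   for(uint32_t rowLen = h; rem &amp;gt;= rowLen; --rowLen)
   {
      rem -= rowLen; // row ifirst holds rowLen = h - ifirst pairs
      ++ifirst;
   }
   uint32_t isecond = ifirst + rem;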
   ...your process code here
}[/cpp]&lt;/PRE&gt;&lt;BR /&gt;&lt;BR /&gt;I will leave it as an exercise for you to remove the ifirst, isecond computation loop.&lt;BR /&gt;A table look-up might be fastest.&lt;BR /&gt;Please report your Average Concurrency, Elapsed, CPU and Wait times.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey</description>
      <pubDate>Wed, 09 Nov 2011 16:58:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OMP-Join-Wait-issue/m-p/796932#M475</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2011-11-09T16:58:58Z</dc:date>
    </item>
    <item>
      <title>OMP Join Wait issue</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OMP-Join-Wait-issue/m-p/796933#M476</link>
      <description>I should mention that your "problem" was that your iteration space [0:h*h) was nearly half occupied by null work items. The parallel for (with or without collapse and/or dynamic scheduling) assumes each iteration has approximately the same work. The revised code in my prior post has the characteristic of each iteration having approximately the same work (except for a small amount of increasing work as i grows, which can be eliminated with a table look-up or a formula that may be faster).&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey</description>
      <pubDate>Wed, 09 Nov 2011 17:05:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OMP-Join-Wait-issue/m-p/796933#M476</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2011-11-09T17:05:05Z</dc:date>
    </item>
  </channel>
</rss>

