Hello,
I have a problem with time spent waiting on the "OMP Join Barrier" and I can't find an answer anywhere. My code spends most of its time waiting on the barrier. The simplified code is:
[bash]#pragma omp parallel default(none) shared(mcomp, bestVecFinal, bestSumFinal, cerr)
{
    // initialize structures
    //...
    #pragma omp for collapse(2) schedule(dynamic)
    for(uint32_t ifirst = 0; ifirst < getHeight(); ifirst++) {
        for(uint32_t isecond = 0; isecond < getHeight(); isecond++) {
            if (ifirst > isecond) continue; // hack to allow collapse, because normally isecond should start with ifirst value
            // .. do the work...
        } // end for isecond
    } // end for ifirst
    #pragma omp critical
    {
        // select best result from the ones computed by each thread
    } // end critical
} // end parallel[/bash]
VTune Amplifier shows that most of the time (~75%) is spent waiting on the OMP Lock Barrier at the end of the execution. I tried the loops with and without collapse and with different schedules, and the result is still the same. Any clues? How can I identify the bottleneck?
I have this issue on MTL, where it spawns 80 threads. Here you can see a screenshot from VTune with the issue (I have no "Lock and Wait" screenshot right now, unfortunately). There is a period of intensive computing, followed by a period in which idle threads wait to join.
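One way to check whether the barrier wait comes from load imbalance is to record how long each thread is actually busy inside the loop. The following is only a minimal sketch under stated assumptions: `doWork`, `BusyReport` and `measureBusySeconds` are hypothetical names, `doWork` is a stand-in for the real per-pair computation, and `h` stands in for the value of `getHeight()`. The `_OPENMP` guard lets it build with or without OpenMP enabled.

```cpp
#include <chrono>
#include <cstdint>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for the real per-pair work; cost grows with the distance.
inline uint64_t doWork(uint32_t ifirst, uint32_t isecond)
{
    uint64_t acc = 0;
    for (uint32_t k = 0; k < (isecond - ifirst + 1) * 1000u; ++k)
        acc += k ^ ifirst;
    return acc;
}

struct BusyReport {
    std::vector<double> seconds; // busy time per thread, indexed by thread number
    uint64_t checksum;           // sum of all results, so the work is not optimized away
};

BusyReport measureBusySeconds(uint32_t h)
{
    using Clock = std::chrono::steady_clock;
#ifdef _OPENMP
    const int maxThreads = omp_get_max_threads();
#else
    const int maxThreads = 1;
#endif
    BusyReport report{std::vector<double>(maxThreads, 0.0), 0};
    uint64_t checksum = 0;

    #pragma omp parallel reduction(+ : checksum)
    {
#ifdef _OPENMP
        const int tid = omp_get_thread_num();
#else
        const int tid = 0;
#endif
        double local = 0.0; // private accumulator for this thread

        #pragma omp for collapse(2) schedule(dynamic) nowait
        for (uint32_t ifirst = 0; ifirst < h; ++ifirst) {
            for (uint32_t isecond = 0; isecond < h; ++isecond) {
                if (ifirst > isecond) continue; // same skip as in the original loop
                const auto t0 = Clock::now();
                checksum += doWork(ifirst, isecond);
                local += std::chrono::duration<double>(Clock::now() - t0).count();
            }
        }
        report.seconds[tid] = local; // one write per thread, after its share is done
    }
    report.checksum = checksum;
    return report;
}
```

If a few entries of `seconds` are much larger than the rest, the other threads are probably idling at the join barrier while those few finish. Timing every item adds overhead of its own, so the numbers are best read as relative, not absolute.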
Update:
I noticed that even if I comment out all the code and leave only #pragma omp parallel {/*empty, no loops*/} (just create threads and exit), I get large wait times:
Average Concurrency: 1.195
Elapsed Time: 1.849
CPU Time: 10.000
Wait Time: 68.586
Executing actions 100 % done
If the problem is slow joining of the threads, is there any way to overcome this?
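To see whether entering and leaving a parallel region is itself expensive in wall-clock terms, the fork/join cost can be timed directly. This is a rough sketch, not a rigorous benchmark: `ForkJoinTiming` and `averageForkJoinSeconds` are hypothetical names, and each region only counts its threads so that it is not entirely empty. A profiler's wait time may be summed over all threads, so with many threads it can be far larger than the elapsed time without each individual join being slow.

```cpp
#include <chrono>

struct ForkJoinTiming {
    double secondsPerRegion; // average wall-clock cost of one parallel region
    int teamSize;            // threads that took part in the last region
};

ForkJoinTiming averageForkJoinSeconds(int repetitions)
{
    using Clock = std::chrono::steady_clock;

    // Warm-up region: with common runtimes the first region also pays
    // for creating the worker threads, so it is kept out of the timing.
    int warmup = 0;
    #pragma omp parallel reduction(+ : warmup)
    warmup += 1;

    int teamSize = warmup;
    const auto t0 = Clock::now();
    for (int r = 0; r < repetitions; ++r) {
        int threads = 0;
        // Nearly empty region: each thread only reports that it was here.
        #pragma omp parallel reduction(+ : threads)
        threads += 1;
        teamSize = threads;
    }
    const std::chrono::duration<double> elapsed = Clock::now() - t0;

    return ForkJoinTiming{elapsed.count() / repetitions, teamSize};
}
```

Calling `averageForkJoinSeconds(1000)` and comparing `secondsPerRegion` with the Elapsed Time above gives a first indication of whether thread joining is a real cost or mostly an artifact of how wait time is accounted.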
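[cpp]uint32_t h = getHeight();
uint32_t nWorkItems = (h * h + h) / 2; // number of (ifirst, isecond) pairs with ifirst <= isecond
#pragma omp for
for(uint32_t i = 0; i < nWorkItems; ++i)
{
    // Recover (ifirst, isecond) from the linear index i.
    uint32_t ifirst = 0;
    uint32_t rem = i;    // offset still to consume; unsigned, so test before subtracting
    uint32_t rowLen = h; // row ifirst holds h - ifirst pairs
    while(rem >= rowLen)
    {
        rem -= rowLen;
        --rowLen;
        ++ifirst;
    }
    uint32_t isecond = ifirst + rem;
    // ...your process code here
}[/cpp]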
I will leave it as an exercise for you to remove the ifirst, isecond computation loop. A table look-up might be fastest.
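The table look-up idea could be sketched as follows. This is one possible reading of the suggestion, not necessarily what was intended: `buildPairTable`, `processPair` and `processAllPairs` are hypothetical names, `processPair` is a stand-in for the real work, and the sum reduction merely replaces the "select best result" step for illustration. The loop index is signed so that the sketch also builds with older OpenMP implementations.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Precompute every (ifirst, isecond) pair with ifirst <= isecond < h, in row order.
std::vector<std::pair<uint32_t, uint32_t>> buildPairTable(uint32_t h)
{
    std::vector<std::pair<uint32_t, uint32_t>> table;
    table.reserve((static_cast<std::size_t>(h) * h + h) / 2);
    for (uint32_t ifirst = 0; ifirst < h; ++ifirst)
        for (uint32_t isecond = ifirst; isecond < h; ++isecond)
            table.emplace_back(ifirst, isecond);
    return table;
}

// Hypothetical stand-in for the real per-pair computation.
inline uint64_t processPair(uint32_t ifirst, uint32_t isecond)
{
    return static_cast<uint64_t>(ifirst) * 1000 + isecond;
}

uint64_t processAllPairs(uint32_t h)
{
    const auto table = buildPairTable(h);
    const long long n = static_cast<long long>(table.size());
    uint64_t total = 0;

    // One flat loop over the table: no collapse clause and no skipped iterations.
    #pragma omp parallel for schedule(dynamic) reduction(+ : total)
    for (long long i = 0; i < n; ++i)
        total += processPair(table[i].first, table[i].second);

    return total;
}
```

The table costs memory for (h * h + h) / 2 pairs, but every iteration then maps to real work, so the scheduler no longer hands out iterations that only hit the `continue`.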
Please report your Average Concurrency, Elapsed, CPU and Wait times.
Jim Dempsey