Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

parallel parallel for overhead in OpenMP

mjc
Beginner
3,213 Views

I have written a function that incurs a tremendous amount of overhead in [OpenMP dispatcher], called by [OpenMP fork] on behalf of a particular parallel region of mine, according to VTune. That fork accounts for roughly a third of all CPU time in my program. My code is as follows; my intention is to have the two parfor loops running concurrently.

#pragma omp parallel
{
    #pragma omp for
    for( int y = 0; y < h; y++ )
    {
        // something fairly time consuming
    }
    #pragma omp for
    for( int x = 0; x < w; x++ )
    {
        // something else time consuming
    }
}

I already know that the function is worth parallelizing in this way because my whole program performs much more poorly if I comment out the pragmas or insert a num_threads(1) clause. Also, I have tried changing the code to use two consecutive parallel regions each containing one parfor. That version performs significantly worse than the version shown, as well.

The two time-consuming sections of code take approximately the same amounts of time, within about +/-20%. There are other threads executing at the same time. VTune says that there is no oversubscription and in fact CPU usage is well below the target of 12 for my 6-core hyperthreaded i7. Windows Task Manager reports CPU usage around 85%.

I would appreciate any suggestions about what this fork/dispatcher overhead is and how it can be reduced.

0 Kudos
22 Replies
mjc
Beginner
2,785 Views

I've realized that the two parfors do not execute concurrently unless I include a nowait clause on the first one. My question about the overhead remains the same.
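For reference, this is the variant I mean (a sketch only, with the same h and w as before); the nowait removes the implicit barrier after the first loop, so threads that finish it early can move on to the second loop:

#pragma omp parallel
{
    #pragma omp for nowait
    for( int y = 0; y < h; y++ )
    {
        // first time-consuming loop
    }
    #pragma omp for
    for( int x = 0; x < w; x++ )
    {
        // second time-consuming loop
    }
}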

0 Kudos
TimP
Honored Contributor III
2,785 Views

Did you set KMP_AFFINITY?

Does it run better if you adjust the number of threads and adjust KMP_AFFINITY?

If it's a Westmere-style CPU, you could consider variations such as 8 threads with 2 threads only on the cores that have their own path to L3, 6 threads with 1 per core, or 4 threads with just 1 thread on each pair of cores that share those internal paths.
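If it helps, here is a quick Windows-specific sketch to confirm where the OpenMP threads actually land once KMP_AFFINITY is set (the affinity string in the comment is only an example):

#include <cstdio>
#include <omp.h>
#include <windows.h>

int main()
{
    // run with e.g. KMP_AFFINITY=granularity=fine,compact (or ...,scatter)
    // and compare the logical processor numbers reported by each thread
    #pragma omp parallel
    {
        printf( "OpenMP thread %d is on logical processor %u\n",
                omp_get_thread_num(),
                (unsigned)GetCurrentProcessorNumber() );
    }
    return 0;
}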

0 Kudos
mjc
Beginner
2,785 Views

Interesting suggestion -- I have not experimented with affinity yet. I guess I should try that.

Still, I am wondering about all this overhead being attributed to the OpenMP fork, and not to my code in the loop. Would that be the case if affinity were my only problem?

0 Kudos
SergeyKostrov
Valued Contributor II
2,785 Views

>>...I have written a function that incurs a tremendous amount of overhead in [OpenMP dispatcher] called by [OpenMP fork] called on >>behalf of a particular parallel region of mine, according to VTune.

Please provide more information: a test case (reproducer) and a screenshot of VTune with a description of the problem. Complete VS and VTune test projects would help in understanding what could possibly be wrong.

>>...
>>for( int y = 0; y < h; y++ )
>>...

It is Not clear how big your data set is.

0 Kudos
robert-reed
Valued Contributor II
2,785 Views

Let me ask a question here.  Is there any reason you are using OpenMP for your initial thread split but Intel TBB for the rest of the parallelization, rather than, say, using a TBB parallel_invoke to do the initial task splitting and then carrying on with Intel TBB through the rest of the parallel stack?  The reason I ask is that while Intel's OpenMP and Intel TBB can coexist, and TBB will defer to OpenMP if oversubscription is occurring, there is extra overhead incurred by using both threading models (not to mention multiple thread pools).  You might get by with lower overhead by just using Intel TBB throughout.  It might look something like this:

[cpp]
// assuming the functors take a blocked_range, as in the usual parallel_for form
for ( int k = 0; k < 3; k++ ) {
    tbb::parallel_invoke(
        [&]{ tbb::parallel_for( tbb::blocked_range<int>( 0, h ), myBigSlowFunctor ); },
        [&]{ tbb::parallel_for( tbb::blocked_range<int>( 0, w ), myOtherBigSlowFunctor ); } );
}
[/cpp]

Not only does this eliminate the overhead of two parallelization libraries competing with each other, but it also removes that #pragma omp parallel from within the outer loop and some potential thrash in OpenMP overhead.  Intel TBB is also more composable than OpenMP (an older parallelization model that has its advantages in some cases but makes assumptions about ownership of the whole machine that cause some trouble with multiple layers of parallelization).

0 Kudos
mjc
Beginner
2,785 Views

The mix of tbb and omp is partly historical and partly because omp is syntactically easier unless you use lambdas with tbb (not available with all compilers). So, this is mainly an OpenMP program into which I've experimentally inserted a little bit of tbb. That said, Robert's suggestion seems worth trying.

0 Kudos
mjc
Beginner
2,785 Views

I followed Robert's suggestion, and now my most time-consuming function is called exclusively by tbb parfors inside a tbb parallel_invoke. The throughput is not significantly different compared to the hybrid omp/tbb code.

Here's what sticks out for me in the VTune concurrency report.

1. Almost half the cpu time is in my set_image_p function, which is the main thing called by the nested-parallel code we've been discussing.

2. set_image_p is shown as running oversubscribed, which I suppose is because, in addition to about 11 tbb threads doing that, there are also about a half-dozen omp threads trying to do other things on other frames.

3. The second biggest "hotspot" is [OpenMP dispatcher], still taking about 25% of all cpu time, which astounds me.

4. Although the thread concurrency histogram ranges up to 21 fairly often, the cpu usage histogram shows that usage rarely goes above 9 and averages just 5.

5. There is a lot of spin time, especially in tbb.

How can I reduce overhead and improve throughput?

0 Kudos
robert-reed
Valued Contributor II
2,785 Views

Have you tinkered with KMP_BLOCKTIME?  This environment variable controls how long an OpenMP thread will spin before going to sleep, a hedge against a soon-to-follow parallel construct after the current one.  Intel TBB will defer to a spinning OpenMP thread, so when running in a mixed system such as yours, reducing it from its default (which at one time was 200, as in milliseconds) will let the TBB threads start up sooner. There should be a sweet spot in the middle: not so long that the TBB threads are kept waiting, but not so short that OpenMP puts threads to sleep needlessly often.
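For what it's worth, a sketch of that experiment (this assumes the Intel compiler's omp.h, which declares the kmp_* extensions; the 20 ms value is only a starting point to tune from):

[cpp]
#include <omp.h>   // Intel's omp.h declares the kmp_* extensions

int main()
{
    // same effect as setting KMP_BLOCKTIME in the environment;
    // an idle OpenMP worker spins this many milliseconds before sleeping
    kmp_set_blocktime( 20 );

    #pragma omp parallel
    {
        // ... OpenMP work ...
    }

    // once the OpenMP workers stop spinning, the TBB threads no longer
    // have to compete with them for cores
    return 0;
}
[/cpp]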

0 Kudos
mjc
Beginner
2,785 Views

I have been using kmp_set_blocktime( 0 ), which makes no significant difference compared to the default. kmp_set_blocktime( 1 ) is much worse. I also have OMP_WAIT_POLICY=PASSIVE.

0 Kudos
robert-reed
Valued Contributor II
2,785 Views

Hmmm.  By my understanding, an OMP_WAIT_POLICY of PASSIVE would seem to nullify the effect of KMP_BLOCKTIME.  The fact that it still has an effect in your experiments breaks my mental model of how things are supposed to work.  Or perhaps KMP_BLOCKTIME=1 overrides OMP_WAIT_POLICY=PASSIVE (at least for a millisecond) and that spinning is enough to delay TBB?  Is "much worse" in the 30% range or the 4X range ("kind'a worse" or "really much worse")?  I'll have to consult with the developers and get a better understanding of how these policies interact.

Given that KMP_BLOCKTIME has so little an effect, could the excessive spin be due to load imbalance issues?  Are all your gangs of threads finishing their tasks at the same time?

0 Kudos
mjc
Beginner
2,785 Views

It's a mystery. Omp spin time, while undesirable, is not as high as tbb spin time in this program. And they are both dwarfed by omp overhead time.

I can reduce the amount of oversubscription by reducing the task_scheduler_init parameter. That reduces the amount of tbb spin time but doesn't help throughput.

Btw, the API function kmp_set_blocktime( 1 ) almost cuts the throughput of my whole program in half and causes spin+overhead time to exceed 50% of all cpu time used.

Elapsed Time:    26.942s
  Overhead Time:    30.385s
  Spin Time:    87.310s
  CPU Time:    207.588s

Maybe I should just ask "what counts as overhead in openmp?" Because the overhead attributed to [OpenMP dispatcher] (as called by [OpenMP fork]) is the single most annoying thing in my VTune output.

0 Kudos
robert-reed
Valued Contributor II
2,785 Views

Any idea how hard you're pressing on the available memory bandwidth? With so many threads operating simultaneously and oversubscribing, perhaps there's some back pressure from the memory hierarchy?  It sounds from your description like data delivery in the OpenMP portion is the limiting factor, and the fact that adding extra TBB threads just raises spin time without raising throughput is consistent with being saturated on some resource.

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,785 Views

Before we proceed further wasting your (and our) time, I need to ask you a question.

In your post on 6/6/2013 13:46 you stated:
>>2. set_image_p is shown as running oversubscribed, which I suppose is because, in addition to about 11 tbb threads doing that, there are also about a half-dozen omp threads trying to do other things on other frames.

Your most likely problem is due to oversubscription .AND. misuse of the threading model. Not only do you have a "Who's on first" issue with mixing TBB and OpenMP, you've aggravated the spin-wait time through oversubscription of OpenMP to implement what appears to be a pthread-like event/wait-driven thread model on top of OpenMP's pooled thread model. This is a recipe for disaster.

Just what are these other omp threads doing on other frames? (The same thing? I/O? Compute-only? Spin-waiting for input from the output of the section of code you disclosed/outlined? Other?)

Jim Dempsey

0 Kudos
mjc
Beginner
2,785 Views

At this point, my situation is a little different from what I first described. Jim is correct that my pragma omp parallel is within a loop. On the other hand, I am absolutely sure that the body of the parallelized iteration is very time consuming: performance goes way down if I serialize it, and VTune confirms that a ton of time is spent in code that it calls. I am now looking at code like the following.

for ( int k = 0; k < 3; k++ )
{
    #pragma omp parallel sections // num_threads(2)
    {
        #pragma omp section
        {
            // range shown schematically
            tbb::parallel_for( tbb::blocked_range<int>( 0, h ), myBigSlowFunctor );
        }

        #pragma omp section
        {
            // range shown schematically
            tbb::parallel_for( tbb::blocked_range<int>( 0, w ), myBigSlowFunctor );
        }
    }
}

But I haven't told you enough yet. If you don't mind reading, here is an overview of what's going on.

We have multiple levels of parallelization in this video-processing application. At the top level, there is a custom-made pool of 4 threads, each of which processes one frame at a time. This gives a lot of speedup compared to single-threaded runs, because each frame requires a lot of processing and very little synchronization between one frame and another. I have experimented with different numbers of frame-level threads; performance keeps increasing up to 4 threads and beyond. Task Manager CPU utilization is around 90%, but VTune's cpu usage histogram makes it look like usage is much less than that, despite thread concurrency being very high.

Within each frame-processing thread there are about 5 openmp parallel regions, each of which specifies just two tasks (loop iterations or sections). In addition, the frame-processing function calls some functions that in turn invoke some parallelism. Admittedly, this sounds like it could be too much, but I figure it's OK to start by exposing "too much" parallelism and then tune down to a realistic range.

The most interesting part of the call graph is where the frame-processing thread calls a computing function which has the two omp sections outlined above. The functor ultimately calls the set_image_p function that takes so much time according to VTune. It is also the main use of tbb in this application, so it is presumably responsible for a large amount of the reported spin time. If I use serial code in place of the tbb parfor, throughput is much lower. If I remove the omp parallel sectioning around it, throughput goes down significantly but not dramatically. I do have a critical section around the computing function; this helps by preventing the pairs of big tbb parfors from competing with each other. They do compete with other work for other frames, though.

VTune summary output is attached. Obviously, the cpu (a 6-core i7-970) looks to be oversubscribed. I have tried different settings for omp_set_num_threads and tbb::task_scheduler_init. kmp_set_blocktime(0) helped. I feel like I'm missing a methodology for improving throughput by reducing overhead.
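For concreteness, this is roughly how I am setting up the two runtimes at the moment (a sketch only; the thread counts are just the values I'm currently experimenting with, and I'm not yet sure whether each frame thread needs its own omp_set_num_threads call):

#include <omp.h>                        // Intel's omp.h also declares the kmp_* extensions
#include <tbb/task_scheduler_init.h>

// called once near startup, before the frame-processing threads are created
void configure_threading()
{
    // one shared TBB worker pool; the i7-970 has 12 hardware threads,
    // so leave some of them for the OpenMP and frame-level threads
    static tbb::task_scheduler_init tbb_init( 6 );

    // each OpenMP parallel region in this code is only about 2 ways wide
    omp_set_num_threads( 2 );

    // idle OpenMP workers go to sleep immediately instead of spinning
    kmp_set_blocktime( 0 );
}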

0 Kudos
mjc
Beginner
2,785 Views

I appreciate you generous folks spending time on this. I'm sure there are other things you could be doing. I've had much success with OpenMP on simpler programs in the past, so it's interesting to see the issues that arise in more complex situations.

I've created a fully de-parallelized version of my program. It runs at about 1 fps and gives me a baseline to see where overhead and spin time are coming from. In the baseline, we have effectively zero spin and overhead time, as would be expected.

It turns out I didn't have to do much to get from there to a version where 25% of CPU time is designated as "overhead" by VTune; re-enabling one omp parfor was all it took. It also turns out, though, that VTune is simply wrong: it mistakenly designates time spent in my "function x" as overhead attributable to [OpenMP fork]. I don't know why this happens, but it does take my "most disturbing thing" off the table.

Going back to the fully de-parallelized version and re-enabling just the one most important tbb loop, I find that about 7% of all cpu time is tbb spin time if I allow 6 tbb threads to be utilized. The number is smaller if I don't allow more than 1 tbb thread. This, again, is without any omp or other time-consuming threads running. Robert's point about memory is a reasonable one. It also sounds like it could be a grain-size problem, so I am looking into that.
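For the grain-size experiment, I'm planning to try something along these lines (a sketch; the explicit grain value and the per-row body are placeholders):

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/partitioner.h>

// sketch: give parallel_for an explicit grain size and a simple_partitioner,
// so no task ever gets fewer than 'grain' rows of work
void run_with_grainsize( int h, int grain )
{
    tbb::parallel_for(
        tbb::blocked_range<int>( 0, h, grain ),
        []( const tbb::blocked_range<int>& r )
        {
            for( int y = r.begin(); y != r.end(); ++y )
            {
                // per-row work (the set_image_p path, in my case)
            }
        },
        tbb::simple_partitioner() );
}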

Meanwhile, my reason for having a multi-level threading model is simple: I can't realistically get the performance I need without it. The processing within a single frame is not parallelizable enough to beat Amdahl, so I need to overlap processing of one frame with others. I can't process more than a few frames at a time, though, because that incurs too much latency. Hence, two levels of parallelism. I am not committed to doing more than two levels -- I want to do everything that helps and nothing that doesn't help!

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,785 Views

>> It mistakenly designates time spent in my "function x" as overhead attributable to [omp fork]. I don't know why this happens

The fork is calling your task; the fork portion of the time is the fork time less the call time of your task and any subsequent nesting or additional calls in the fork that you wish not to attribute to the fork.

If you have subroutine foo, that calls subroutine fee then foo's time includes the runtime of the code in foo + the runtime of fee (+ anything it calls).

>> I find that about 7% of all cpu time is tbb spin time

You are assuming this is overhead; instead it is wait time. The tbb threads cannot do work because the serial portion of the code has passed out of the tbb parallel region and has not yet looped back to it. When your serial code passes through the tbb parallel region (e.g. parallel_for or parallel_invoke, ...), some tbb threads complete before others. Those threads then have nothing to do but wait for the remaining tbb threads to complete their slice of the for or task before exiting the parallel region; once you exit the parallel region, the serial portion of the app can produce more data for the parallel region to work on. More threads waiting == more apparent "overhead", when this is not overhead at all (other than from a power-consumption viewpoint). What is important to your application is the latency between the start of the parallel region and the completion of (all threads in) the parallel region.

Note now that because you have only one tbb parallel region running at a time, the threads performing that region have nothing to do after completing their portion of it and thus enter a spin wait. Had you had two tbb parallel regions running at a time, the threads finishing up in one of those regions would be available to complete unassigned tasks of the second parallel region -- no spin-wait time for those threads, at least until they consume all the tasks of the second (last concurrent) parallel region.

What is your fps for

Serial (1fps)
1 TBB thread (should be less than 1fps)
2 TBB threads
...
HW # TBB threads

Look at the effects on fps, not a false metric of spin-wait time.

Now then, spin-wait time is important. You can use it as an indicator of "free time", and not as an indication of "lost time" or "overhead". Your job then is to craft a way to use this free time. This generally means increasing the tasking, as was done with the parallel_invoke(parallel_for, parallel_for).

Now let's further inspect the 7% "overhead" in spin-wait. Is this a true picture of unutilized CPU time? No; you should also look at the system idle time (or the aggregate time not spent computing by all threads in your app). This is an indicator of available processing resources you can apply to your problem.

>>so I need to overlap processing of one frame with others

This is typically performed using double-buffering or n-buffering (n=1,2,3,4...), or ring buffering. You should also look at using parallel_pipeline.

With either n-buffering or parallel_pipeline you can select the number of buffers (1, 2, 3, ...). Now you can tune for latency or throughput.
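A minimal sketch of the parallel_pipeline idea (the Frame type and the acquire/process/emit functions are placeholders for your code; the token count is where you trade latency against throughput):

#include <cstddef>
#include <tbb/pipeline.h>

struct Frame { int id; };

// placeholder producer/consumer functions standing in for your frame I/O
static int frames_left = 100;
Frame* acquire_frame() { return ( frames_left-- > 0 ) ? new Frame() : NULL; }
void   process_frame( Frame* ) { /* the expensive per-frame work goes here */ }
void   emit_frame( Frame* f )  { delete f; /* hand the finished frame downstream */ }

void run_pipeline()
{
    const size_t frames_in_flight = 4;   // tokens in flight: the latency vs. throughput knob
    tbb::parallel_pipeline( frames_in_flight,
        tbb::make_filter<void, Frame*>( tbb::filter::serial_in_order,
            []( tbb::flow_control& fc ) -> Frame*
            {
                Frame* f = acquire_frame();
                if( !f ) fc.stop();       // end of input stream
                return f;
            } )
        &
        tbb::make_filter<Frame*, Frame*>( tbb::filter::parallel,
            []( Frame* f ) -> Frame*
            {
                process_frame( f );       // several frames processed concurrently
                return f;
            } )
        &
        tbb::make_filter<Frame*, void>( tbb::filter::serial_in_order,
            []( Frame* f )
            {
                emit_frame( f );          // frames leave in arrival order
            } ) );
}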

Latency-reduction techniques such as a ring buffer can be employed, so that as a buffer is filled at one end it is concurrently processed and emptied at the other end. When set up as a ring buffer, the buffer becomes a stream device between the producer(s) on one end and the consumer(s) on the other end. If necessary, multiple ring buffers can be employed.

Jim Dempsey

0 Kudos
mjc
Beginner
2,785 Views

I agree, absolutely, that throughput (FPS) is the only true performance metric. I look at utilization, spin time, etc. as diagnostic metrics only to indicate possible avenues for improvement. My frame-level threading is a straightforward implementation of the n-buffering strategy you described, which I have been using for years.

Addressing your question about throughput vs. number of threads, here is a table of throughput (FPS) and reported spin time. The first 10 rows are from runs that do NOT do the n-buffering, in order to focus on the inherent scalability of the one time-consuming function I care most about. The last line is from a frame-parallel run, for perspective.

  1. "max serial": 1.1, 1.4%
  2. p_for, no p_invoke, 1 thread: 1.1, 1.6%
  3.  p_for, no p_invoke, 2 threads: 1.4, 2.9%
  4.  p_for, no p_invoke, 4 threads: 1.6, 5.3%
  5.  p_for, no p_invoke, 6 threads: 1.7, 8.3%
  6.  p_for, no p_invoke, 12 threads: 1.7, 13.4%
  7.  p_for+p_invoke; 4 threads: 1.7, 4.1%
  8.  p_for+p_invoke; 5 threads: 1.7, 5.6%
  9.  p_for+p_invoke; 6 threads: 1.7, 6.7%
  10.  p_for+p_invoke; 12 threads: 1.7, 12.7%
  11.  p_for+p_invoke, 4 frame threads; 4 tbb threads: 4.5, 3.7%

Numbers like these initially made me think it didn't scale well beyond 2-4 threads. In truth, though, it does better than indicated. Using other diagnostics, I was able to determine that the function of interest gets faster with more tbb threads; it just doesn't impact the overall FPS number when it's given more than 2 threads.

So my main problems are solved, and they turned out to be measurement issues. It was interesting to learn from the experts about tbb, omp, and vtune.

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,785 Views

And what is it for:

6 frame threads; 6 tbb threads?
4 frame threads; 8 tbb threads?
8 frame threads; 4 tbb threads?

Jim Dempsey

0 Kudos
mjc
Beginner
2,785 Views

It clearly likes more frame threads, but latency is a problem for the application.

  1. p_for+p_invoke, 6 frame threads; 6 tbb threads: 5.1 FPS, 4.8% spin
  2. p_for+p_invoke, 4 frame threads; 8 tbb threads: 4.3 FPS, 5.6% spin
  3. p_for+p_invoke, 8 frame threads; 4 tbb threads: 5.3 FPS, 3.5% spin
0 Kudos
jimdempseyatthecove
Honored Contributor III
2,628 Views

If frame rate and low latency are important, then it might be time for you to invite someone capable to look at your code for things you've overlooked. All the readers of this forum, code sight unseen, can only offer generalized help. If you need specific help, you will have to show specific code (possibly the whole application).

Jim Dempsey

0 Kudos