Intel® oneAPI Threading Building Blocks

Module tbb.dll using 10% of my CPU time -- Any tips on where to start looking?

Oren
Beginner
Hello,

I have a rather simple TBB program that uses only tbb::parallel_for (and tbb::auto_partitioner) to iterate through a pre-computed (const) vector of computations. This should be embarrassingly easy to parallelize -- there is very little locking between the computations and very many (>10000) computations per step. The only difficulty I see is that the execution time of each task varies wildly.
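For concreteness, the loop is structured roughly like this (Computation, execute(), and run_step() are simplified placeholders, not my actual code):

#include <cstddef>
#include <vector>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

// Placeholder for one precomputed computation.
struct Computation { double a, b; };

// Placeholder for the per-item work; the real cost varies wildly.
inline void execute(const Computation& c) { volatile double r = c.a * c.b; (void)r; }

void run_step(const std::vector<Computation>& items) {
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, items.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                execute(items[i]);
        },
        tbb::auto_partitioner());  // TBB picks chunk sizes dynamically
}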

Unfortunately, when I profile (using Intel Parallel Studio -> Hotspots), I see that tbb.dll is using up almost 10% of the CPU time! This doesn't seem like a normal situation (?), so I'm assuming that either I've done something wrong or that the auto_partitioner or the default grainsize is not well suited to my usage.

I don't have much experience with where to start looking and tweaking, so I would appreciate some guidance on how to proceed.

Thanks!

~Oren

PS. The 10% of CPU time spent in tbb.dll is dominated by these 2 entries.

[TBB Scheduler Internals] <- [TBB Scheduler Internals] <- endthreadex <- endthreadex <- BaseThreadInitThunk <- RtlUserThreadStart 73.9% tbb.dll

[TBB Scheduler Internals] <- endthreadex <- endthreadex <- BaseThreadInitThunk <- RtlUserThreadStart 20.0% tbb.dll
6 Replies
Alexey-Kukanov
Employee
Quoting - Oren
I have a rather simple TBB program that uses only tbb::parallel_for (and tbb::auto_partitioner) to iterate through a pre-computed (const) vector of computations. This should be embarrassingly easy to parallelize -- there is very little locking between the computations and very many (>10000) computations per step. The only difficulty I see is that the execution time of each task varies wildly.
With parallel_for, the time spent inside the TBB scheduler must be due either to too fine a granularity (which does not seem to be your case) or to load imbalance (most likely your case).

From what you wrote, it sounds like simple_partitioner with a grainsize of 1 might work very well, because it provides the best opportunities for load balancing (which is important when the execution time per iteration varies) and each task (or at least most tasks) has enough computation of its own to justify the invocation.
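In code, the experiment would look roughly like this (do_interaction() stands in for your per-iteration work):

#include <cstddef>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

void do_interaction(std::size_t i);  // placeholder for your per-iteration work

void run(std::size_t n) {
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, n, /*grainsize=*/1),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                do_interaction(i);
        },
        tbb::simple_partitioner());  // splits until every chunk is at most 1 iteration
}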

I would appreciate it if you could provide more details about your app; my interest is in looking for problems with auto_partitioner, and perhaps in finding a work-partitioning policy better suited to cases like yours.
Alexey-Kukanov
Employee
With parallel_for, the time spent inside the TBB scheduler must be due either to too fine a granularity (which does not seem to be your case) or to load imbalance (most likely your case).

From what you wrote, it sounds like simple_partitioner with a grainsize of 1 might work very well <...>

Well, on a second read I think I might have misinterpreted your writing, and there is not necessarily enough work per single iteration. An experiment with simple_partitioner and a grainsize of 1 will probably tell more.

How much does the amount of work per iteration vary between the extreme cases? And which cases dominate the workload: mostly light work with a few heavy iterations, the opposite, or mostly a medium amount of work with some percentage of extremes?
Oren
Beginner
Thanks for your interest!

First off, the program is a particle-based simulation where the bulk of the computation involves calculating the interactions between the particles. Because the interactions have a finite range, many of the interaction calculations take a very small amount of time (essentially just computing the distance and returning on rejection) -- the remainder of the interactions take much longer.

To answer your question about the distribution of work per iteration more precisely, I inserted tbb::tick_count::now() calls at the start and end of the interaction function (that is, the one called by the operator() of the function object passed to parallel_for) and gathered a vector of the wall times (at least according to (end-begin).seconds()). I've attached a log-log plot, but to summarize: almost half of the computations finish in < 10^-6 seconds, and less than 1% take more than 10^-5 seconds.
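The instrumentation was roughly the following sketch (timed_interaction() is a simplification; I collect into a tbb::concurrent_vector so that concurrent push_back is safe):

#include <tbb/tick_count.h>
#include <tbb/concurrent_vector.h>

tbb::concurrent_vector<double> timings;  // thread-safe push_back from all worker threads

void timed_interaction(/* particle pair, etc. */) {
    tbb::tick_count t0 = tbb::tick_count::now();
    // ... the actual interaction computation goes here ...
    tbb::tick_count t1 = tbb::tick_count::now();
    timings.push_back((t1 - t0).seconds());  // wall time of this call, in seconds
}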

Now I can see how this is a "nightmare case" for the auto_partitioner -- we need large chunks to minimize overhead but small chunks to make sure that one thread does not exhaust its work before the others.

Thanks again,

Oren

PS. Here's the log-time / log-probability distribution.

Alexey-Kukanov
Employee
Quoting - Oren
To answer your question about the distribution of work per iteration more precisely, I inserted tbb::tick_count::now() calls at the start and end of the interaction function (that is, the one called by the operator() of the function object passed to parallel_for) and gathered a vector of the wall times (at least according to (end-begin).seconds()). I've attached a log-log plot, but to summarize: almost half of the computations finish in < 10^-6 seconds, and less than 1% take more than 10^-5 seconds.

Well, now it seems the contrary of my initial guess is true: there is not enough work in most iterations to justify a grainsize of 1. auto_partitioner tries to balance the load, but it probably generates tasks that are too small, causing additional overhead.

Meanwhile, you should find a suitable grain size. Let me suggest some reasoning.

You probably have an idea of the total time required to process a typical workload serially -- let's call it T_work. You might also have an idea of the number of cores on a typical target system -- let's call it P. Ideally, the parallel execution time should be T_work/P, which is also the work time of every core. For load balancing, the amount of work per chunk should be limited to some fraction of the per-core work time: T_chunk <= c*T_work/P, where c should be rather small; 1/4 is probably about the biggest reasonable factor (e.g. auto_partitioner uses c = 1/4 as the initial chunking factor, though it applies it to the iteration space per thread, not to execution time per thread). On the other hand, T_chunk cannot be bigger than the grain size (GS) multiplied by the worst execution time of a single iteration: T_chunk <= GS*t_max (the small t indicates that this is a single-iteration time). Therefore we can set an upper bound for the grain size: GS <= c*(T_work/P)/t_max; a bigger grainsize will not provide enough load-balancing opportunities.

For the lower-bound analysis, another timing is required: the execution time with simple_partitioner and a grainsize of 1 on a single thread. In this case, every iteration runs in a separate task, and all tasks are executed by a single thread, so we can think of the execution time as T_work + T_ovh (useful work plus overhead). When GS iterations are executed as a single task, the overhead is reduced roughly by a factor of GS. Let's say you don't want the overhead to be more than X% of the total work time; I would probably consider X being 1 to 5 percent. Thus we have T_ovh/GS <= (X/100)*T_work, or GS >= 100*T_ovh/(X*T_work), as the lower bound for the grain size.
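To make the two bounds concrete, here is a small calculation; every constant in it is a made-up assumption to be replaced with your own measurements:

#include <cstdio>

int main() {
    // Every constant below is an illustrative assumption -- substitute your measurements.
    const double T_work = 0.2;    // serial time of one step, seconds
    const double P      = 4;      // number of cores on the target system
    const double t_max  = 1e-4;   // worst single-iteration time, seconds
    const double c      = 0.25;   // chunking factor, 1/4 as discussed above
    const double T_ovh  = 0.05;   // overhead measured with simple_partitioner, GS=1, one thread
    const double X      = 5.0;    // tolerated overhead, percent of T_work

    const double gs_max = c * (T_work / P) / t_max;      // upper bound: 125
    const double gs_min = 100.0 * T_ovh / (X * T_work);  // lower bound: 5

    std::printf("grain size range: [%g, %g]\n", gs_min, gs_max);
    return 0;
}

With these particular numbers, any grainsize between 5 and 125 satisfies both bounds.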

If there remains a range for the grainsize to vary in, you can search for a better value within that range.

By the way, auto_partitioner also respects the grainsize specified by blocked_range, and will never split a chunk that is already smaller than that. Of course, auto_partitioner's value is much reduced if you have to specify a grainsize for it, but it may still cut off some overhead by dynamically increasing the effective chunk size, though with the risk of going too far with that and hurting load balance.
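That is, something like the following, where GS comes from the analysis above (do_interaction() is again a placeholder):

#include <cstddef>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

void do_interaction(std::size_t i);  // placeholder, as before

void run_auto(std::size_t n, std::size_t GS) {
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, n, GS),  // auto_partitioner never splits below GS
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                do_interaction(i);
        },
        tbb::auto_partitioner());  // but may keep chunks larger than GS to cut overhead
}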
Oren
Beginner

Quoting - Alexey-Kukanov
Well, now it seems the contrary of my initial guess is true: there is not enough work in most iterations to justify a grainsize of 1 <...>

Indeed.

Experimenting with the grainsize revealed that my program was so efficient at eliminating unnecessary computation (via neighbor listing and other contrivances) that I only had ~40k computations per step, when I thought it was closer to ~100k. Efficiently distributing work to 4 CPUs when the minimum grain size ought to be 10000 (~5 ms execution time per grain; overhead is ~0.5 ms on my rig) is simply infeasible for such a light load: ~40k iterations at a grainsize of 10000 gives only 4 chunks, one per CPU, with no slack left for load balancing.

I've learned my lesson -- make sure your task is appropriate to the level of parallelism you desire before complaining about the overhead incurred by the scheduler!

Thanks for your guidance -- I feel like I understand partitioning and overhead well enough to proceed on my own.
Alexey-Kukanov
Employee
Quoting - Oren
I've learned my lesson -- make sure your task is appropriate to the level of parallelism you desire before complaining about the overhead incurred by the scheduler!

Yes, parallelizing at a higher level could possibly help reap more speedup. Combining this work with other computation that can execute concurrently could also be a good idea. But even if neither is possible, you might still benefit from exploiting the available parallelism, at least in terms of wall-clock execution time, even though 10% of the time is not spent on useful work.