Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Poor auto_partitioner performance?

e4lam
Beginner
1,127 Views
Hi,

I just changed from using TBB 3.0 update 6 to TBB 4.0 update 1 (OSS commercial aligned releases). In some tests, I've noticed that my parallel_for()'s don't seem to be parallelizing anymore. Basically, on my recent Ubuntu distribution, the test I have is now 6 times slower on a 6 core machine. By explicitly specifying the grain size via a simple_partitioner, I was able to get similar performance again to TBB 3.

Looking at the include files, it seems that TBB 4 has a new auto_partitioner that might be the cause of this serious problem. The only possibly non-standard thing I can think of is that during my application startup, I might do:
[cpp]static tbb::task_scheduler_init theTaskScheduler(tbb::task_scheduler_init::deferred,0);
static inline void
reset_task_scheduler()
{
    if (theTaskScheduler.is_active())
        theTaskScheduler.terminate();
    theTaskScheduler.initialize(get_num_processors()/*=6*/, 0);
}
[/cpp]
So any ideas? I know it's being initialized correctly because simple_partitioner works.

Thanks!
0 Kudos
1 Solution
Anton_M_Intel
Employee
1,127 Views
Hi, sorry you ran into this issue. We did change the implementation of the auto-partitioning algorithm to handle highly imbalanced work better, and we owe you a blog post describing the change (CHANGES mentions it briefly), sorry.
To improve load balance we have to partition the work into smaller pieces, but if processing a piece carries a high constant overhead, such partitioning adds a lot of overhead. Unfortunately, this is a contradiction that probably cannot be resolved within one algorithm, or without an external hint.
The new auto_partitioner algorithm still creates basically the same number of tbb::tasks, but to get better responsiveness it further splits the range *inside* a task and executes it in small pieces, invoking the functor multiple times per task. In your case, it seems to have reached the grainsize limit. However, the algorithm is still able to aggregate iterations into ranges of size > 1 to handle huge ranges with little work per iteration, where the constant overhead of calling a functor becomes significant.
So, how is your parallel_for's Body (tbb_functor) implemented? Does it contain something outside of the loop over the range? And how big is the range (100, 1000 iterations?)
We are also aware that parallel_reduce with expensive split/join operations may be hurt as well. I hope to discuss all of this in more detail in the blog.

View solution in original post

0 Kudos
15 Replies
e4lam
Beginner
1,127 Views
The problem seems to be how auto_partitioner now behaves with respect to a default grain_size of 1. All I'm doing is:

tbb::parallel_for(tbb::blocked_range<int>(begin, end), tbb_functor, tbb::auto_partitioner());

I can get it fast again by increasing the default grain_size to roughly (end - begin)/num_cores. It seems like the auto_partitioner is now creating many more tasks?
0 Kudos
e4lam
Beginner
1,127 Views
PS. AFAICT, the documentation (Table 10) says it should be OK to use the default grain size of 1 with auto_partitioner.
0 Kudos
RafSchietekat
Valued Contributor III
1,127 Views
"By explicitly specifying the grain size via a simple_partitioner, I was able to get similar performance again to TBB 3."
All partitioners, including auto_partitioner, should obey a range's grainsize (by way of is_divisible()).

How long is the initial range, and have you made a log and maybe even a summary of the lengths of the executed subranges with both TBB versions? That would allow us to better evaluate the change.

If the length of the initial range is sufficiently larger than the number of cores by order of magnitude, you should indeed be able to forget about grainsize with auto_partitioner. It would be difficult to get rid of that condition altogether, and maybe something went wrong in an attempt to improve on it.
0 Kudos
e4lam
Beginner
1,127 Views
The functor only gets ranges of size 1. This really looks to be a behavior change due to the revamped auto_partitioner. The question is whether it was intentional.
0 Kudos
RafSchietekat
Valued Contributor III
1,127 Views
Would you also tell us the length of the initial range?

(Added) And can you increase it to a level where you do see nontrivial chunks being executed?
0 Kudos
Anton_M_Intel
Employee
1,128 Views
Hi, sorry you ran into this issue. We did change the implementation of the auto-partitioning algorithm to handle highly imbalanced work better, and we owe you a blog post describing the change (CHANGES mentions it briefly), sorry.
To improve load balance we have to partition the work into smaller pieces, but if processing a piece carries a high constant overhead, such partitioning adds a lot of overhead. Unfortunately, this is a contradiction that probably cannot be resolved within one algorithm, or without an external hint.
The new auto_partitioner algorithm still creates basically the same number of tbb::tasks, but to get better responsiveness it further splits the range *inside* a task and executes it in small pieces, invoking the functor multiple times per task. In your case, it seems to have reached the grainsize limit. However, the algorithm is still able to aggregate iterations into ranges of size > 1 to handle huge ranges with little work per iteration, where the constant overhead of calling a functor becomes significant.
So, how is your parallel_for's Body (tbb_functor) implemented? Does it contain something outside of the loop over the range? And how big is the range (100, 1000 iterations?)
We are also aware that parallel_reduce with expensive split/join operations may be hurt as well. I hope to discuss all of this in more detail in the blog.
0 Kudos
e4lam
Beginner
1,127 Views
The ranges vary in size between 6567 and 19376. Is it expected that the TBB 4 auto_partitioner will divide all the way into the finest granularity possible before calling the task functor for these ranges?

As a secondary question, is there some way for me to call the old TBB 3 partitioner using TBB 4? I see that the old auto_partitioner still exists in the header files but I'm not sure how to invoke it.

Thanks!
0 Kudos
Anton_M_Intel
Employee
1,127 Views
Hmm... The ranges are big enough to be partitioned into ranges of size > 1, at least initially. Let's assume you have 16 HW threads; then 4*16 = 64 tasks will be created initially, each starting from at least 6567/64/16 = ~6 iterations per functor invocation. That does not explain what you see. The work can be so imbalanced that the partitioner starts to make smaller pieces, but that can happen only after the initial partitioning.
Please also answer my previous question: does your functor contain something outside of the loop over the range?
Your answers will help us to improve the Reference at least, or even improve the partitioning algorithms.
Thanks!
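The arithmetic above can be sanity-checked with a tiny helper (the 4*P initial-task factor and the extra /16 slicing inside each task are taken from the post as stated; treat them as illustrative constants, not documented guarantees):

```cpp
#include <cstddef>

// Back-of-envelope from the post: with P hardware threads, roughly
// 4*P tasks are created initially, and each task further slices its
// chunk by an extra factor before invoking the body.
std::size_t initial_tasks(std::size_t P) { return 4 * P; }

std::size_t iters_per_invocation(std::size_t range, std::size_t P,
                                 std::size_t inner_factor) {
    return range / initial_tasks(P) / inner_factor;
}
```

For a range of 6567 and P = 16 this gives 64 initial tasks and about 6 iterations per functor invocation, matching the numbers quoted above.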
0 Kudos
Anton_M_Intel
Employee
1,127 Views
Answering your second question: the old_auto_partitioner is kept for the sake of parallel_scan only. There is no way to enable the old code, e.g. by defining a macro. But it should be safe to just include the parallel_for and partitioner.h headers from TBB 3; of course, do not mix them with TBB 4 headers in one compilation module.
0 Kudos
e4lam
Beginner
1,127 Views
Sorry for the late reply; I've been on vacation. We're not doing "constant" work per task invocation, but work that overlaps: i.e., if the functor gets a range of size N, it has to do N + X(N) work, where X(N) is some extra amount of work dependent on N. So the algorithm we're using still ends up performing poorly when the initial range is broken into many more task invocations than before.
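The overhead effect described here can be put in a simple cost model (hypothetical: a fixed cost c per body invocation on top of linear per-iteration work w; the constants below are made up for illustration):

```cpp
#include <cstddef>

// Total cost of processing n iterations split into k body invocations,
// assuming each invocation pays a fixed overhead c on top of the
// per-iteration work w. More invocations -> more total overhead.
double total_cost(std::size_t n, std::size_t k, double w, double c) {
    return n * w + k * c;
}
```

For example, with n = 6567, w = 1, and c = 50, six invocations cost 6867 units, while single-iteration invocations (k = 6567) cost 334917 units, which is the kind of blow-up reported in this thread.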

I think the TBB3 auto_partitioner still has a niche in the TBB4 world.
0 Kudos
Alexey-Kukanov
Employee
1,127 Views
As Raf suggested, you can still use a grainsize with the auto_partitioner. It tries to keep granularity coarser, and in any case never makes it finer than half the specified grainsize.
What makes me wonder, though, is that you mentioned that only a grainsize of (end-begin)/num_cores gives the performance back. That suggests your workload(s) barely benefit from any finer granularity, and so from auto_partitioner. So maybe a coarse fixed grainsize is what you need?
0 Kudos
e4lam
Beginner
1,127 Views
Yes, we already realized a grainsize is what's needed for use with TBB 4. I think the problem was that the change caught us by surprise. Thanks, Alexey!

0 Kudos
RafSchietekat
Valued Contributor III
1,127 Views
#11 "What makes me wonder though is that you mentioned that only grainsize of (end-begin)/num_cores gives performance back. That suggests that your workload(s) almost do not benefit from making granularity any finer, and so from auto_partitioner. So maybe coarse fixed grainsize is what you need?"
Grainsize should never depend on the number of threads available. This particular choice gives each thread only O(1) chunks to execute, and then it may have to sit idle while other threads are still working on the problem, because it won't be able to steal more work. Instead, grainsize should be a fixed number, tuned by experiment.

#12 "Yes, we already realized a grainsize is what's needed for use with TBB 4. I think the problem was that the change caught us by surprise. Thanks, Alexey!"
It's only a workaround, since auto_partitioner is supposed to make a grainsize superfluous. The underlying problem may still need attention.
0 Kudos
e4lam
Beginner
1,127 Views
It's only a workaround, since auto_partitioner is supposed to make a grainsize superfluous. The underlying problem may still need attention.

I've probably done a bad job of explaining my situation. I think the two main factors behind my poor performance were:
- There's a large penalty for Body invocations when they're given small ranges. I suspect that even at ~6 iterations per Body invocation, the overhead was huge for my particular algorithm.
- The Body may do much more work on some parts of the range than others (i.e., poorly balanced work). My second suspicion is that the TBB 4 auto_partitioner may have quickly subdivided into single-iteration ranges because of this.

In any case, Anton pointed out that the TBB 4 auto_partitioner needs some application hint to tackle the load-balancing problem. Specifying a grain size seems a natural form for such a hint in the TBB library. It's good practice to test and specify a grain size for performance reasons anyway.

Having said that, I still think the TBB 3 auto_partitioner was a good idea, in that it created a minimal number of Body invocations while still load balancing. Personally, I would like it to still be available, perhaps under a new name.
0 Kudos
RafSchietekat
Valued Contributor III
1,127 Views
I don't know how to respond to that other than by repeating myself.
0 Kudos
Reply