Intel® oneAPI Threading Building Blocks

Explicit chunk size

I am trying to use an explicit chunk size with a simple_partitioner and parallel_for:

[cpp]int grainsize = m_numCentroids / numThreads;
grainsize = grainsize > 0 ? grainsize : 1;
tbb::parallel_for(tbb::blocked_range<int>(0, m_numCentroids, grainsize),
                  destinationRunner, tbb::simple_partitioner());[/cpp]

Then in destinationRunner::operator() I print out the range it is iterating over. I expected each range's size to equal grainsize, as the tutorial describes simple_partitioner with "Chunk size = grainsize".

However, this is not the case; I get the following output:

m_numCentroids = 1084
numThreads = 5
grainsize = 216

Running from 0 to 135
Running from 542 to 677
Running from 813 to 948
Running from 948 to 1084
Running from 677 to 813
Running from 271 to 406
Running from 135 to 271
Running from 406 to 542

Obviously what I'm trying to do here is create exactly the desired number of threads and distribute the work equally between them, but the partitioner isn't behaving as expected! So I augmented this with a call to:

[cpp]tbb::task_scheduler_init init(numThreads);[/cpp]

However, this didn't really change the chunk size, just the number of threads the chunks would run on. So what I get is exactly the same as above, except that instead of running chunks of 135 elements on 8 threads simultaneously, it runs chunks of 135 on 5 threads, followed by another 3 after the first 5 finish!!! Really not what I want. What I want is 5 threads running chunks of 216 elements. Could someone tell me what I'm doing wrong?

Using tbb22_20090809oss

What you want is static partitioning, which is easy to do with OpenMP parallel loops but not with TBB's parallel_for, particularly when the number of threads is not a power of two.

Would you mind telling why you think you need static partitioning?

You could emulate static partitioning using the task_group class, which lets you create exactly as many tasks as you need and distribute the work across them however you wish. But in general, this approach is less efficient than parallel_for, which uses dynamic load balancing via work stealing.

Thanks for the quick response. I'm looking for static partitioning so that I can easily let the user control the number of threads and, more importantly, so that I can generate some graphics to show off how well TBB parallelises our algorithm with a variable number of threads :)

Currently the graph only has 4 real data points: {1,2,4,8} threads. I've managed to get this to work by implementing my own explicit_range class, inherited from blocked_range, and simply modifying the splitting constructor to split off non-symmetric ranges with an explicit size (the grainsize).

[cpp]    //Value middle = r.my_begin + (r.my_end-r.my_begin)/2u; // blocked_range
    Value new_end = r.my_begin + r.my_grainsize;            // explicit_range[/cpp]
Obviously this won't work well unless your grainsize is equal to your number of work elements divided by your number of threads, but as I'm controlling both of these parameters explicitly it should generate suitable speed improvements (and it is).

Quoting jamiecook
I'm looking for static partitioning so that I can easily allow the user to control the number of threads and more importantly so that I can generate some graphics to show off how well TBB parallelises our algorithm with a variable number of threads :)

Actually, you don't need static partitioning for either of these.
To control the number of participating threads, use task_scheduler_init class.
To ensure TBB scales well with a growing number of threads, create enough parallel slack (i.e. orders of magnitude more tasks / iteration subranges than there are threads).

What we usually do in scalability studies like that is throttle the number of threads via task_scheduler_init and just use the default parameters of parallel_for. In many benchmarks it scaled nearly linearly. And when it did not, overhead from creating parallel slack was rarely the reason (in particular because the default settings of parallel loops in TBB 2.2 are chosen to avoid unnecessary task creation).

Except that I don't want to create many orders of magnitude more tasks than threads :)

The tasks are quite large and there is a bit of overhead in iterating over a range, so I would rather portion them out myself. Not a lot of difference admittedly, but I want to have the control to do this if I want to.

Well, I guess I was not absolutely correct in talking about orders of magnitude more tasks :) The point is that you should allow it, but not create them manually. And you don't even have to do anything for that; TBB's default settings for parallel_for were selected to avoid excessive task creation while keeping good load balance. There are corner cases where this leaves some performance on the table, but in most cases it works just fine. In particular, I think you'd be surprised if you counted the Body objects created during the run; there should be far fewer than the number of iterations in the loop.

These explanations are not to convert you to my belief :) but rather to let other readers know that what you do is usually unnecessary. The primary idea of TBB is to make its users care *less* about work partitioning, job scheduling, thread pool management, etc. - ideally, not care at all - and still get good performance and scalability.