Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

How to flush thread local buffers at end of parallel loop

Walter_D_
Beginner
574 Views

I have the following problem: a loop over a large number of objects (indices, if you like), for a small but randomly spread fraction of which some expensive but vectorizable work must be done. My current approach is to buffer those expensive indices until a vector's worth has accumulated, then flush the buffer by performing the work with vector instructions. Of course, each thread needs its own buffer, which can be organized as a thread-local object. The problem is that when the loop finishes, some of the buffers will not be empty and need a final flush. I can do that serially, of course, but I wonder whether each thread could flush its own buffer at the very end of the loop, i.e. as a final task once no more loop-related tasks are available.

How could this be achieved? Or is there already a pattern that supports this (e.g. via a reduction)?

3 Replies
Alexei_K_Intel
Employee

Hi Walter,

You may want to consider the tbb::combinable interface that provides TLS functionality with a possibility to merge per-thread values after computations (see tbb::combinable::combine and tbb::combinable::combine_each methods).

Regards, Alex

Walter_D_
Beginner

Thanks Alex,

This appears to be equivalent to tbb::enumerable_thread_specific::combine_each(). Unfortunately, the tbb documentation (at least what I could find) is so poor that it's unclear (from the link you provided) whether these run in parallel or not; I suspect not. With tbb::enumerable_thread_specific::range() in conjunction with tbb::parallel_for(), one can at least achieve that.

However, that is my fallback solution. What I'm asking for here is something better, i.e. avoiding the barrier between the first loop and the parallel flushing of the buffers. So, unfortunately, your suggestion is of little help.

Cheers, Walter.

 

Alexei_K_Intel
Employee

I see your point, thank you for the explanation. There are several ways to achieve this with Intel TBB. The simplest, though probably not the most efficient, variant is to use static_partitioner for the parallel loop. Then the algorithm does not need any TLS: each loop-body invocation processes objects as usual and flushes its remainder at the end. The main problem here is load balancing, because static_partitioner cannot balance the work across threads.

Another approach is a combination of tbb::combinable and dynamic memory. The algorithm uses combinable (or any other TLS) inside the loop, hands full buffers over to additional tasks, and processes the remainders serially once the parallel loop has finished; meanwhile, the other threads can already work on the additional tasks. Consider the example:

tbb::task_group tg;
tbb::combinable<Buffer*> tls( [] { return (Buffer*)nullptr; } );
tbb::parallel_for( tbb::blocked_range<size_t>(0, N),
    [&]( const tbb::blocked_range<size_t>& r ) {
        Buffer*& my_buf = tls.local();
        if ( !my_buf ) my_buf = create_buffer();
        for ( size_t i = r.begin(); i != r.end(); ++i ) {
            if ( is_interesting( i ) ) {           // placeholder predicate
                my_buf->add( i );
                if ( my_buf->size() == VECTOR_SIZE ) {
                    // pass the full buffer to an additional task
                    tg.run( [buf = my_buf] {
                        // do the "vectorizable" work, then deallocate buf
                    } );
                    // create a new buffer for the further indices
                    my_buf = create_buffer();
                }
            }
        }
    } );
// The parallel loop is finished, so it is safe to traverse the TLS
// and spawn additional tasks for the non-empty remainders.
tls.combine_each( [&]( Buffer* buf ) {
    if ( buf && buf->size() > 0 )
        tg.run( [buf] { /* same "vectorizable" work, then deallocate */ } );
} );
// wait for the additional tasks
tg.wait();

The main problem with this approach is that the worker threads will tend to pick up the just-created tasks instead of the tasks of the parallel loop; I do not know whether that is a big issue in your case. In theory, you could also combine the static_partitioner variant with additional tasks for load balancing. It is difficult to say which solution is best for you, because that depends on the relative amount of work in the parallel loop and in the "vectorizable" part, and on the cost of memory allocation. To reduce allocation overhead, you can try tbbmalloc.

Regards, Alex
