Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.
2471 Discussions

enumerable_thread_specific and nested parallel_for



I'm working on the following problem: I need to iterate over a 3D grid and create 8 std::vector<TInfo>s, where TInfo instances represent a set of operations. This is kind of a coloring algorithm where each of the 8 vectors corresponds to a color, and all the TInfo instances in the same vector can be  processed concurrently without data races. This been said, I can do something like the following (not necessarelly valid c++ code):


std::vector<TInfo> tinfoVectors[8];

obtainTInfoVectors(grid,tinfoVectors); // This is a sequential procedure

for (int i = 0; i < 8; i++)

    parallel_for(blocked_range<size_t>(0,tinfoVectors.size()), FtorProcessTInfo(tinfoVector));


I want to parallelize obtainTInfoVectors. This function goes through all the grid cells and creates 8 TInfo​ instances per cell, and pushes the instances into the right tinfoVectors. This means that in a multi-threaded implementation, different threads will try to add elements at the same time to tinfoVectors​ creating data races. Two solutions come to my mind:

  1. Replace std::vector with tbb::concurrent_vector: the problem is that obtainTInfoVectors​ very intensivelly adds elements to the vectors, and I don't really expect to have much gain using tbb::concurrent_vector. Thus, I discarded this option.
  2. To use tbb::enumerable_thread_specific to have a tinfoVectors per thread. This, I beleive, is more efficient.

Using  tbb::enumerable_thread_specific, the code can change to something like this:


typedef std::vector<TInfo> TInfoVectors[8];

enumerable_thread_specific<TInfoVectors> tinfoVecsTLS;


for (int i = 0; i < 8; i++)

    for (t = tinfoVecsTLS.begin(); t != tinfoVecsTLS.end(); t++) //LOOP1

        parallel_for(blocked_range<size_t>(0,(*t).size()), FtorProcessTInfo((*t)));//LOOP2


However, since parallel_for has some implicit synchronization, I'm afraid that LOOP1 can introduce unnecessary overhead. If I could flatten LOOP1 and LOOP2 into one loop and perform a parallel_for would be ideal. I'm aware of flattened2d​, but since it only supports forward iterator is not suited for parallel_for, and using parallel_do instead would result in unnecessary overhead also. So I'm thinking of make LOOP1 parallel as well, and get something like the following:


struct FtorProcessTInfoVecTLS{

    typedef enumerable_thread_specific<TInfoVectors>::const_range_type range_t;

    int i;

    FtorProcessTInfoVecTLS(int i_p)

    : i(i_p)


    void operator()(range_t & r) const {

        // the range only has one element

         parallel_for(blocked_range<size_t>(0,(*r.begin()).size()), FtorProcessTInfo((*r.begin())));//LOOP2





for (int i = 0; i < 8; i++)

    parallel_for(tinfoVecsTLS.range(1), FtorProcessTInfoVecTLS(i),simple_partitioner());//LOOP1


However, I'm not sure how nesting the parallel_for inside another parallel_for would work, nor if it is the best way to do it. Any comment or suggestions are very welcome.

Thanks in advance!

0 Kudos
1 Reply

Hi, nesting works perfectly with TBB. If you are concerned about every last bit of the performance and the work is very small, use the same task_group_context for all parallel_fors

0 Kudos