Intel® oneAPI Threading Building Blocks

enumerable_thread_specific and nested parallel_for

Yaidel_R_
Beginner

Hello, 

I'm working on the following problem: I need to iterate over a 3D grid and create 8 std::vector<TInfo>s, where TInfo instances represent a set of operations. This is a kind of coloring algorithm: each of the 8 vectors corresponds to a color, and all the TInfo instances in the same vector can be processed concurrently without data races. That said, I can do something like the following (not necessarily valid C++ code):

[cpp]

std::vector<TInfo> tinfoVectors[8];
obtainTInfoVectors(grid, tinfoVectors); // This is a sequential procedure
// Colors run one after another; elements of the same color run in parallel.
for (int i = 0; i < 8; i++)
    parallel_for(blocked_range<size_t>(0, tinfoVectors[i].size()), FtorProcessTInfo(tinfoVectors[i]));

[/cpp]

I want to parallelize obtainTInfoVectors. This function goes through all the grid cells, creates 8 TInfo instances per cell, and pushes each instance into the right vector in tinfoVectors. In a multi-threaded implementation, different threads would add elements to the same vectors at the same time, creating data races. Two solutions come to mind:

  1. Replace std::vector with tbb::concurrent_vector: the problem is that obtainTInfoVectors adds elements to the vectors very intensively, and I don't expect much gain from tbb::concurrent_vector. Thus, I discarded this option.
  2. Use tbb::enumerable_thread_specific to have one tinfoVectors per thread. This, I believe, is more efficient.

Using tbb::enumerable_thread_specific, the code changes to something like this:

[cpp]

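// std::array (from <array>) rather than a raw array, which is safer as a TLS element type
typedef std::array<std::vector<TInfo>, 8> TInfoVectors;
typedef enumerable_thread_specific<TInfoVectors> TInfoVectorsTLS;
TInfoVectorsTLS tinfoVecsTLS;
obtainTInfoVectorsParallel(grid, tinfoVecsTLS);
for (int i = 0; i < 8; i++)
    for (TInfoVectorsTLS::iterator t = tinfoVecsTLS.begin(); t != tinfoVecsTLS.end(); ++t) // LOOP1
        parallel_for(blocked_range<size_t>(0, (*t)[i].size()), FtorProcessTInfo((*t)[i])); // LOOP2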

[/cpp]

However, since each parallel_for call returns only after all of its iterations have finished, I'm afraid that LOOP1 introduces unnecessary synchronization overhead. Ideally I would flatten LOOP1 and LOOP2 into one loop and run a single parallel_for. I'm aware of flattened2d, but since it only provides forward iterators it is not suited for parallel_for, and using parallel_do instead would also add unnecessary overhead. So I'm thinking of making LOOP1 parallel as well, to get something like the following:

[cpp]

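struct FtorProcessTInfoVecTLS {
    // range(1) on a non-const container returns range_type, not const_range_type
    typedef enumerable_thread_specific<TInfoVectors>::range_type range_t;
    int i; // color index
    FtorProcessTInfoVecTLS(int i_p) : i(i_p) {}
    void operator()(const range_t& r) const {
        // With grain size 1 and simple_partitioner each range holds one element,
        // but looping keeps the body correct for any range size.
        for (range_t::iterator t = r.begin(); t != r.end(); ++t)
            parallel_for(blocked_range<size_t>(0, (*t)[i].size()), FtorProcessTInfo((*t)[i])); // LOOP2
    }
};

//...

for (int i = 0; i < 8; i++)
    parallel_for(tinfoVecsTLS.range(1), FtorProcessTInfoVecTLS(i), simple_partitioner()); // LOOP1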

[/cpp]

However, I'm not sure how nesting a parallel_for inside another parallel_for would work, nor whether it is the best way to do it. Any comments or suggestions are very welcome.
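For comparison, the flattening could also be done by hand: concatenate one color's per-thread vectors into a single contiguous vector, then run one parallel_for over it with a plain blocked_range<size_t>. This is a different technique from flattened2d and costs one extra copy per element. The sketch below uses only the standard library; mergeColor and its template parameters are hypothetical names, and PerThread is assumed to be any container of 8-bucket arrays (an enumerable_thread_specific would fit, a std::vector is used in the usage example).

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Concatenates bucket `color` from every per-thread entry into one vector.
// PerThread must offer const_iterator, begin() and end(), and its elements
// must be indexable by color.
template <typename T, typename PerThread>
std::vector<T> mergeColor(const PerThread& perThread, std::size_t color) {
    // First pass: compute the total size so the result is allocated once.
    std::size_t total = 0;
    for (typename PerThread::const_iterator t = perThread.begin(); t != perThread.end(); ++t)
        total += (*t)[color].size();

    // Second pass: append each per-thread bucket in iteration order.
    std::vector<T> merged;
    merged.reserve(total);
    for (typename PerThread::const_iterator t = perThread.begin(); t != perThread.end(); ++t)
        merged.insert(merged.end(), (*t)[color].begin(), (*t)[color].end());
    return merged;
}
```

A call such as mergeColor<TInfo>(tinfoVecsTLS, i) would then yield one vector per color, at which point the original single-level loop over 8 colors applies unchanged. Whether the copy is cheaper than the nested loops depends on the size of TInfo and the cost of processing it.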

Thanks in advance!

Anton_M_Intel
Employee

Hi, nesting works perfectly well with TBB. If you are concerned about every last bit of performance and the work is very small, use the same task_group_context for all the parallel_for calls.
