enumerable_thread_specific and nested parallel_for

Yaidel_R_ · 11-30-2013

Hello,

I'm working on the following problem: I need to iterate over a 3D grid and create 8 std::vector<TInfo>s, where each TInfo instance represents a set of operations. This is a kind of coloring algorithm: each of the 8 vectors corresponds to a color, and all the TInfo instances in the same vector can be processed concurrently without data races. That said, I can do something like the following (not necessarily valid C++ code):

[cpp]
std::vector<TInfo> tinfoVectors[8];

obtainTInfoVectors(grid, tinfoVectors); // This is a sequential procedure

// Process one color at a time; items within a color are independent.
for (int i = 0; i < 8; i++)
    parallel_for(blocked_range<size_t>(0, tinfoVectors[i].size()),
                 FtorProcessTInfo(tinfoVectors[i]));
[/cpp]

I want to parallelize obtainTInfoVectors. This function goes through all the grid cells, creates 8 TInfo instances per cell, and pushes each instance into the right vector in tinfoVectors. In a multi-threaded implementation, different threads would try to add elements to tinfoVectors at the same time, creating data races. Two solutions come to mind:

1. Replace std::vector with tbb::concurrent_vector. The problem is that obtainTInfoVectors adds elements to the vectors very intensively, and I don't expect much gain from tbb::concurrent_vector, so I discarded this option.
2. Use tbb::enumerable_thread_specific to give each thread its own tinfoVectors. This, I believe, is more efficient.

Using tbb::enumerable_thread_specific, the code can change to something like this:

[cpp]
typedef std::vector<TInfo> TInfoVectors[8];
typedef enumerable_thread_specific<TInfoVectors> TInfoVecsTLS;

TInfoVecsTLS tinfoVecsTLS;

obtainTInfoVectorsParallel(grid, tinfoVecsTLS);

for (int i = 0; i < 8; i++)
    for (TInfoVecsTLS::iterator t = tinfoVecsTLS.begin(); t != tinfoVecsTLS.end(); ++t) // LOOP1
        parallel_for(blocked_range<size_t>(0, (*t)[i].size()),
                     FtorProcessTInfo((*t)[i])); // LOOP2
[/cpp]

However, since parallel_for has some implicit synchronization, I'm afraid that LOOP1 can introduce unnecessary overhead. Ideally, I would flatten LOOP1 and LOOP2 into one loop and perform a single parallel_for. I'm aware of flattened2d, but since it only supports forward iterators it is not suited for parallel_for, and using parallel_do instead would also add unnecessary overhead. So I'm thinking of making LOOP1 parallel as well, to get something like the following:

[cpp]
struct FtorProcessTInfoVecTLS {
    // range() on a non-const container returns range_type, not const_range_type
    typedef enumerable_thread_specific<TInfoVectors>::range_type range_t;

    int i; // color index

    FtorProcessTInfoVecTLS(int i_p)
        : i(i_p)
    {}

    void operator()(range_t & r) const {
        // the range only has one element (grainsize 1 with simple_partitioner)
        parallel_for(blocked_range<size_t>(0, (*r.begin())[i].size()),
                     FtorProcessTInfo((*r.begin())[i])); // LOOP2
    }
};

//...

for (int i = 0; i < 8; i++)
    parallel_for(tinfoVecsTLS.range(1), FtorProcessTInfoVecTLS(i), simple_partitioner()); // LOOP1
[/cpp]

However, I'm not sure how nesting a parallel_for inside another parallel_for would work, or whether it is the best way to do it. Any comments or suggestions are very welcome.
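On the flattening idea mentioned above, one possible alternative to flattened2d is to address the per-thread vectors of a given color through prefix sums of their sizes. That gives a random-access flat index space that a single parallel_for over blocked_range<size_t> could consume. The sketch below is plain C++ with no TBB calls. FlatIndex and locate are hypothetical names, not part of TBB, and it assumes the sizes are collected from the enumerable_thread_specific container after obtainTInfoVectorsParallel has finished.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not a TBB class): maps a flat index in [0, total())
// to a (chunk, offset) pair, where chunk k has sizes[k] elements.
class FlatIndex {
public:
    explicit FlatIndex(const std::vector<std::size_t>& sizes) {
        ends_.reserve(sizes.size());
        std::size_t running = 0;
        for (std::size_t s : sizes) {
            running += s;
            ends_.push_back(running); // exclusive end of this chunk in flat space
        }
    }

    // Total number of elements across all chunks.
    std::size_t total() const { return ends_.empty() ? 0 : ends_.back(); }

    // Binary search over the prefix sums: O(log #chunks) per lookup.
    // Empty chunks are skipped because their end equals the previous end.
    std::pair<std::size_t, std::size_t> locate(std::size_t flat) const {
        assert(flat < total());
        // First chunk whose exclusive end is strictly greater than flat.
        std::vector<std::size_t>::const_iterator it =
            std::upper_bound(ends_.begin(), ends_.end(), flat);
        std::size_t chunk = static_cast<std::size_t>(it - ends_.begin());
        std::size_t begin = (chunk == 0) ? 0 : ends_[chunk - 1];
        return std::make_pair(chunk, flat - begin);
    }

private:
    std::vector<std::size_t> ends_; // inclusive prefix sums of the chunk sizes
};
```

For example, with sizes {3, 0, 2}, flat index 3 maps to chunk 2, offset 0. Inside a parallel_for body it would probably be cheaper to call locate once for the start of the subrange and then walk forward, rather than calling it for every element.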

Thanks in advance!

Anton_M_Intel · 12-01-2013

Hi, nesting works perfectly with TBB. If you are concerned about every last bit of performance and the work is very small, use the same task_group_context for all the parallel_for calls.