enumerable_thread_specific and nested parallel_for

Yaidel_R_ · 11-29-2013

I want to iterate over a 3D grid to obtain 8 std::vector<TInfo>s, where TInfo represents a set of operations. It is basically a coloring algorithm: each of the 8 vectors corresponds to one color, and every TInfo instance in a vector can be processed concurrently with all the others in the same vector (without data races). That being said, once I have the 8 vectors I could do the following (not necessarily valid C++ code):

[cpp]

std::vector<TInfo> tinfoVectors[8];

getTInfoVectors(grid, tinfoVectors); // This is a sequential procedure, iterates on the grid and gets the vectors

for (int i = 0; i < 8; i++)

parallel_for(blocked_range<size_t>(0,tinfoVectors[i].size()),FtorDoSomething(tinfoVectors[i])); // one parallel pass per color

[/cpp]

I want to parallelize getTInfoVectors. This procedure iterates over the grid and creates 8 TInfo instances for every grid cell according to a given pattern, pushing each instance into the right tinfoVector. Parallelizing it naively means multiple threads add elements to the same vector at the same time, which is a data race. Two possibilities come to mind:

1. Use tbb::concurrent_vector instead of std::vector. However, getTInfoVectors adds elements to tinfoVectors very intensively, so I don't think I would gain much from tbb::concurrent_vector, and I discarded it.
2. Use tbb::enumerable_thread_specific and have an independent tinfoVectors per thread. I expect this to be more efficient.
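The idea behind the per-thread option can be sketched with plain std::thread, so that the example stands alone without TBB. This is only an analogue of what enumerable_thread_specific provides, not TBB code. TInfo, Buckets, fillBuckets and collectPerThread are hypothetical names, and "one TInfo per color per cell" is a placeholder for the real pattern.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real TInfo.
struct TInfo { std::size_t cell; int color; };

// One vector per color, owned by exactly one worker thread.
typedef std::array<std::vector<TInfo>, 8> Buckets;

// Each worker appends only to its own Buckets, so no locking is needed.
// Placeholder pattern: one TInfo per color for every cell in [first, last).
void fillBuckets(std::size_t first, std::size_t last, Buckets& mine) {
    for (std::size_t cell = first; cell < last; ++cell)
        for (int color = 0; color < 8; ++color)
            mine[color].push_back(TInfo{cell, color});
}

// Splits the cells into contiguous slices, one slice per worker.
std::vector<Buckets> collectPerThread(std::size_t cellCount, std::size_t workers) {
    assert(workers > 0);
    std::vector<Buckets> perThread(workers);
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        std::size_t first = cellCount * w / workers;
        std::size_t last = cellCount * (w + 1) / workers;
        pool.emplace_back(fillBuckets, first, last, std::ref(perThread[w]));
    }
    for (std::size_t w = 0; w < pool.size(); ++w)
        pool[w].join();
    return perThread;
}
```

The cost of this layout is that each color ends up spread over several vectors, one per worker, instead of living in a single vector.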

The second option slightly complicates the processing of tinfoVectors. It can be done like this:

[cpp]

typedef std::vector<TInfo> TInfoVectors[8];

tbb::enumerable_thread_specific<TInfoVectors> tinfoVectorsETS;

getTInfoVectorsParallel(grid, tinfoVectorsETS);

for (int i = 0; i < 8; i++)

for (tbb::enumerable_thread_specific<TInfoVectors>::iterator t = tinfoVectorsETS.begin(); t != tinfoVectorsETS.end(); ++t) //LOOP1

parallel_for(blocked_range<size_t>(0,(*t)[i].size()),FtorDoSomething((*t)[i])); //LOOP2

[/cpp]

Since there is an implicit synchronization at the end of each parallel_for, LOOP1 adds unnecessary overhead: every per-thread vector is processed in its own parallel_for, one after another. If I could flatten LOOP1 and LOOP2 into a single parallel_for, I could avoid this. I'm aware of tbb::flattened2d, but parallel_for cannot be used with it because flattened2d only supports forward iterators, and replacing parallel_for with parallel_do would also introduce unnecessary overhead.
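One possible way to flatten the two loops without flattened2d is an offset table over the per-thread vectors, so that a single index range covers all of them. The following is a minimal sketch under that assumption, using only the standard library. FlatView and the TInfo stand-in are hypothetical names, not part of TBB.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real TInfo.
struct TInfo { int value; };

// Maps a flat index in [0, total()) onto (vector, element) across several
// vectors, giving random access over their concatenation without copying.
class FlatView {
public:
    explicit FlatView(const std::vector<std::vector<TInfo>*>& parts)
        : parts_(parts), offsets_(1, 0) {
        // offsets_[p] is the number of elements stored before part p.
        for (std::size_t p = 0; p < parts_.size(); ++p)
            offsets_.push_back(offsets_.back() + parts_[p]->size());
    }

    std::size_t total() const { return offsets_.back(); }

    TInfo& operator[](std::size_t flat) const {
        assert(flat < total());
        // Find the first offset strictly greater than flat; the owning part
        // is the one just before it. Empty parts are skipped automatically.
        std::size_t p = static_cast<std::size_t>(
            std::upper_bound(offsets_.begin(), offsets_.end(), flat)
            - offsets_.begin()) - 1;
        return (*parts_[p])[flat - offsets_[p]];
    }

private:
    std::vector<std::vector<TInfo>*> parts_;
    std::vector<std::size_t> offsets_;
};
```

For color i, the parts would be the addresses of (*t)[i] for every t in tinfoVectorsETS, and the view could then be processed by a single parallel_for over blocked_range<size_t>(0, view.total()). The binary search costs O(log P) per access for P parts; a functor could look up the part once per chunk instead of once per element.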

Therefore, I'm thinking of nesting two parallel_for (i.e. also making LOOP1 parallel), something like this:

[cpp]

struct FtorProcessETS{

// declare local variables &tinfoVectorsETS and i and suitable constructor that initialize them

//...

void operator()(const range_type &r) const { // parallel_for requires a const operator()

// r contains only one element...

parallel_for(blocked_range<size_t>(0,(*r.begin())[i].size()),FtorDoSomething((*r.begin())[i])); // LOOP2

}

};

//...

for (int i = 0; i < 8; i++)

parallel_for(tinfoVectorsETS.range(1), FtorProcessETS(tinfoVectorsETS,i), tbb::simple_partitioner()); // LOOP1

[/cpp]

But honestly, I'm not sure how nested parallel_for calls work, nor whether this would be the best solution.

Any comment or suggestion is very welcome! Thanks in advance!

Yaidel_R_ · 11-30-2013

I submitted this question twice. When I first submitted it on Friday, it didn't appear on the forum. I waited until Saturday, and since it still wasn't there, I submitted the same question again (I thought I had done something wrong while submitting it). Please check http://software.intel.com/en-us/forums/topic/494844 and leave your comments/suggestions there. I'm sorry for the inconvenience.