Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

poor performance with nested parallel_for

R_T_
Beginner


void BigTask( int i )
{
  Data data = GetData(i);
  vector<int> subItemList = GetSubItems(i);

  parallel_for( blocked_range<int>( 0, (int)subItemList.size() ),
    [&] ( const blocked_range<int>& range ) {
      for( int j = range.begin(); j != range.end(); ++j )
        MedTask1(data);
    }
  );

  SmallTask1(data);

  parallel_for( blocked_range<int>( 0, (int)subItemList.size() ),
    [&] ( const blocked_range<int>& range ) {
      for( int j = range.begin(); j != range.end(); ++j )
        MedTask2(data);
    }
  );

  SmallTask2(data);

  parallel_for( blocked_range<int>( 0, (int)subItemList.size() ),
    [&] ( const blocked_range<int>& range ) {
      for( int j = range.begin(); j != range.end(); ++j )
        MedTask3(data);
    }
  );

  SmallTask3(data);

  parallel_for( blocked_range<int>( 0, (int)subItemList.size() ),
    [&] ( const blocked_range<int>& range ) {
      for( int j = range.begin(); j != range.end(); ++j )
        MedTask4(data);
    }
  );
}

void Run()
{
  vector<int> itemList;
  parallel_for( blocked_range<int>( 0, (int)itemList.size() ),
    [&] ( const blocked_range<int>& range ) {
      for( int j = range.begin(); j != range.end(); ++j )
        BigTask(itemList[j]);
    }
  );
}


I have a list of large items to work on, each completely independent, and they vary in the amount of work required.  Each big item reads in a large amount of data and passes through a number of steps, some of which can be parallelized, plus some small tasks that do aggregation among the sub-items.  Profiling with VTune, I can see that when running with a single item in the outer parallel_for loop, I get very good CPU utilization from the inner loops parallelized within BigTask.  However, when I move to a full-size configuration (100-200 items in the outer loop and 50-150 items in the inner loops), I get very poor CPU utilization.  The large objects have high memory usage, so I try to limit the number of concurrent BigTasks.  I've tried limiting via grainsize (which makes it difficult to explicitly set a max concurrency), and I've also tried task_arena.  However, when I use the task_arena, it's not clear whether my inner loops can use any more threads than my outer loop is using.  I'd like to limit the BigTasks to, say, 6 concurrent items, with the sub-items utilizing all available resources for the medium tasks.  Is this nested parallel_for approach reasonable, or is there something better I'm missing?  Is it possible that my performance problems are due mostly to grainsize issues with the inner loops?  I appreciate any insights anyone can give.

Alexey-Kukanov
Employee

For the 3rd time in a row, I recommend trying a two-stage parallel_pipeline :) It allows you to specify the number of large objects processed simultaneously, and thus limit memory usage. The input stage would iterate over the itemList, taking one item at a time and passing it down the pipeline, essentially replacing your outer parallel_for. The second, parallel stage would call BigTask on the received item.
