Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

parallel_for randomly hangs

mcclanahoochie
Beginner
Hi,
I have a simple parallel_for loop that seems to hang on the majority of the runs of the code. Sometimes it works great, and other times it hangs.
When using gdb and breakpointing when it hangs, I find the following:
(gdb) bt
tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task ()
tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all ()
tbb::internal::generic_scheduler::local_spawn_root_and_wait ()
A slightly simplified version of the code is below, where "num_neighbors" is typically between 20 and 400.
tbb::parallel_for(tbb::blocked_range<int>(0, num_neighbors),
                  ParallelRegionDistanceEvaluator(&neighbor_infos,
                                                  num_descriptors,
                                                  &region_distances));
class ParallelRegionDistanceEvaluator {
 public:
  // NOTE: the forum software stripped the template arguments; NeighborInfo and
  // float are assumed element types here.
  ParallelRegionDistanceEvaluator(const vector<NeighborInfo>* neighbor_infos,
                                  const int num_descriptors,
                                  vector<float>* results) :
      neighbor_infos_(neighbor_infos),
      num_descriptors_(num_descriptors),
      results_(results)
  {}
  ParallelRegionDistanceEvaluator(const ParallelRegionDistanceEvaluator& rhs,
                                  tbb::split) :
      neighbor_infos_(rhs.neighbor_infos_),
      num_descriptors_(rhs.num_descriptors_),
      results_(rhs.results_)
  {}
  void operator()(const tbb::blocked_range<int>& r) const {
    vector<float> descriptor_distances(num_descriptors_);
    for (int i = r.begin(); i != r.end(); ++i) {
      const float region_dist = Evaluate((*neighbor_infos_)[i], descriptor_distances);
      (*results_)[i] = region_dist;
    }
  }
 private:
  const int num_descriptors_;
  const vector<NeighborInfo>* neighbor_infos_;
  vector<float>* results_;
};
Playing around a bit: forcing the "num_neighbors" value passed to parallel_for() to be greater than 100 or so *seems* to remedy/reduce the problem, but the code does occasionally run fine with any "num_neighbors" size, and I've also seen it fail once when the value was greater than 100.
It almost seems random when/if the code hangs during this call to parallel_for().
Any suggestions would be great.
Thanks,
~Chris
13 Replies
jimdempseyatthecove
Honored Contributor III
Chris,

Your private vectors, results_ and neighbor_infos_, have not been pre-extended, so the stores in the parallel_for can cause an expansion.
Rework your ctors so that they pre-extend those vectors to the working size.
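A minimal sketch of that pre-sizing, assuming a float result type (the real element types aren't shown in the post) and using a lambda as the parallel_for body for brevity:

#include <vector>
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

void compute_region_distances(int num_neighbors, std::vector<float>& region_distances) {
    // Size the output once, on one thread, before the parallel region, so that
    // no task ever has to grow the vector while other tasks are writing to it.
    region_distances.resize(num_neighbors);
    tbb::parallel_for(tbb::blocked_range<int>(0, num_neighbors),
        [&](const tbb::blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i)
                region_distances[i] = 0.0f;  // each index is written by exactly one task
        });
}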

Jim Dempsey
RafSchietekat
Valued Contributor III
"the code does reliably run every now and then"
:-)

parallel_for doesn't use the splitting constructor, so you can drop it. Constructing descriptor_distances on every operator() call seems costly. neighbor_infos_ and results_ could be references, but that's a matter of taste.
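A minimal sketch of that shape, with no splitting constructor and references instead of pointers; NeighborInfo and the trivial operator() body are placeholders, since the real types aren't shown in the post:

#include <vector>
#include "tbb/blocked_range.h"

struct NeighborInfo { };  // placeholder for the poster's actual element type

class RegionDistanceBody {
 public:
  RegionDistanceBody(const std::vector<NeighborInfo>& neighbor_infos,
                     std::vector<float>& results)
      : neighbor_infos_(neighbor_infos), results_(results) {}
  // No splitting constructor: parallel_for only ever copy-constructs the body.
  void operator()(const tbb::blocked_range<int>& r) const {
    for (int i = r.begin(); i != r.end(); ++i)
      results_[i] = 0.0f;  // stand-in for the real Evaluate() call
  }
 private:
  const std::vector<NeighborInfo>& neighbor_infos_;
  std::vector<float>& results_;
};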

I don't immediately see anything. What do you mean by "breakpointing when it hangs": attaching from gdb (breakpoints would be something else)? Maybe it's Evaluate() that's really hanging (use bt on all the threads)?
jimdempseyatthecove
Honored Contributor III
>>and results_ could be references

Whether results_ is a reference or a pointer does not matter. Since the user is using a std::vector, the stores in his parallel_for must not expand the array.

>>"the code does reliably run every now and then"

This "reliably run" is symptomatic of by chance of:

a) having the results_ vector of size large enough before call
b)having one thread manage to completly perform a store-oops-expand-store before another thread enters the operator[]

Simple enough to test.

Outside the parallel_for (i.e. prior to the call), issue:

results[numberResults] = 0;

This will assure the results vector is (re)sized appropriately (by one and only one thread).

Jim Dempsey
RafSchietekat
Valued Contributor III
Hmm, since there is no resizing inside the parallel code I am assuming that this has been set up correctly. I would also expect an exception or a crash, if anything, instead of hanging, if this were the problem. And results[numberResults] would be out of bounds, but it wouldn't resize the vector: if the STL library implies boundary checks (which it doesn't need to), there would be an exception, and otherwise either nothing happens or the application crashes.
mcclanahoochie
Beginner
Your private vectors: results_ and neighbor_infos_
have not been pre-extended ...
All vectors passed in as pointers are first pre-allocated to the correct length before calling parallel_for, and the num_neighbors value is constant and the same for each thread.
~Chris
mcclanahoochie
Beginner
parallel_for doesn't use the split constructor. descriptor_distances seems costly. neighbor_infos_ and results_ could be references, but that's a matter of taste.
Thanks, I will try removing this.
What do you mean by "breakpointing when it hangs":
I run my code inside gdb... when it hangs, I hit ctrl-c, then type bt, and that was the output.
The Evaluate() is a very simple const virtual function. Would a virtual function be a problem?
~Chris
mcclanahoochie
Beginner
Thanks to all for the suggestions. I will debug some more tonight and update here tomorrow sometime.

~Chris
mcclanahoochie
Beginner
OK, quick update...
It seems that switching to using the tbb::simple_partitioner() (instead of using the default auto_partitioner) has *so far* fixed the issue.
More testing is being done, though switching back and forth between the auto/simple partitioners seems to break/fix the issue, respectively...
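For reference, the partitioner is just an extra argument to parallel_for; this is the call from the top of the thread with it added explicitly:

tbb::parallel_for(tbb::blocked_range<int>(0, num_neighbors),
                  ParallelRegionDistanceEvaluator(&neighbor_infos,
                                                  num_descriptors,
                                                  &region_distances),
                  tbb::simple_partitioner());  // instead of the default auto_partitioner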
~Chris
RafSchietekat
Valued Contributor III
#6 "I run my code inside gdb... when it hangs, I hit ctrl-c, then type bt, and that was the output."
You should backtrace all the threads.
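In gdb that is a single command, run after interrupting the hung process:

(gdb) thread apply all bt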

#8 "It seems that switching to using the tbb::simple_partitioner() (instead of using the default auto_partitioner) has *so far* fixed the issue."
It could still be that auto_partitioner only revealed the issue. Try simple_partitioner with a (larger) grainsize: maybe you'll have the same result as with auto_partitioner, which would be strong evidence against the partitioner being the direct cause of the problem.
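A sketch of that experiment, with the grainsize passed as the third blocked_range argument (64 here is just an example value), again reusing the call from the original post:

tbb::parallel_for(tbb::blocked_range<int>(0, num_neighbors, 64),  // grainsize = 64
                  ParallelRegionDistanceEvaluator(&neighbor_infos,
                                                  num_descriptors,
                                                  &region_distances),
                  tbb::simple_partitioner());  // never splits chunks below the grainsize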
mcclanahoochie
Beginner
So, backtracing all the threads reveals that one thread (of 8) is stuck in TBB's receive_or_steal_task() and one is indeed stuck at the virtual function being called (when using the auto_partitioner).
I tried commenting out the entire inside of the virtual function (Evaluate) so that it does absolutely nothing, and it still hangs. Using the simple_partitioner works fine.
Increasing the grainsize to something like 64 or greater seems to help the auto_partitioner work better, while the simple_partitioner works regardless of the grainsize.
I will test more with various grainsizes with both partitioners, but will probably stick with the simple_partitioner for now.
Thanks for the help.
~Chris
RafSchietekat
Valued Contributor III
"I tried commenting out the entire inside of the virtual function (Evaluate) so that it does absolutely nothing, and it still hangs."
Strange.

"Increasing the grainsize to something like 64 or greater seems to help the auto_partitoner work better, while the simple_partitioner works regardless of the grainsize. "
Very strange.

Another idea would be to use an earlier version of TBB (we don't know which this one is). Of course we're working with only partial information here, so not all advice may be equally useful...
Anton_M_Intel
Employee
Yes, very strange. If you have a chance to prepare a [small] reproducer, I'll look into the issue.
Guillaume_L_B_
Beginner
I have noticed exactly the same issue with the Intel Smoke demo using TBB 4 update 3 (latest right now).
The custom_scheduler::receive_or_steal_task method is taking almost 100% of the main thread.
It seems, however, that when I switch the Scheduler Benchmarking off, the CPU usage becomes "normal".
Could it be the same issue in your program?