Sure, file a bug report if you've identified a problem with TBB, but let's first understand the problem. For that I refer to the initial code statement. And I'm having some difficulty understanding it. For example, the code has a join operation that appears to scan over the entire range of "m_countLeft" but no reduction kernel (the operator() functor) to do the parallel work. That kernel should have a for construct using r.begin() and r.end() to capture the local reduction range that the current thread should operate over. The reduction function should be operating on only a pair of reduced accumulations, something along the lines of
m_countLeft = _mm_add_ps(m_countLeft, other.m_countLeft);
There should be a 16-byte aligned list of triangle structures that can be passed initially to the parallel_reduce function. The task scheduler would then launch instances of the operator() functor to reduce the data to a set of local copies, one per task. These will then be passed to the join operator for the final reduction. You might also define a blocked_range
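The shape described above can be sketched without TBB itself. This is only an illustration under assumptions: Triangle, Counts, CountBody, split_tag, reduce_range and count_left are hypothetical names, the real body would take a tbb::split argument and a TBB range instead of the stand-ins used here, and plain floats replace the __m128 accumulator so the sketch stays portable. The hand-rolled recursive driver only imitates what the scheduler does: split, run the kernel on each piece, then join pairs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the poster's triangle type.
struct Triangle {
    float centroid[4];  // x, y, z, padding
};

// Four counters, 16-byte aligned so they could live in one SSE register.
struct alignas(16) Counts {
    float lane[4];
};

// Stand-in for tbb::split (assumption: only used as a tag here).
struct split_tag {};

class CountBody {
public:
    Counts m_countLeft;
    const Triangle* m_tris;
    float m_split;

    CountBody(const Triangle* tris, float split)
        : m_countLeft(), m_tris(tris), m_split(split) {}

    // Splitting constructor: same inputs, fresh zeroed accumulator.
    CountBody(const CountBody& other, split_tag)
        : m_countLeft(), m_tris(other.m_tris), m_split(other.m_split) {}

    // Kernel: touches only the local sub-range [begin, end).
    void operator()(std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            for (int axis = 0; axis < 3; ++axis)
                if (m_tris[i].centroid[axis] < m_split)
                    m_countLeft.lane[axis] += 1.0f;
    }

    // Join: combines exactly two accumulators, never rescans the input.
    // With SSE this loop would be a single _mm_add_ps.
    void join(const CountBody& other) {
        for (int k = 0; k < 4; ++k)
            m_countLeft.lane[k] += other.m_countLeft.lane[k];
    }
};

// Toy serial driver imitating the split / execute / join pattern.
void reduce_range(CountBody& body, std::size_t begin, std::size_t end,
                  std::size_t grain) {
    assert(begin <= end);
    assert(grain > 0);
    if (end - begin <= grain) {
        body(begin, end);
        return;
    }
    std::size_t mid = begin + (end - begin) / 2;
    CountBody right(body, split_tag());
    reduce_range(body, begin, mid, grain);
    reduce_range(right, mid, end, grain);
    body.join(right);
}

// Convenience wrapper: count, per axis, the centroids left of the split.
Counts count_left(const std::vector<Triangle>& tris, float split,
                  std::size_t grain) {
    CountBody body(tris.data(), split);
    reduce_range(body, 0, tris.size(), grain);
    return body.m_countLeft;
}
```

The point of the sketch is the division of labor: operator() does all the scanning, and join only adds two already-reduced accumulators.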
Perhaps you can provide more detail about the reduction operation you'd like to perform and the nature of the original triangle objects that would be processed by this code?
Sorry. It was a long day yesterday and I failed to parse your code snippet correctly. So your goal is to have all the ScanTriangles 16-byte aligned so that the m_countLeft array will be properly aligned for the reduce operation. And I think I see the problem.
I did a little snooping into the parallel_reduce code and found this function:
template<typename Range, typename Body, typename Partitioner>
task* start_reduce<Range,Body,Partitioner>::execute() {
    Body* body = my_body;
    if( is_stolen_task() ) {
        finish_reduce<Body>* p = static_cast<finish_type*>(parent() );
        body = new( p->zombie_space.begin() ) Body(*body,split());
        my_body = p->right_zombie = body;
    }
    task* next_task = NULL;
    if ( my_partitioner.should_execute_range(my_range, *this))
        (*my_body)( my_range );
    else {
        finish_reduce<Body>& c = *new( allocate_continuation()) finish_type(body);
        recycle_as_child_of(c);
        c.set_ref_count(2);
        start_reduce& b = *new( c.allocate_child() ) start_reduce(Range(my_range,split()), body, Partitioner(my_partitioner,split()));
        c.spawn(b);
        next_task = this;
    }
    return next_task;
}
This is the execute function that does the partitioning for parallel_reduce. If the input range is too small to split, the blue section of code (the branch guarded by my_partitioner.should_execute_range) invokes operator() to scan the range. If the input range is splittable, the green section (the else branch) creates a new task to handle the other half and kicks it off. When a new thread enters to execute, it takes the yellow path (the is_stolen_task() branch) and creates a new body object out of storage in the finish_reduce object. That storage, zombie_space, is called an aligned_space but is not declared with any specific alignment. That might well be the source of the unaligned m_countLeft array causing the alignment faults you've experienced.
Here's maybe a simple test of this diagnosis. The size of Body affects the size of the finish_reduce class, so there can't be any runtime library code that depends on its size. It probably wouldn't disturb anything to stick an _MM_ALIGN16 ahead of the zombie_space declaration in tbb/parallel_reduce.h and see if that cuts the alignment faults:
template<typename Body>
class finish_reduce: public task {
    Body* const my_body;
    Body* right_zombie;
    _MM_ALIGN16
    aligned_space<Body,1> zombie_space;
It's a hack but a useful test.
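To make the idea concrete, here is a small standalone sketch of why the storage declaration matters for placement new. It does not use TBB; Accumulator, AlignedSpace16, Holder, is_aligned16 and construct_in are hypothetical names, and standard alignas stands in for the compiler-specific _MM_ALIGN16. The assumption being illustrated is simply that an object constructed into a byte buffer inherits whatever alignment the buffer has.

```cpp
#include <cassert>
#include <cstdint>
#include <new>

// Hypothetical body type that needs 16-byte alignment (as an SSE member would).
struct alignas(16) Accumulator {
    float lane[4];
};

// Byte storage explicitly declared with 16-byte alignment.
template <typename T>
struct AlignedSpace16 {
    alignas(16) unsigned char bytes[sizeof(T)];
    void* begin() { return bytes; }
};

// True when the address is a multiple of 16.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Loosely mimics a continuation object: a few members of odd total size,
// followed by the storage the stolen task constructs its body into.
struct Holder {
    void* my_body;
    void* right_zombie;
    char tag;  // pushes the next member to an address that is not naturally 16-aligned
    AlignedSpace16<Accumulator> zombie_space;
};

// Placement-new a body into the storage, checking the alignment first.
inline Accumulator* construct_in(AlignedSpace16<Accumulator>& space) {
    assert(is_aligned16(space.begin()));
    return new (space.begin()) Accumulator();
}
```

Without the alignas on the byte array, nothing would stop the compiler from placing the storage directly after tag, at an address that is not a multiple of 16. Note also that aligning the member only helps if the containing object is itself allocated at a suitably aligned address.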
Well, that code snippet got kind of messed up. Apparently trusting too much in the features of this forum tool, my highlight coding failed to transfer but the text coding did, causing part of the code to end up white on white. Let's try this again:
template<typename Range, typename Body, typename Partitioner>
task* start_reduce<Range,Body,Partitioner>::execute() {
    Body* body = my_body;
    if( is_stolen_task() ) {
        finish_reduce<Body>* p = static_cast<finish_type*>(parent() );
        body = new( p->zombie_space.begin() ) Body(*body,split());
        my_body = p->right_zombie = body;
    }
    task* next_task = NULL;
    if ( my_partitioner.should_execute_range(my_range, *this))
        (*my_body)( my_range );
    else {
        finish_reduce<Body>& c = *new( allocate_continuation()) finish_type(body);
        recycle_as_child_of(c);
        c.set_ref_count(2);
        start_reduce& b = *new( c.allocate_child() ) start_reduce(Range(my_range,split()), body, Partitioner(my_partitioner,split()));
        c.spawn(b);
        next_task = this;
    }
    return next_task;
}
I won't touch it further. Perhaps now it will all be visible.
I think there are two problems here.
I'll take up fixing this since I'm working on the task scheduler anyway. I'll start by adding regression tests to our unit tests. Alas, the fix will miss the 2.0 update release. I'll let people know when it shows up in the development release.
I've fixed the problem and added regression tests to my private copy of TBB. Since it may take a few days for the changes to propagate to the downloadable "development" sources, you might try the following hack if you are willing to recompile the TBB library from the sources.
After those changes, I think that parallel_reduce should work as expected.
- Arch