Alignment??

michaelnikelsky · ‎09-20-2007

Hi,

Im still pretty new to threading building blocks and immediately ran into problems. I want to use parallel_reduce to find some minimal and maximal values for some triangles. However, I am using SSE, so it is necessary to align my data. So my first try for the cals:

_MM_ALIGN16 class ScanTriangles
{
public:
_m128 m_countLeft[INITIALCANDIDATES ];

void operator(const tbb::blocked_range< unsigned int>& range)
{
.....
}
....
void join( const ScanTriangles& other)
{

for( unsigned int i = 0; i < numCandidates; ++i)
{
m_countLeft = _mm_add_ps( m_countLeft, other.m_countLeft);
}
}

However, this crashes in the join function since other is not 16 Byte aligned.

So my second try was to derive from a aligned base class:

class AlignedBaseClass
{
public:
FINLINE void* operator new( std::size_t size) throw( std::bad_alloc)
{
if( void* ptr = aligned_malloc( size, 16))
return ptr;
throw std::bad_alloc();
}

FINLINE void* operator new[]( std::size_t size) throw( std::bad_alloc)
{
return operator new(size);
}

FINLINE void* operator new( std::size_t size, const std::nothrow_t&) throw()
{
return aligned_malloc( size, 16);
}

FINLINE void* operator new[]( std::size_t size, const std::nothrow_t&) throw()
{
return operator new(size, std::nothrow);
}

FINLINE void operator delete( void* ptr) throw ()
{
if( ptr != 0)
aligned_free( ptr);
}

FINLINE void operator delete[]( void* ptr) throw()
{
operator delete( ptr);
}

FINLINE void operator delete( void* ptr, const std::nothrow_t&) throw()
{
if( ptr != 0)
aligned_free( ptr);
}

FINLINE void operator delete[]( void* ptr, const std::nothrow_t&) throw()
{
operator delete( ptr, std::nothrow);
}*
};

_MM_ALIGN16 class ScanTriangles : : public AlignedBaseClass

But this doesnt even compile anymore but gives this error:

D:software bb-2.0WIN32include bb /parallel_reduce.h(89) : error C2665: 'AlignedBaseClass::operator new' : none of the 2 overloads could conve
rt all the argument types
d:srcvredsourcecommonlibs aytracingut
ilsAlignedBaseClass.h(44): could be 'void *AlignedBaseClass::operator n
ew(size_t,const std::nothrow_t &) throw()'
while trying to match the argument list '(size_t, ScanTriangles *)'
D:software bb-2.0WIN32include bb/parallel_reduce.h(85) :
while compiling class template member function 'tbb::task *tbb::internal::start
_reduce::execute(void)'
with
[
Range=tbb::blocked_range,
Body=ScanTriangles
]
D:software bb-2.0WIN32include bb/parallel_reduce.h(122)
: see reference to class template instantiation 'tbb::internal::start_reduce
ge,Body>' being compiled
with
[
Range=tbb::blocked_range,
Body=ScanTriangles
]

So how can I use SSE with threading building blocks? Using unaligned loads/stores is not really an option since it is a lot slower.

Any ideas?

TimP · ‎09-20-2007

It seems a reasonable request to have TBB compatible with alignment, perhaps necessary for use with non-Intel compilers. I would suggest a bug report with smallest possible complete examples.

robert-reed · ‎09-20-2007

Sure, file a bug report if you've identified a problem with TBB, but let's first understand the problem. For that I refer to the initial code statement. And I'm having some difficulty understanding it. For example, the code has a join operation that appears to scan over the entire range of"m_countLeft"but no reduction kernel (the operator() functor) to do the parallel work. That kernel should have a for construct using r.begin() and r.end() to capture the local reduction range that the current thread should operator over. The reduction function should be operating on only a pair of reduced accumulations, something along the lines of

m_countLeft = _mm_add_ps(m_countLeft, other.m_countLeft);

There should be a 16-byte aligned list of triangle structures that can be passed initially to the parallel_reduce function. The task scheduler would then launch instances of the operato() functor to reduce the data to a set of local copies, one per task. These will then be passed to the join operator for the final reduction. You might also define a blocked_range that operates over an array of 16-byte-aligned triangle objects to ensure their alignment.

Perhaps you can provide more detail about the reduction operation you'd like to perform and the nature of the original triangle objects that would be processed by this code?

michaelnikelsky · ‎09-21-2007

Sorry, I didnt show what the operator() does since it doesnt matter ( and is about 1000 lines of code...). But to make it short: I sweeps over a list of triangles and compares each triangle to a number (INITIALCANDIDATES * 4) of splitting planes. As the result I simply need the number of triangles lying left of each splitting plane, thats why there still is the loop in the join function (it is not the actual work thats done).

By the way: The code works perfectly when I use the unaligned loads and stores. So the problem really is deep inside the threading building blocks, to be more precise: The parallel_reduce calls a new and the splitting constructor at some point. The new has a second parameter, a memory pool, which seems to be not aligned and therefore it crashes.

I will send a bug report with a small example, shouldnt be too hard to reproduce.

robert-reed · ‎09-21-2007

Sorry. It was a long day yesterday and I failed to parse your code snippet correctly. So your goal is to have all the ScanTriangles 16-byte aligned so that the m_countLeft array will be properly aligned for the reduce operation. And I think I see the problem.

I did a little snooping into the parallel_reduce codeand foundthis function:

template
task* start_reduce::execute() {
Body* body = my_body;
if( is_stolen_task() ) {
finish_reduce* p = static_cast(parent() );
body = new( p->zombie_space.begin() ) Body(*body,split());
my_body = p->right_zombie = body;
}
task* next_task = NULL;
if ( my_partitioner.should_execute_range(my_range, *this))
(*my_body)( my_range );
else {
finish_reduce& c = *new( allocate_continuation()) finish_type(body);
recycle_as_child_of(c);
c.set_ref_count(2);
start_reduce& b = *new( c.allocate_child() ) start_reduce(Range(my_range,split()), body, Partitioner(my_partitioner,split()));
c.spawn(b);
next_task = this;
}
return next_task;
}

This is the execute function that does the partitioning for parallel_reduce. If the input range is too small to split, the blue section of code invokes the operator() to scan the range. If the input range is splittable, the green section creates a new task to handle the other half and kicks it off. When a new thread enters to execute, it takes the yellow path and creates a new body object out of storage in the finish_reduce object which, although it is called an aligned_space is not declared with any specific alignment. That might well be the source of the unaligned m_countLeft array causing the alignment faults you've experienced.

Here's maybe a simpletest ofthis diagnosis. The size of Body affects the size of the finish_reduce class, so there can't be anyruntime library code that depends on its size. It probably wouldn't disturb anything to stick an _MM_ALIGN16 ahead of the zombie_space declaration in tbb/parallel_reduce.h and see if that cuts the alignment faults:

template
class finish_reduce: public task {
Body* const my_body;
Body* right_zombie;
_MM_ALIGN16
&nbs p; aligned_space zombie_space;

It's a hack but a useful test.

robert-reed · ‎09-21-2007

Well, that code snippet got kind of messed up. Apparently trusting too much in the features of this forum tool, my highlight coding failed to transfer but the text coding did, causing part of the code to end up white on white. Let's try this again:

template
task* start_reduce::execute() {
Body* body = my_body;
if( is_stolen_task() ) {
finish_reduce* p = static_cast(parent() );
body = new( p->zombie_space.begin() ) Body(*body,split());
my_body = p->right_zombie = body;
}
task* next_task = NULL;
if ( my_partitioner.should_execute_range(my_range, *this))
(*my_body)( my_range );
else {
finish_reduce& c = *new( allocate_continuation()) finish_type(body);
recycle_as_child_of(c);
c.set_ref_count(2);
start_reduce& b = *new( c.allocate_child() ) start_reduce(Range(my_range,split()), body, Partitioner(my_partitioner,split()));
c.spawn(b);
next_task = this;
}
return next_task;
}

I won't touch it further. Perhaps now it will all be visible.

ARCH_R_Intel · ‎09-26-2007

I think there's two problems here.

As pointed out, the "zombie space" is not sufficiently aligned. That can be fixed as suggested. I may attempt to use a little partial specialization so that we're not wasting the full 16 bytes when the zombie space is small.
TBB on 32-bit platforms allocates task objects in the heap on 8-byte boundaries, not 16-byte boundaries. When TBB allocates a task object, it's really first allocating a "task_prefix" object followed immediately by the task object. The task prefixes on 32-bit system are 24 bytes. We need to pad them out to 32 bytes. A similar problem may exist for 64-bit platforms.

I'll take up fixing this since I'm working on the task scheduler anyway. I'll start by adding a regression test(s) to our unit tests.Alas the fix will miss the 2.0 update release. I'll let people know when it shows up in the development release.

ARCH_R_Intel · ‎09-26-2007

I've fixed the problem and added regression tests to my private copy of TBB. Since it may take afew days before the changes to propagate to the downloadble "developement" sources, you might try the following hack if you are willing to recompile the TBB library from the sources.

Add the _MM_ALIGN16 directive Robert suggested. Or, for a more thorough fix, edit include/tbb/aligned_space.h and add a __m128 member to the union in class aligned_space. (Adding a 16-byte alignment attribute might work too, except that many versions of gcc incorrectly handle attributes in template classes.)
Edit include/tbb/task.h and insert padding members at the front of class task_prefix to make sizeof(task_prefix) a multiple of 16. I think it's currently 24 bytes on 32-bit platforms, so adding a dummy member of type double should do the trick. It would be a good idea to write a small program that does "printf("%d ",int(sizeof(tbb::internal::task_prefix)))" to make sure the padding is the right amount. Likewise, adding a double should fix the situation on 64-bit platforms (it was 40 bytes, and should be padded to 48).
Recompile and rerun tests by running "gmake" at the top level. The library will be binary incompatible with previous versions, because of the layout difference. Afterwards, you should see that the address of a task object always fall on a 16-byte boundary.

After those changes, I think that parallel_reduce should work as expected..

- Arch

michaelnikelsky · ‎10-04-2007

Thanks, I will try that as soon as possible (Ive got plenty of work todo at the moment though :( )