for (int i = r.begin(); i != r.end(); ++i) {
    if (i == 0) ApplyFoo(Array[0]);
    if (i == 1) ApplyFoo(Array[1]);
    if (i == 2) ApplyFoo(Array[2]);
    if (i == 3) ApplyFoo(Array[3]);
}
Note that my iteration range is from 0 to 3 and I set GrainSize = 1. The idea is to hand each core a single ApplyFoo call, which should store its results in the corresponding Array element. ApplyFoo contains a lot of operations, so it makes sense to use GrainSize = 1. Afterwards I would like to collect the four Arrays.
The problem I experience is that each Array element does not contain its own results but a mixture, and in some cases all the Arrays end up with the same values. It looks as if the results have been mixed up.
This does not happen if GrainSize is greater than 4, but then everything runs sequentially and there is no difference from the usual sequential code.
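For reference, here is a minimal sketch of how I set up the parallel_for (the Result type and the body of ApplyFoo below are placeholders, not my real code):

#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// Placeholder stand-ins for my real data and computation:
struct Result { double data[100]; };
void ApplyFoo(Result& r) {                       // in reality a heavy, independent computation
    for (int k = 0; k < 100; ++k) r.data[k] = k * 0.5;
}

class FooWorker {
    Result* Array;
public:
    FooWorker(Result* a) : Array(a) {}
    void operator()(const tbb::blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); ++i) {
            if (i == 0) ApplyFoo(Array[0]);
            if (i == 1) ApplyFoo(Array[1]);
            if (i == 2) ApplyFoo(Array[2]);
            if (i == 3) ApplyFoo(Array[3]);
        }
    }
};

int main() {
    tbb::task_scheduler_init init;
    Result Array[4];
    // GrainSize = 1 so each index can be handed to a different core
    tbb::parallel_for(tbb::blocked_range<int>(0, 4, /*grainsize=*/1), FooWorker(Array));
    return 0;
}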
I'm sure I'm missing something.
Thanks a lot for your help !!!
Cheers,
Francesco
The first observation I make looking at this code is that it should suffer some thrashing, because the various threads return their results to adjacent elements of Array, which could share a common cache line (there is no indication of how large the elements of Array are). This is a case of "false sharing": no bad data is written, just good data thrashing a cache line.
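If the elements of Array are small, one quick way to rule false sharing out (a sketch only, assuming 64-byte cache lines and a double-sized element) is to pad each slot out to its own line:

// Sketch: give each result slot its own 64-byte cache line so that
// adjacent slots written by different threads never share a line.
struct PaddedResult {
    double value;                        // whatever the real element is
    char pad[64 - sizeof(double)];       // fill the remainder of the line
};

PaddedResult Array[4];                   // one cache line per thread's output

That said, false sharing only costs performance; it cannot by itself mix up the values you see.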
Next, I wonder whether this same construct could not be represented equivalently by this much simpler code:
for (int i = r.begin(); i != r.end(); ++i) {
    ApplyFoo(Array[i]);
}
The mixed results you describe sound like the result of a race condition, but nothing in the code leaps out at me to explain why. It could be explained if ApplyFoo does not respect the element size of Array, or if ApplyFoo is not thread safe for one reason or another. I cannot speculate further without more details.
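Purely as an illustration (this is not your code; Result and compute_element are made-up placeholders), an ApplyFoo that stages its work through a shared static buffer would mix results in exactly the way you describe:

// Hypothetical non-thread-safe ApplyFoo: every thread shares one scratch buffer.
struct Result { double data[100]; };
double compute_element(int k);             // placeholder for the real per-element work

static double scratch[100];                // one buffer shared by all threads

void ApplyFoo(Result& out) {
    for (int k = 0; k < 100; ++k)
        scratch[k] = compute_element(k);   // two threads overwrite each other here
    for (int k = 0; k < 100; ++k)
        out.data[k] = scratch[k];          // whichever thread copies last gets a mixture
}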
typedef spin_mutex LibMutexType;
LibMutexType LibMutex1;
.....
.....
for (int i = r.begin(); i != r.end(); ++i) {
    if (i == 0) { LibMutexType::scoped_lock mylock1(LibMutex1);
                  ApplyFoo(Array[0]); }
    if (i == 1) { LibMutexType::scoped_lock mylock1(LibMutex1);
                  ApplyFoo(Array[1]); }
    if (i == 2) { LibMutexType::scoped_lock mylock1(LibMutex1);
                  ApplyFoo(Array[2]); }
    if (i == 3) { LibMutexType::scoped_lock mylock1(LibMutex1);
                  ApplyFoo(Array[3]); }
}
Even though this code runs properly, there is clearly no improvement at all over the usual serial code. Do you have any suggestion for running ApplyFoo on the different cores at the same time, but without "interference" between the different threads?
Thanks again !
Francesco C.
There are only two ways out of this dilemma: replace the non-thread-safe ApplyFoo with a version that is reentrant (one that multiple processors can execute simultaneously), or serialize the code as has been done in the example above. Only one branch can hold the lock at a time, so the code should run about like the serial version, albeit with a bit more overhead because of all the extra code. And given my previous simplified example, I'm not sure why the above code could not be similarly simplified:
for (int i = r.begin(); i != r.end(); ++i) {
    LibMutexType::scoped_lock mylock1(LibMutex1);
    ApplyFoo(Array[i]);
}
This makes it pretty clear that only one copy of ApplyFoo is going to run at a time.
Why might ApplyFoo not be thread safe? It might exhibit a variety of symptoms, but the problems often come from writing to global data. Reading only read-only global data and writing only to local-scope automatics or to partitioned shared data are among the techniques for avoiding races, where two processors race to write and/or read the same location. When changing state must be shared among processors, using locks with a narrower scope than the global lock employed in the example above can reduce the contention range and thus open up blocks of concurrent execution.
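For instance (just a sketch, assuming ApplyFoo can be split into an independent compute phase and a small genuinely shared update; Result, SharedStats, heavy_compute and accumulate are made-up names):

// Sketch: heavy work on purely local data, lock held only for the shared part.
void ApplyFooSafe(Result& out, SharedStats& stats) {
    Result local;                        // automatic storage, private to this thread
    heavy_compute(local);                // touches no shared state
    out = local;                         // each index writes only its own slot

    {   // brief critical section for the genuinely shared update
        LibMutexType::scoped_lock lock(LibMutex1);
        stats.accumulate(local);         // made-up shared accumulator
    }
}

With the lock confined to those last few lines, the four heavy_compute calls can genuinely overlap on the four cores.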
