Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Problem with OpenMP addition into a shared array

smithpd
Beginner
Solution found

I have solved this problem myself, after a short discussion with a colleague. The first implementation described in this post is not thread safe; the second implementation, described below in post #4, is. The fundamental principle is that you must not allow two threads to add into (or write) the same array element at the same time. I was deluded into thinking that the first implementation would work because I was specifying that the "array" was shared. In fact, the shared clause only makes the pointer to the malloc'ed data visible to every thread; it does nothing to protect the elements themselves from concurrent updates. The random errors were evidently caused by collisions between two threads trying to modify the same element of the array.

This may be of interest to others who encounter similar issues.

Although a solution has been found, a secondary issue is that it would be nice if either OpenMP or the Intel compiler provided a mechanism to protect the elements of an array from collision, or at least gave a warning when they are not protected. Or do they?
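
One mechanism I have since come across is the atomic directive, which protects a single scalar update. A rough, untested sketch of how it would apply to the accumulate() function shown below (f1 and f2 are the same placeholders used in the original post):

void accumulate(float** out, int loop, int rowmax, int colmax, float x, float y)
{
    int i, j;
    float var = f1(x, y, loop);           // local to this call
    for (i = 0; i < rowmax; i++)
    {
        for (j = 0; j < colmax; j++)
        {
            float stuff = f2(i, j, var);
            #pragma omp atomic
            out[i][j] += stuff;           // only this one update is serialized
        }
    }
}

This is finer-grained than omp critical, but every update still has to synchronize, so I would not expect it to recover all of the lost speed.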

Original problem - see below for solution

I have some OpenMP code that performs additions into the elements of a shared 2-D array, but it produces random errors, as if the threads were interfering with one another. The code that performs the additions is in an external function; a pointer to the shared array is passed into it. The calling function is parallel, but the external function that does the additions is not. At issue is how a single iteration of a parallel for loop executes the external function, and whether multiple threads can interfere with each other's additions into the shared array. If I make the external code section that performs the addition omp critical, the random errors go away, but so does the speed gain -- the parallelism is destroyed. If I run the code with the #pragma omp commented out, it also works, which again destroys the parallel speed.

The logic is described below. Obviously it is wrong or not thread safe. Please help me fix this if you can.
Sorry, I don't know how to make my indents stick.

existing logic (not exact code)
--------------------------------------------------
excerpt from calling function, containing parallel for loop
--------------------------------------------------

float var1, var2;
int i, j;
float** outxy;   // shared accumulator

//... allocate_matrix() defines row pointers so outxy may be indexed as outxy[i][j]
outxy = allocate_matrix(rows, columns);

//... zero out accumulator
for (i = 0; i < rows; i++)
{
    for (j = 0; j < columns; j++)
    {
        outxy[i][j] = 0;
    }
}

//... calculate local variables that are independent of the loop index
var1 = xxx;

#pragma omp parallel for shared(outxy, nloop, rows, columns) private(iloop, var1, var2)
for (iloop = 0; iloop < nloop; iloop++)   // intent is to have one thread for each iloop index
{
    //... calculate local variables that depend on the iloop index
    var2 = yyy;
    //... for this value of iloop, add into the outxy elements using the external function accumulate()
    accumulate(outxy, iloop, rows, columns, var1, var2);   // called in parallel: each call is by one thread?
}

//... process outxy
free_matrix(outxy);
.
.
.

------------------------------------------------
external function that does accumulation
------------------------------------------------

void accumulate(float** out, int loop, int rowmax, int colmax, float x, float y)
{
    int i, j;
    float var, stuff;

    //... calculate local variables for this function
    var = f1(x, y, loop);

    //... perform addition for one value of "loop" in the calling function
    for (i = 0; i < rowmax; i++)
    {
        for (j = 0; j < colmax; j++)
        {
            //... add stuff into an element of the out matrix
            stuff = f2(i, j, var);
            out[i][j] += stuff;
        }
    }
}

------------------------

Thanks in advance for your help.
smithpd
Beginner
LOL. I feel stupid for omitting the braces.

OK, I put the braces in, and it worked with the outside-loop array allocation. Allocating outside the loop gives one array per thread, whereas allocating inside the loop gives one array per loop index. This saves about 0.5 seconds out of 3.5, so that aspect works.

But I still have a problem with the logic of the outer loop, and with the run time being slower, when I try to create the threads at that point. Referring to your post #10, if my code is essentially as you outlined it, I have variant 1:

// variant 1
//-----------

#pragma omp parallel (your shared/private here)
{   // scope of parallel region
    // allocate array here - creates one array per thread
    for (... outer loop here ...)      // entire loop runs for each thread
    {
        #pragma omp for                // (omit parallel from this line)
        for (... inner parallel loop as you have now ...)
        {
            ...
        }   // end of #pragma omp for
    }   // end of outer loop
    // free array, one per thread
}   // end scope of parallel region

Then it is thread safe, and it runs in 3.4 seconds.

Note that the entire outer loop is run separately by each thread, but the accumulation section is partitioned between threads, so the results are correct in the end. I verified this with print statements inside the outer loop. I find this code strange and hard to follow, and it also runs a little more slowly than the other variants below.

---------------------

If I parallelize the outer loop but not the inner loop, I have variant 2:

// variant 2
//-----------

#pragma omp parallel (your shared/private here)
{   // scope of parallel region
    // allocate array here, one per thread
    #pragma omp for                    // (omit parallel from this line)
    for (... outer loop here ...)      // this loop is now parallel and distributed between threads
    {
        for (... inner loop, non-parallel ...)
        {
            ...
        }   // end of inner non-parallel loop
    }   // end of parallel outer loop
    // free array, one per thread
}   // end scope of parallel region

This is basically my old code that failed because it is not thread safe -- more than one thread can try to accumulate numbers into the same element of the outxy array at the same time, so they collide randomly and corrupt the result. It runs in 2.9 seconds and gives random errors.

----------------------

If I parallelize both the outer and inner loops by putting "#pragma omp for" in front of each, then only 1/(number of threads) of the projections are accumulated in the inner loop. It runs 8x faster but the results are completely wrong. (As I understand it, OpenMP does not allow one work-sharing "omp for" to be nested directly inside another in the same parallel region, so this is not a legal construct anyway.)

----------------------

If I parallelize only the inner loop, my implementation #2 as reported in post #4, then I have variant 3:

// variant 3
//-----------

// allocate array only once and share it within the outer loop
for (... non-parallel outer loop ...)
{
    // populate array and call projection function
    #pragma omp parallel for
    for (... inner parallel loop as you have now ...)
    {
        ...
    }   // end of #pragma omp parallel for
}   // end of outer loop
// free array, only one


This is thread safe, it also runs in 2.9 seconds, and it is considerably simpler code for a variety of reasons.

It seems to me that variant 3 is the way to go, and nothing is to be gained by messing with the outer loop. Please correct me if I am wrong.
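
For the record, here is a simplified, untested sketch of what variant 3 looks like in my actual code (same names as in the first post); the outer loop over projections stays serial and only accumulate() is parallel:

void accumulate(float** out, int loop, int rowmax, int colmax, float x, float y)
{
    int i, j;
    float var, stuff;

    var = f1(x, y, loop);
    #pragma omp parallel for shared(out, rowmax, colmax, var) private(i, j, stuff)
    for (i = 0; i < rowmax; i++)          // the rows of out are partitioned between threads
    {
        for (j = 0; j < colmax; j++)
        {
            stuff = f2(i, j, var);
            out[i][j] += stuff;           // each element is only ever touched by one thread
        }
    }
}

Because the row index is partitioned between threads, no two threads ever add into the same element, which is why no critical or atomic section is needed.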

jimdempseyatthecove
Honored Contributor III

Try the following in the code with the outer loop as parallel (working, but more complex).

Use threadprivate to create a persistent temporary array, one per thread. Use two variables: a pointer to the array, and the allocation extent.

// in static data
int allocationSize = 0;
double* Array = 0;
#pragma omp threadprivate(allocationSize, Array)

//------------------
// in your subroutine
#pragma omp parallel ...
{
    if (allocationSize < sizeYouNeedNow)
    {
        if (Array) delete [] Array;
        Array = new double[sizeYouNeedNow];
        if (!Array) Barf();
        allocationSize = sizeYouNeedNow;
    }
    ...
}


The temp array will now persist across calls (reducing the number of malloc/free operations).

Note that the array will persist until program termination. You can add a cleanup routine:

void Cleanup()
{
    if (Array)
    {
        delete [] Array;
        Array = NULL;
        allocationSize = 0;
    }
}

Jim Dempsey
smithpd
Beginner
Thanks again for your help, Jim.

Sorry, I do not understand why I would want to do the above. The array allocation is not the problem. Regardless, the array in question is small, of order N rather than N**2, and it has a fixed, known size. The outxy array that we are slicing over in the inner loop is of order N**2, so there is a benefit in slicing it between threads, with one row of order N processed per thread. I should add that N can be 4000.

The problem with your suggested code is that the outer loop is awkward in variant #1 (see my comment above) and does not provide any speed benefit compared with variant #3, which I am now using. From what I have seen in testing, I doubt that any amount of reworking of the outer loop will make it run faster. Again, if I am wrong, please tell me how doing something in the outer loop improves efficiency while keeping the inner-loop accumulation into outxy thread safe.

Since the discussion seems to have strayed from it, and maybe I never described it well enough, here is a brief description of the algorithm:

Outer loop over M projection angles (M ~= N)
{
    Collect an equivalent image to be projected.
    The equivalent image is of size L rows x N columns, with L ~= N/10. The L rows of the
    L x N projection image are collected as an array of row pointers from a non-contiguous
    set of image rows stored elsewhere. This defines an equivalent contiguous projection
    image that can be indexed as equivalent_image[i][j].
    The pointer collection operation is simple indexing and very fast; there is no need to
    parallelize it. Then call
        project(equivalent_image, ...);
}

project(equivalent_image, ...)
{
    First inner loop over N rows in a reconstruction image outxy of size N x N
        Innermost loop over N columns in outxy
            calculate the quantity to be added into outxy[irow][icol] by interpolation in equivalent_image
            outxy[irow][icol] += quantity;   // must be thread safe
}

With the above logic, if there is no parallelism in the outer loop over angles, then the equivalent image array is allocated only once, and its elements are collected for each projection angle. The collection must be done in any case.
jimdempseyatthecove
Honored Contributor III

Assume a thread pool of P threads.

Your current method requires

M * (N/P) thread-team create/discharge operations (each with an implicit barrier) and one instance of outxy.

The technique I am trying to guide you towards requires

one thread-team create/discharge operation (implicit barrier) and P instances of outxy.

Now it may be the case that the time spent creating and discharging thread teams is a very small percentage of the overall time to compute a projection angle, and not worth going after.

However,

If the outer loop is set up with a chunk size of 1 (i.e. each thread picks the next projection angle and then computes that projection without help from the other threads), then there is no per-angle synchronization barrier.

What this means is:

Consider the app humming along with 8, 16, or 32, ... threads computing projection angles (one at a time using P threads), and consider another app, say with 1 thread, that runs on the system and takes computation resources. That one thread will compete with one of your application's threads (one at a time). This introduces a completion-time skew for that thread's slice of the inner loop over the outxy array, which in turn forces the threads that finish first to either burn time in KMP_BLOCKTIME or go to sleep and then wake up (neither of which is productive).

In the same situation with the outer loop over M and chunk size 1 (each thread does one projection angle at a time), the interrupted thread still takes longer to complete its current projection angle, but the other threads do not wait (no burn time, no sleep time). The thread being preempted will simply end up processing fewer projection angles than the other threads. The only adverse situation is when M is relatively small (in which case you parallelize the inner loop instead).
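
Very roughly, and with made-up helper names (zero_matrix, add_matrix, project_angle), the structure I have in mind looks like this (untested):

#pragma omp parallel shared(outxy, M, rows, columns)
{   // one thread team for the whole job
    int m;
    float** my_outxy = allocate_matrix(rows, columns);   // one private accumulator per thread
    zero_matrix(my_outxy, rows, columns);

    #pragma omp for schedule(dynamic, 1)
    for (m = 0; m < M; m++)              // each thread grabs the next projection angle
    {
        // collect the equivalent image for angle m and project it into my_outxy,
        // entirely within this thread - no other thread touches my_outxy
        project_angle(m, my_outxy, rows, columns);
    }

    #pragma omp critical
    add_matrix(outxy, my_outxy, rows, columns);          // merge the P partial sums once at the end

    free_matrix(my_outxy);
}   // end of parallel region

A preempted thread simply ends up doing fewer projection angles; the other threads keep pulling angles off the list instead of waiting at a per-angle barrier.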

Jim Dempsey


smithpd
Beginner
OK, I understand the idea, but P instances of outxy are not so good in my opinion, for the reasons explained below.

outxy is a large array of size N x N, where N is typically 4000 and is now 8080 for the problem I am working on. So outxy is 8080 x 8080 x 4 bytes/pixel = 261,145,600 bytes, or roughly 250 MB. With other stuff going on, plus code size, the program runs in about 1 GB when solving a single problem. For this large problem P is 8, so P copies of outxy alone would give us 2 GB. That is doable with 64 bits, but I am not crazy about the concept. I have a better way to use memory, which is MPI....

FYI, this is a hybrid method. I also have MPI running as well. OpenMP and MPI are optional, selected in the build configuration. I can have several MPI jobs running on one board and OpenMP running at the same time. It is limited by memory: smaller memory per job --> more MPI jobs. With a proper choice of the number of MPI jobs, most of the memory is occupied and, with any luck, most CPU cycles are utilized. What OpenMP does in this case is fill in the cracks if the CPU is being underutilized, so the CPU stays at 100%. I did not mention this before because it was not crucial to finding and fixing my bug.

Anyway, I find that with 8 threads this program runs about 8x faster than the single-thread, non-OpenMP case, even with the excessive thread creation/discharge you have noted. In view of this, it seems to me that the thread overhead is minimal.

Does this change your view of it?

I will be gone until Monday.
jimdempseyatthecove
Honored Contributor III

Yes, that fills the cracks.

Make the outer loop distribute the N x N array work across MPI (even to the same system), and make the inner loop OpenMP. On a memory-tight system with several cores, the OpenMP will speed things up.
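
Something along these lines, as a very rough and untested sketch (back_project() stands in for your interpolation/accumulation step, and it assumes the outxy data sits in one contiguous block behind the row pointers):

int rank, nranks, m, i, j;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nranks);

for (m = rank; m < M; m += nranks)       // each MPI rank takes a stride of the projection angles
{
    // collect equivalent_image for angle m, then:
    #pragma omp parallel for private(i, j)
    for (i = 0; i < rows; i++)           // OpenMP splits the rows of outxy within the rank
    {
        for (j = 0; j < columns; j++)
        {
            outxy[i][j] += back_project(equivalent_image, m, i, j);
        }
    }
}

// sum the per-rank partial reconstructions onto rank 0
MPI_Reduce(rank == 0 ? MPI_IN_PLACE : &outxy[0][0], &outxy[0][0],
           rows * columns, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);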

If you can work it such that the N x N arrays are "sticky" with respect to the MPI nodes, then you might get by with passing MPI messages containing an array ID in place of the contents of the array. Your NAS filesystem might be able to do some of this for you.

Jim