Problem with OpenMP addition into a shared array

smithpd · ‎10-20-2009

Solution found

I have solved this problem myself, with a little discussion with a colleague. The first implementation described in this post is not thread safe. The second implementation described below in post #4 is thread safe. The fundamental principle is that you should not allow two threads to add into (or write) one array element at the same time. I was deluded into thinking that the first implementation would work because I was specifying that an "array" was shared. In fact, only the pointer to the array was shared, not the actual data in the array, which had been allocated using malloc. The problem was evidently caused by collisions between two threads trying to modify the same data element of the array.

This may be of interest to others who encouter similar issues.

Although a solution has been found, a secondary issue is that it might be nice if either OpenMP or the Intel compiler provided a mechanism to protect the elements of an array from collision, or at least give a warning if they were not protected. Or do they?

Original problem - see below for solution

I have some OpenMP code that is trying to perform additions into the elements of a shared 2-D array, but it creates random errors, as if the threads were interfering / conflicting with one another. The code that performs the additions is in an external function. A pointer to the shared array is passed into the external function. The calling function is parallel, but the external function that does the additions is not. At issue is how a single iteration of a parallel for loop executes the external function and whether multiple threads can cause interference in the addition into the shared array. If I make the external code section that performs the addition omp critical, the random error goes away, but so does the speed gain -- parallism is destroyed. If I run the code with the #pragma omp commented out, it also works, again destroying parallel speed.

The logic is described below. Obviously it is wrong or not thread safe. Please help me fix this if you can.
Sorry, I don't know how to make my indents stick.

existing logic (not exact code)
--------------------------------------------------
excerpt from calling function, containing parallel for loop
--------------------------------------------------

float var1, var2;
int i, j;
float** outxy; // shared accumulator

//... function allocate_matrix() defines row pointers so outxy may be indexed as outxy
outxy = allocate_matrix(rows, columns);
//... zero out accumulator
for (i = 0; i < rows; i++)
{
for (j = 0; j < columns; j++)
{
outxy = 0;
}
}
//... calculate local variables that are independent of loop index
var1 = xxx;

#pragma omp parallel for shared(outxy, nloop, rows, columns) private(iloop, var1, var2)
for (iloop = 0; iloop < nloop; iloop++) // intent is to have one thread for each iloop index
{
//... calculate local variables that depend on iloop index
var2 = yyy;
//... for this value of iloop, perform addition into outxy elements using external function accumulate()
accumulate(outxy, iloop, rows, columns, var1, var2); // called in parallel: each call is by one thread?
}

//... process (outxy)
free_matrix(outxy)
.
.
.

------------------------------------------------
external function that does accumulation
------------------------------------------------

void accumulate(float** out, int loop, int rowmax, int colmax, float x, float y)
{
int i, j;
//... calculate local variables for this function
var = f1(x, y, loop);
//... perform addition for one value of "loop" in calling function
for (i = 0; i < rowmax; i++)
{
for (j = 0; j < colmax; j++)
{
//... add stuff into an element of out matrix
stuff = f2(i, j, var);
out += stuff;
}
}
}

------------------------

Thanks in advance for your help.

JenniferJ · ‎10-20-2009

TheIntel Parallel Studio (the Parallel Insepctor)(Windows only) seems a good tool for this. The Parallel Inspector will find out any possible race condition etc. Also the parallel debugger extension might be useful here to debug the issue.

what do you mean "external function"? is it in another .dll?

Thanks,
Jennifer

smithpd · ‎10-20-2009

Quoting - Jennifer Jiang (Intel)

TheIntel Parallel Studio (the Parallel Insepctor)(Windows only) seems a good tool for this. The Parallel Inspector will find out any possible race condition etc. Also the parallel debugger extension might be useful here to debug the issue.

what do you mean "external function"? is it in another .dll?

Thanks,
Jennifer

Be external function, I mean only call to another C++ function. It is probably bad terminology. The function is compiled along with the others. There are no DLLs. The point is, a function is called within the parallel for loop, but its code is not contained explicitly within the parallel for loop. The question is, is this function call thread safe, and if not how do I make it so?

I don't have Parallel Studio. I have only the compiler version 11.

The problem with using a debugger is that I do not know what to look for at this point. I could simply put print statements in if I did, but the errors are random incorrect pixels in mostly good pixels, and I am not sure I can catch it. I was hoping that someone could look at the code logic and determine that it does or does not meet OpenMP requirements.

smithpd · ‎10-20-2009

I have been working on this while waiting for a reply, and I have some additional information. I found another way that works but is not ideal to me. If I take the "#pragma parallel for" out of the outer loop in the calling program and put a "#pragma parallel for" outside the double loop in the function accumulate() function, then the accumulate function processes each row of "out" in a separate thread, and each thread sums only over columns of the"out" array.

I imagine that somthing about passing the argument outxy to the function accumulate() is causing the parallelism to break. I suspect that it has something to do with the fact that the loops are not actually acting on the same piece of memory, or writing into a 2-D array like that (pointers to pointers) is not thread safe. I tried putting "#pragma omp flush" at various places and this did not work, either.

I don't really like this solution because:

(1) it is about 20% slower than before because with non-parallel accumulate() more work is being done uninterrupted by eash thread;

(2) it seems logical to make accumulate() a separate function that does not have to be parallelized.

If anyone can explain to me why the function call in the original code example, in which the entire outxy array is incremented non-parallel by one thread by the accumulate() function, does not work, and propose a fix, I would be grateful.

In the meanwhile, I will forge ahead and report improved results if I obtain them.

smithpd · ‎10-20-2009

Second implementation that is thread safe

OK, I think I am done with this, and I think I have an explanation.

The second implementation, putting the #pragma omp parallel for in the accumulate function, so the threads are looping over rows of outxy, guarantees that no element outxy will be added into at the same time. This is because each row is done by a separate thread, and no two threads operate on the same row. So, it is thread safe.

The first implementation, where the threads controlled the outer loop, allowed two threads to add into the same elerment outxy at the same time, so it was not thread safe. The failure was random because this happened only occasionally. As a matter of fact, sometimes the results were perfect, but usually there were random bad pixels.

The funny thing is that this code has been in existence, in a thread-unsafe state, for several years, and this problem was not noticed. The reason for that is that we have been running it on older computers until now, and we were using an older version of the compiler. With that situation, eveything worked as we thought it should. Now, with new computers and the new compiler, the old code failed.

If anyone has any input as to exactly why the first implementation used to work and now it doesn't, please add to this thread. In the meanwhile, I am sticking to the second implementation that works.

For the record, I the outline code for the one that works is listed below.

--------------------------------------------------
revised logic (not exact code)
--------------------------------------------------
excerpt from calling function. no longer parallel
--------------------------------------------------

float var1, var2;
int i, j;
float** outxy; // shared accumulator

//... function allocate_matrix() defines row pointers so outxy may be indexed as outxy
outxy = allocate_matrix(rows, columns);
//... zero out accumulator
for (i = 0; i < rows; i++)
{
for (j = 0; j < columns; j++)
{
outxy = 0;
}
}
//... calculate local variables that are independent of loop index
var1 = xxx;

for (iloop = 0; i < nloop; i++)
{
//... calculate local variables that depend on iloop index
var2 = yyy;
//... for this value of iloop, perform addition into outxy elements using function accumulate()
accumulate(outxy, iloop, rows, columns, var1, var2);
}

//... process (outxy)
free_matrix(outxy)
.
.
.

--------------------------------------------------------------------------
function that does accumulation. Parallel in loop over rows of out matrix
---------------------------------------------------------------------------

void accumulate(float** out, int loop, int rowmax, int colmax, float x, float y)
{
int i, j;
//... calculate local variables for this function
var = f1(x, y, loop);
//... perform addition for one value of "loop" in calling function
#pragma omp parallel for shared(out, rowmax, colmax, var) private(i, j, stuff)
for (i = 0; i < rowmax; i++)
{
for (j = 0; j < colmax; j++)
{
//... add stuff into an element of out matrix
stuff = f2(i, j, var);
out += stuff;
}
}
}

------------------------

Thomas_Z_Intel · ‎10-21-2009

Quoting - smithpd

[...]
I don't have Parallel Studio. I have only the compiler version 11.

The problem with using a debugger is that I do not know what to look for at this point. I could simply put print statements in if I did, but the errors are random incorrect pixels in mostly good pixels, and I am not sure I can catch it. I was hoping that someone could look at the code logic and determine that it does or does not meet OpenMP requirements.

Intel Compiler Pro v11.1 comes with the very same Parallel Debug Extensions to the Microsoft VS Debugger as in Intel Parallel Studio (Windows) that should help you to detect exactly such problems in OpenMP code. The Linux version contains a GUI-driven Debugger with same parallel debug capabilities as on Windows. For details please see http://software.intel.com/en-us/articles/parallel-debugger-extension/ (Windows) or http://software.intel.com/en-us/articles/idb-linux/ (Linux).
Although you seem to have found a solution by now, I'd encourage you to try these debugger features and would be very interested to hear back from you if you find them useful.

Thanks, Thomas

jimdempseyatthecove · ‎10-21-2009

In the first case you had thread distribution on outer loop index level and with multiple threads accumulating into the same out array with the potential of multiple threads concurrently attempting out += stuff; using the same i and j.

In the second case you moved the scope of the parallel for deeper into your nesting level deeper by one level (to the for(i=0... loop). This corrected the two threads read/modify/writing to the same cell of out.

This will work fine as long as rowmax is relatively large (or happens to be an integral multiple of the number of threads). When rowmax is small consider adding an index to out such that you perform

out[loop] += stuff;

Then after the outer parallel for (at loop level) has complete then consolidate all out[loop]'s where loop>0 into loop==0.

Use outer parallel for on index i, and inner two loops with loop and j.

Jim Dempsey

smithpd · ‎10-21-2009

Quoting - jimdempseyatthecove

In the first case you had thread distribution on outer loop index level and with multiple threads accumulating into the same out array with the potential of multiple threads concurrently attempting out += stuff; using the same i and j.

In the second case you moved the scope of the parallel for deeper into your nesting level deeper by one level (to the for(i=0... loop). This corrected the two threads read/modify/writing to the same cell of out.

This will work fine as long as rowmax is relatively large (or happens to be an integral multiple of the number of threads). When rowmax is small consider adding an index to out such that you perform

out[loop] += stuff;

Then after the outer parallel for (at loop level) has complete then consolidate all out[loop]'s where loop>0 into loop==0.

Use outer parallel for on index i, and inner two loops with loop and j.

Jim Dempsey

Thanks, Jim. Hey, is your cove the La Jolla cove? I used to live near there.

I have two comments about that extra index you suggested.

First, FYI, rowmax is large in our case, so the extra index will not work. Typically rowmax is 1000 - 4000 and we store 16 bits per pixel, so the image we are accumulating into is up to 32 MB at present. We are about to try a 8000 x 8000 problem, a factor for 4 greater. The outer loop is presently over, say, 1000-2000 partial images we are adding together, so we would need to store up to 64 GB in memory now and four times that for the 8000 problem!

Second, I want to be sure that the second method is always thread safe. I think that the way I have done the second implementation, no more than one thread can ever access one row of the accumulation matrix outxy. The omp parallel for loop is over rows. As I understand it, this means that a single thread is assigned to any given row, is it not? Then there is a default synchronizing barrier (wait) for threads to complete at the end of the omp parallel for loop. If so, isn't the second method guaranteeed to be thread safe?

Some other thoughts about this problem in general....

Contemplating this after the fact, I think the reason the first implementation failed is that the shared variable was a pointer to the data, not the data itself. All examples I see on the web use scalar variables, which I think are protected against collision if they are shared. If you making repetitive operations on the elements of an array, I am unaware of any way that OpenMP can protect the entire array using the "shared" key word. For some reason unknown to me, the first implementation did actually work on our older computers with an older version of the compiler. It suddenly started failing with some newer computers and the newer compiler. I don't know which or why. It could be that the compiler recognized the pointer as pointing to an array, or maybe the older computer treated cache differently. In any event, the non-OpenPM branch of the code works fine with the first implementation because we do not have to worry about the multiple thread collision.

I wonder if OpenMP has, or plans to have, a method to protect arrays from collisions. I wonder if Intel could make some switch that recognizes an array, of maybe they did so and abandoned it in version 11.

FYI, I tried "#pragma omp critical" on the line in the inner loop of the accumulation. It gave correct results but took a long time to compute, 9 times longer than with no OpenMP whatsoever. Setting and releasing locks in the inner loop is obviously not a good idea.

jimdempseyatthecove · ‎10-21-2009

I live in what used to be my father's house known in the area as "The Cove".

What you might consider doing then is method 3..

Create n arrays of out, one for each thread team member number. This will reduce the storage requirement from (1000:2000) * 32MB to n * 32MB. Then after you perform the outer most parallel for loop reduce the n arrays into array[0].

4th method (requires no additional memory)

Just inside the parallel for loop (that runs through immages) divide the out array into n slices (one slice for each thread), derive the slice number from the omp_get_thread_num(). Note, you can use a firstprivate using an external to for loopint threadNum=-1; and internal to for loop if(threadNum < 0) threadNum = omp_get_thread_num(); (or use TLS). Or bear the overhead of the function calls. With array out divided up in this manner no two threads will access the same cells.

If you find threads finishing up at a large skew then set schedule(static, 1) or try schedule(dynamic). Try schedule(dynamic). first.

Don't use critical, it will slow you down. Program to avoid critical.

Jim Dempsey

smithpd · ‎10-21-2009

Thanks again, Jim. Those are some interesting ideas, and you have enriched my understanding of OpenMP. I really think, though, that implementation #2, looping over rows of the out matrix, is best for me. That is because (a) it is the simplest code, which can be understood at a glance, (b) it requires no additional memory, and (c), I think now that the additional run time compared to the first (incorrect) method is nil, or at least not large enough to warrant making the code more complicated.

So, I have done #2, and it works fine.

Thanks again.

jimdempseyatthecove · ‎10-22-2009

One other quick and dirty (clean) experiment

add

#pragma omp parallel (your shared/private here)
{ // scope of parallel region
for(... outer loop here
{
// (omit parallel from line))
#pragma omp for
for(... // inner parallel loop as you have now
{
...
} // end of #pragma omp for
} // end of for(... outer loop here
} // end scope of parallel region

What the above will do is reduce the numberof timesa thread pool isassembled/dissassembled.

Jim Dempsey.

jimdempseyatthecove · ‎10-22-2009

You may need to add #pragma omp barrier too

#pragma omp parallel (your shared/private here)
{ // scope of parallel region
for(... outer loop here
{
// (omit parallel from line))
#pragma omp for
for(... // inner parallel loop as you have now
{
...
} // end of #pragma omp for
// barrier should be implicit here
// if you have problems add
#pragma omp barrier
} // end of for(... outer loop here
} // end scope of parallel region

jimdempseyatthecove · ‎10-22-2009

#pragma omp parallel (your shared/private here)
{ // scope of parallel region
for(... outer loop here
{
#pragma omp master
{

}
#pragma omp barrier
// (omit parallel from line))
#pragma omp for
for(... // inner parallel loop as you have now
{
...
} // end of #pragma omp for
// barrier should be implicit here
// if you have problems add
#pragma omp barrier
} // end of for(... outer loop here
} // end scope of parallel region

smithpd · ‎10-26-2009

I was off on Friday. I'll try those ideas today.

jimdempseyatthecove · ‎10-26-2009

The last method is not friendly to other apps running on the system (but you can vary the block time if this becomes an issue).

Jim Dempsey

smithpd · ‎10-26-2009

Jim,

Of the three solutions you proposed above, I tried the first and the second. The barrier was not needed, but each of these solutions was slower than my implementation #2, in which the outer loop had no OpenMP and the multi-threading was only in the outer for of the inner loop. The time to run your recent suggestions was about 13% greater than the time for my implementation #2.

I am not sure exactly what is causing the slowdown, but I think it could be two things. First, I did not list the complete code in the problem statement of this thread because I was trying to simplify the description here and find something that would work, which I did. One factor missing from my description is that, with OpenMP in the outer loop, I had to malloc and free memory for another array inside the outer loop so that data which was read into that array (in the loop) could be private to a thread. That repeated malloc / free takes time. With no OpenMP in the outer loop, I can share that memory array and malloc / free it outside the loop. A second possibility is that OpenMP must transfer all of its thread information into the function accumulate(), which contains the inner loops.

In addition to eliminating the memory allocation in the outer loop and passing thread information, another advantage of my implementation #2 is that now I can return from within the outer loop if there is an error in the inner loop. I could not do that with OpenMP. Both this and the malloc / free required a duplicate section of code to implement OpenMP in the outer loop, which as been elimininated with my implementation #2.

So, for all these reasons, I think my implementation #2 is the preferred solution. It is faster, simpler, and more functional. My previous dislike of implementation #2 was clearly unfounded.

Thanks again for your help and interest in this problem.

jimdempseyatthecove · ‎10-26-2009

The malloc and free were likely the cause of the slowdown. Do the malloc once (or once each time the temporary array needs to grow). Error bailout can be done with a flag.

#pragma omp parallel
{
double* YourArray = new double[yourSize];
bool error=false;
#pragma omp for
for(int n=0; n {
if(error) { n=N; continue; }
// read in
...
if(error) { n=N; continue; }
... // process buffer
}
if(YourArray)delete [] YourArray;
}

I would think your bigest choke point might be not being able to overlap the I/O with processing.
Look at the OpenMP tasking available in the 3.0 spec. This might make it a little easier to overlap the I/O with the processing (e.g. double buffer the input and output).

This can be well worth the effort (assuming your wait time is absurdly long).

Jim Dempsey

Jim

smithpd · ‎10-26-2009

Thanks again, Jim.

In your last example, does the outer for loop (#pragma omp for) branch to mutiple threads, each operating on different index values n? If so, the different threads cannot read into the same array because the array is a function of n. Each needs its own private storage for the buffer. That is why I put the malloc inside the loop. Am I missing something? If they do not branch to multiple theads, what is the point of informing omp about this outer loop?

I suppose the next step is to allocate nthreads arrays and rotate between them. But really, as I indicated previously, I think that there is little to gain by adding complexity. Another factor I should mention is that the time to load the data from disk into buffer is small compared with the time to do the accumulation in the inner loop, which involves some arithmetic operations, so messing with the outer loop is not too productive.

jimdempseyatthecove · ‎10-27-2009

#pragma omp parallel

Begins parallel region (all threads execute)

Each instance of "double* YourArray;" is inside parallel region and therefore is thread private.

#pragma omp for

*** within parallel region ***
Slices iteration space of next foramongst threads
#pragma omp for
for(int n=0; n{

Each thread has seperate copy of n
and
each thread has different slice of iteration space (0,N]

You do not need to deal with threadprivate since each copy of YourArray (pointer) is on private stack of thread.

In above, you may (or may not)need enclose portion(s) of your read-in in a critical section. e.g. if all datasets for n stored in single large file (each bite of file in serial but all other code in parallel).

Jim Dempsey

Jim Dempsey

smithpd · ‎10-27-2009

Jim,

You suggested in post #16 that I could allocate the "buffer" between the "#prama omp parallel" and the "#pragma omp for". I tried this and it didn't work properly. It ran and gave correct results, but it allocated only one array, it ran in only one thread, and so OpenMP was defeated. Code and results are shown below.

The only conditions in which I can get it to run in multiple (8) threads are:
(1) I allocate the memory inside the "#pragma omp for" loop, and
(2) I put no statements whatsoever between the #prama omp parallel" and the "#pragma omp for"
Observation (2) makes me wonder what is the use of spltting the "#prama omp parallel" and the "#pragma omp for".

I did this testing with a stripped down version of my actual code. Below are code listings of the relevant sections for both inside-loop and outside-loop memory allocation. Each code case is followed by some output from the code. If you have some time, please let me know why this is not working properly when allocation is done outside the for loop.

Recall that I have a solution that works, implementation #2, in which this outer loop is not parallelized and the OpenMP is moved further down in the code. So, this investigation is mainly of academic interest. It might, however, improve my understanding of OpenMP.

Code for allocation inside for loop (runs in 8 threads)

abort = 0;
c_rows_save = 0;
int tmax, tnum;
#pragma omp parallel default(none) private(c_rows, rawImageEquiv, sin_count, tnum) shared(tmax, inpt, lookup, out_xy, fc, angles, sino_index, cout, cout_prefix, abort, c_rows_save)
//....Loop over angles
#pragma omp for
for(c_rows = 0; c_rows < angles; c_rows++)
{
tmax = omp_get_num_threads( );
tnum = omp_get_thread_num( );
rawImageEquiv = (f_type **)malloc( fc.total_sins * sizeof(f_type*));
if ( rawImageEquiv == NULL )
cout << cout_prefix << "rawImageEquiv allocation FAILED inside loop for thread "
<< tnum << " of " << tmax << ", row " << c_rows << " of " << angles << endl;
else
cout << cout_prefix << "rawImageEquiv allocated successfully inside loop for thread "
<< tnum << " of " << tmax << ", row " << c_rows << " of " << angles << endl;

//....Set rawImageEquiv pointers to point to the raw image rows
//....applicable to the particular angle (c_rows)
for( sin_count = 0; sin_count < fc.total_sins; sin_count++ )
{
rawImageEquiv[sin_count] = sino_index[sin_count].image[c_rows];
}
//....Back project at an angle. Fan beam done with in-plane geometry
if (abort == 1)
{
if (rawImageEquiv) free(rawImageEquiv);
continue;
}
if( !feld_project_angle(inpt, lookup, out_xy, c_rows, fc, rawImageEquiv) )
{
if (abort == 0)
{
c_rows_save = c_rows;
abort = 1;
}
}
if (rawImageEquiv) free(rawImageEquiv);
} // end loop over angles
if (abort == 1) //....Free all storage and return in error mode
{
cout << cout_prefix << "ERROR> feldkamp projection failed for angle " << c_rows_save << endl;
return 0; //.... Must return outside the loop for OMP
}
cout << cout_prefix << "Completed loop over angles, sinograms, OpenMP." << endl;

Output of interest from allocation inside loop

rawImageEquiv allocated successfully inside loop for thread 2 of 8, row 180 of 720
rawImageEquiv allocated successfully inside loop for thread 3 of 8, row 270 of 720
rawImageEquiv allocated successfully inside loop for thread 1 of 8, row 90 of 720
rawImageEquiv allocated successfully inside loop for thread 0 of 8, row 0 of 720
rawImageEquiv allocated successfully inside loop for thread 4 of 8, row 360 of 720
rawImageEquiv allocated successfully inside loop for thread 7 of 8, row 630 of 720
rawImageEquiv allocated successfully inside loop for thread 6 of 8, row 540 of 720
rawImageEquiv allocated successfully inside loop for thread 5 of 8, row 450 of 720
rawImageEquiv allocated successfully inside loop for thread 1 of 8, row 91 of 720
rawImageEquiv allocated successfully inside loop for thread 3 of 8, row 271 of 720
rawImageEquiv allocated successfully inside loop for thread 7 of 8, row 631 of 720
...
rawImageEquiv allocated successfully inside loop for thread 5 of 8, row 539 of 720
rawImageEquiv allocated successfully inside loop for thread 7 of 8, row 719 of 720
rawImageEquiv allocated successfully inside loop for thread 1 of 8, row 179 of 720
Completed loop over angles, sinograms, OpenMP.
Run time of slice reconstruction = 3.156 seconds.

Code for allocation outside loop (runs in only one thread)

abort = 0;
c_rows_save = 0;
int tmax, tnum;
#pragma omp parallel default(none) private(c_rows, rawImageEquiv, sin_count, tnum) shared(tmax, inpt, lookup, out_xy, fc, angles, sino_index, cout, cout_prefix, abort, c_rows_save)
tmax = omp_get_num_threads( );
tnum = omp_get_thread_num( );
rawImageEquiv = (f_type **)malloc( fc.total_sins * sizeof(f_type*));
if ( rawImageEquiv == NULL )
cout << cout_prefix << "rawImageEquiv allocation FAILED outside loop for thread "
<< tnum << " of " << tmax << endl;
else
cout << cout_prefix << "rawImageEquiv allocated successfully outside loop for thread "
<< tnum << " of " << tmax << endl;
//....Loop over angles
#pragma omp for
for(c_rows = 0; c_rows < angles; c_rows++)
{
//....Set rawImageEquiv pointers to point to the raw image rows
//....applicable to the particular angle (c_rows)
for( sin_count = 0; sin_count < fc.total_sins; sin_count++ )
{
rawImageEquiv[sin_count] = sino_index[sin_count].image[c_rows];
}
//....Back project at an angle. Fan beam done with in-plane geometry
if (abort == 1)
{
continue;
}
if( !feld_project_angle(inpt, lookup, out_xy, c_rows, fc, rawImageEquiv) )
{
if (abort == 0)
{
c_rows_save = c_rows;
abort = 1;
}
}
} // end loop over angles
if (rawImageEquiv) free(rawImageEquiv);
if (abort == 1) //....Free all storage and return in error mode
{
cout << cout_prefix << "ERROR> feldkamp projection failed for angle " << c_rows_save << endl;
return 0; //.... Must return outside the loop for OMP
}
cout << cout_prefix << "Completed loop over angles, sinograms, OpenMP." << endl;

Output of interest, allocation outside loop

rawImageEquiv allocated successfully outside loop for thread 0 of 8
outside allocation, thread 0 of 8, row 0 of 720
outside allocation, thread 0 of 8, row 1 of 720
outside allocation, thread 0 of 8, row 2 of 720
outside allocation, thread 0 of 8, row 3 of 720
outside allocation, thread 0 of 8, row 4 of 720
outside allocation, thread 0 of 8, row 5 of 720
outside allocation, thread 0 of 8, row 6 of 720
outside allocation, thread 0 of 8, row 7 of 720
outside allocation, thread 0 of 8, row 8 of 720
outside allocation, thread 0 of 8, row 9 of 720
outside allocation, thread 0 of 8, row 10 of 720
...
outside allocation, thread 0 of 8, row 717 of 720
outside allocation, thread 0 of 8, row 718 of 720
outside allocation, thread 0 of 8, row 719 of 720
Completed loop over angles, sinograms, OpenMP.
Run time of slice reconstruction = 23.891 seconds.

jimdempseyatthecove · ‎10-28-2009

Add braces {} around outer parallel region

#pragma omp parallel...
{
...
#pragma omp for
for(...
{
}
...
} // end pragma omp parallel...

Without the braces the scope of the parallel region is the next statement (just the tmax= line)

Jim Dempsey