Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

!DEC$ PARALLEL

davidspurr
Beginner
3,217 Views
Language Ref states that !DEC$ PARALLEL "enables auto-parallelization for an immediately following DO loop".

Does this apply to an outer loop that has many other loops, subroutine calls, etc. within it? i.e. each cycle of the outer loop processed in a separate thread, even if there is a substantial amount of code within the loop.
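Roughly the structure I have in mind (simplified; the names are just placeholders):

!DEC$ PARALLEL
do isite = 1, nsites             ! outer loop over independent sites
    call process_site(isite)    ! contains further loops and subroutine calls
end do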

That would seem a very simple means of parallel execution and, in my case, should speed execution significantly (quad core, x64), since I have many thousands of sites of independent activity. However, when I tried it I saw no increase in speed, with CPU usage rarely exceeding 25% - 27%.

David


0 Kudos
32 Replies
TimP
Honored Contributor III
942 Views
If any elements of Res() are accessed by multiple threads, you have a "race" condition, where changes in the order of updating would account for inconsistent results. Running Thread Checker with your data sets should verify this.
0 Kudos
jimdempseyatthecove
Honored Contributor III
942 Views

David,

Learning how to crawl...

Your original code had

...code outer before
do I=1,Icount
... code inner
end do
... code outer following

The crawl method

module MOD_outer
    ... shared variables
end module MOD_outer

subroutine WAS_original
    use MOD_outer
    ... code outer before
!$OMP PARALLEL DO
    do I=1,Icount
        call WAS_code_inner(I)
    end do
!$OMP END PARALLEL DO
    ... code outer following
end subroutine WAS_original

subroutine WAS_code_inner(I)
    use MOD_outer
    integer, intent(IN) :: I
    real :: A,B,C                  ! local vars
    ! ****** use AUTOMATIC for thread local arrays *****
    real, AUTOMATIC :: Array(100)
    integer :: J,K,L               ! local vars
    ... code inner
end subroutine WAS_code_inner

Step 1: Rework the code and compile _without_ parallelization. Get the code to work and compare run times. The runtime of the reworked code should be almost identical to the older single-threaded code. If not, then something has to account for the difference (e.g. different compiler options). Resolve those differences to your satisfaction.

Step 2: Immediately in front of the !$OMP PARALLEL DO insert

CALL OMP_SET_NUM_THREADS(1)

to force the PARALLEL DO to use only 1 thread. Compile as parallel, rerun the performance test. This too should produce almost the same runtimes. If it does not, some investigation is warranted to determine the cause of the slower run time.
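For reference, the placement looks like this (a sketch; it assumes USE OMP_LIB, or the equivalent include file, to provide the routine's interface):

use omp_lib
... code outer before
call OMP_SET_NUM_THREADS(1)    ! force a single thread for the comparison run
!$OMP PARALLEL DO
do I=1,Icount
    call WAS_code_inner(I)
end do
!$OMP END PARALLEL DO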

Step 3: Comment out the call to set the number of threads (or change it to use 2 threads). Run the performance test again. If the performance goes down then something is causing interference between the threads. Using a profiler (timer-based sampling) will (hopefully) show where the threads are slugging it out.

You should be aware that there are some runtime system library calls that run in a critical section. READ and WRITE are obvious, ALLOCATE and DEALLOCATE will be serialized, but other things are not so obvious, such as the functions that return random numbers. As an example, the profiler will typically not show that you are in the random number generator; instead it will likely show that you are in a routine performing a SpinLock. You can find the location of the SpinLock code, then, while running, randomly set a break point. Look at the call stack. If that is not descriptive enough, then step out until you reach the Fortran code. This may take a few iterations of remove break point, continue, set break point, diagnose, repeat until done.
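A common workaround for the random number case, sketched under the assumption that the values can be generated up front (the extra argument to the inner routine is hypothetical):

real, allocatable :: rnd(:)
allocate(rnd(Icount))
call RANDOM_NUMBER(rnd)               ! fill all values serially, before the parallel loop
!$OMP PARALLEL DO
do I=1,Icount
    call WAS_code_inner(I, rnd(I))    ! pass the pre-generated value in, avoiding the serialized generator
end do
!$OMP END PARALLEL DO
deallocate(rnd)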

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
942 Views

David,

RE: tim18
>>If any elements of Res() are accessed by multiple threads, you have a "race" condition, where changes in the order of updating would account for inconsistent results. Running Thread Checker with your data sets should verify this.<<

What Tim is referring to is what is called a temporal dependency. Example:

Res(N) = Expression(Res(N-1))

Where you cannot compute the N'th result prior to computing the (N-1)'th result. There are many other coding conditions that are sensitive to sequence of operations.
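A concrete instance, just for illustration (X() is an arbitrary input array; this is a running sum):

do N = 2, Icount
    Res(N) = Res(N-1) + X(N)    ! iteration N needs the result of iteration N-1, so iterations cannot safely run out of order
end do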

Counter example

Res(N) = Expression(Res(N+1))

Where you must use the old value of the next cell in the output array. In this case you do not want a different thread to compute the next value of Res(N+1) prior to you using its old value. In this circumstance you can create a 2nd result array, such as ResNew, to hold the new results while Res maintains the old results:

call Initialize(Res)
do while (.not. Done)
!$OMP PARALLEL DO
    do I=1,Icount
        ResNew(I) = fn(Res,I,...)
    end do
!$OMP END PARALLEL DO
    Res = ResNew
end do

Depending on the complexity of the code the Thread Checker might not be able to detect the temporal dependency. You, being familiar with the code, should know of these issues.

Jim Dempsey

0 Kudos
davidspurr
Beginner
942 Views
Tim & Jim especially - many thanks for the detailed explanations. They are helping a lot.

Two brief specific questions from my posts last night:

+ Do I need to declare xFUNCTION private?
+ In my simplified code example, is Res(:,:) OK remaining SHARED?

###

Will review my code in the light of your comments shortly.

A couple of points I can clarify

1. Yes, elements of Res() will be updated by multiple threads. It is simple aggregation, so the order in which they are summed **should** have no impact (but yes, maybe I need to look at the precision issue of adding small bits to a large total; see the small illustration after this list. I did not expect the totals for the individual elements to be an issue, but I will review it more closely now).

2. There is no temporal dependency in Res()

3. I am certain there are no runtime system library calls in the parallel loop or the function called. Certainly no READ, WRITE, ALLOCATE, LOG, or RANDOM type stuff. Just simple add, subtract, multiply & divide, if & DO.
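The illustration mentioned in point 1 (illustrative numbers only, just to show small increments being lost in a large single-precision total):

program sum_precision
    implicit none
    real    :: s4
    real(8) :: s8
    integer :: i
    s4 = 1.0e7                  ! large running total in single precision
    s8 = 1.0d7
    do i = 1, 100000
        s4 = s4 + 0.1           ! 0.1 is below half a unit in the last place of 1.0e7, so each addition is lost
        s8 = s8 + 0.1d0
    end do
    print *, s4, s8             ! s4 is still 1.0e7; s8 is close to 1.001e7
end program sum_precision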

Have NUM_THREADS(1) case running now but it will be an hour or so before it finishes. CPU usage is ~25% so appears to be working correctly.

This has made me look more closely at the function, which obviously consumes a significant chunk of the CPU time. It is a simple interpolation routine, but it's a generic one I use elsewhere. I now realise some of the generality is not required in this instance (the sequence interpolated in this case always increases monotonically), so I have now written a specific version with some of the code culled. Looking at the current progress of the NUM_THREADS(1) case it may be providing a c. 12% saving overall.
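For what it's worth, the specialized version boils down to something like this (a sketch with illustrative names; it assumes X(1:n) increases monotonically, and has no SAVEd state so it is safe to call from multiple threads):

pure function interp_mono(X, Y, n, xq) result(yq)
    ! linear interpolation for a monotonically increasing abscissa X(1:n);
    ! no direction checks or sorting, just a forward scan for the bracketing interval
    integer, intent(in) :: n
    real, intent(in)    :: X(n), Y(n), xq
    real                :: yq
    integer :: i
    do i = 2, n-1
        if (xq <= X(i)) exit
    end do
    yq = Y(i-1) + (Y(i) - Y(i-1)) * (xq - X(i-1)) / (X(i) - X(i-1))
end function interp_mono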




0 Kudos
davidspurr
Beginner
942 Views
Ran several variations.

+ Aggregation of Res(). Order of summation confirmed not an issue; i.e. changing Res() from single to double precision had only a very minor impact (diff <0.05% within the range of interest). Hence the changed summation order due to threading should have no impact.

+ NUM_THREADS(1) case produced identical answers to the non-threaded case.

+ Total analysis time very similar for NUM_THREADS(1) and the non-threaded cases. Hence (to my surprise) no OMP overhead. In fact time for NUM_THREADS(1) was marginally less (<1% difference, but consistent for both single & double precision runs).

+ NUM_THREADS(3) case results 5% - 7% lower than the NUM_THREADS(1) / non-threaded case results (same difference for both double and single precision), so something is still wrong.

+ Total analysis time again 17.3% less than the non-threaded case (but with the CPU running 3x harder).

Will persevere a little more, but will soon need to get back to more pressing issues :-(.

Thanks for all the help.
David
0 Kudos
davidspurr
Beginner
942 Views
From scouring the manual & the internet it is clear that Res(j,iloc(k)) = Res(j,iloc(k)) + p is the offending statement.

e.g. from http://www.openmp.org/presentations/miguel/F95_OpenMPv1_v2.pdf clause 3.1.8 (p43), it is clear that unpredictable results will occur unless a REDUCTION clause is used.

In OpenMP v1 it appears only a scalar or an array element is permitted within a REDUCTION clause. The document notes that the computational overhead would be very large for a large array.

It seems v2 and IVF do allow arrays(?), but not of deferred shape or assumed size.
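For reference, the form those rules point to is an explicit-shape array dummy, along these lines (a sketch; the index and contribution computations are placeholders):

subroutine accumulate(Res, nj, nk, npts)
    integer, intent(in) :: nj, nk, npts
    real :: Res(nj, nk)              ! explicit shape, so REDUCTION(+: Res) is syntactically permitted
    integer :: i, j, k
    real :: p
!$OMP PARALLEL DO PRIVATE(j, k, p) REDUCTION(+: Res)
    do i = 1, npts
        j = mod(i, nj) + 1           ! placeholder index computation
        k = mod(i, nk) + 1
        p = real(i)                  ! placeholder contribution
        Res(j, k) = Res(j, k) + p
    end do
!$OMP END PARALLEL DO
end subroutine accumulate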

Unfortunately my array (Res) is declared allocatable in a module and is allocated in a different subroutine, so the compiler throws a syntax error (deferred shape or assumed size not permitted).

I 'restructured' the code so that Res() & its dimensions came into the subroutine via the argument list, then attempted to include a REDUCTION(+: Res) clause. However, that still throws a compiler error. In this case I only get a single-line error statement
Error 1 Compilation Aborted (code 1)
with no further details (except it does show the name of the file at fault - i.e. the one that includes the subroutine with the OMP stuff). Not a lot of help! The compiler error disappears if I remove the "REDUCTION(+: Res)" spec from the !$OMP statement.

Stumped !

David


[EDIT:] Build log includes a little more info:
fortcom: Fatal: There has been an internal compiler error (C0000005)

Also see http://www.hpcvl.org/faqs/mpi/OpenMP.html item 4.

6th paragraph below the example code:
" Finally, we instruct the compiler to treat the value of mys specially. The reduce(+:mys) instruction causes a private value for mys to be initialized with the current mys value before thread creation. After all loop iterations have been completed, the different private values are reduced to a single one by a sum (+ sign in the directive)."

0 Kudos
davidspurr
Beginner
942 Views
Further digging solved the problem of invalid results. As Tim suggested, there was a "race" condition in the Res(j,iloc(k)) = Res(j,iloc(k)) + p statement, which meant Res() was not updated correctly in all DO loop iterations.

The solution was to precede the statement with the !$OMP ATOMIC directive, since REDUCTION could not be used for the array. With !$OMP ATOMIC set, the results for the NUM_THREADS(3) case are now identical to the non-threaded solution.
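i.e. the change was simply:

!$OMP ATOMIC
Res(j,iloc(k)) = Res(j,iloc(k)) + p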

However, performance takes a further small hit. Total runtime for 3 threads is now only 12% less than the non-threaded case (at the expense of 3x the CPU demand). Looks like the heavy premium paid for the quad core might have been somewhat mis-spent!

Maybe there is a way to improve the performance further but the gains seem relatively limited. Not clear if ATOMIC locks the whole array (i.e. in effect, by locking the assignment), or just locks the "active" array element (so other threads could update other elements at the same time). Hopefully the latter, but I have my doubts.

Pity that REDUCTION cannot be applied at least to small arrays. In my case, Res() is typically (25,6) or smaller. Storing 3 temporary local copies would seem trivial and should be more efficient than ATOMIC or other restrictions.

David
0 Kudos
jimdempseyatthecove
Honored Contributor III
942 Views

David,

If you are requiring an ATOMIC on Res(j,iloc(k)) = Res(j,iloc(k)) + p then it would appear that multiple threads are updating the same cell in Res. Your prior explanations were not clear (to me) that multiple threads would be sharing the same locations in Res.

If Res is an accumulation array that gets updated many times per cell then do as your last paragraph suggests and create multiple Res arrays and consolidate them on termination of loops.

real :: Res(Nx, Ny)
real :: ResLocal(Nx, Ny)
...
Res = 0.0
!$OMP PARALLEL DO PRIVATE(ResLocal)
do I=1,NumberIterations
    ResLocal = 0.0
    call DoWork(ResLocal, I)
!$OMP CRITICAL
    Res = Res + ResLocal
!$OMP END CRITICAL
end do
!$OMP END PARALLEL DO

The above assumes DoWork runs a relatively long time.
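If the per-iteration work turns out to be short, a variant that pays the CRITICAL cost once per thread rather than once per iteration looks roughly like this (it assumes DoWork adds its contribution into ResLocal rather than overwriting it):

Res = 0.0
!$OMP PARALLEL PRIVATE(ResLocal)
ResLocal = 0.0                    ! one private accumulator per thread
!$OMP DO
do I=1,NumberIterations
    call DoWork(ResLocal, I)      ! accumulate into the thread's private copy
end do
!$OMP END DO
!$OMP CRITICAL
Res = Res + ResLocal              ! one consolidation per thread
!$OMP END CRITICAL
!$OMP END PARALLEL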

Jim Dempsey

0 Kudos
Steven_L_Intel1
Employee
942 Views
David,

You may want to download a trial of the Intel Thread Profiler. It can help you visualize your application's use of threads and point to specific lines of code that are causing stalls.
0 Kudos
davidspurr
Beginner
942 Views
Jim

Sorry if I did not make the situation clear. I tried to indicate that multiple threads update the same cells by using "iloc(k)" in "Res(j,iloc(k))" and by the comments that I thought the likely source of the error was in "the final summation at the end of the loop" and that "The result summation is basically the same as R = R + p where all threads aggregate the same R." (post 30246740).

I gather from your post that the key terminology should have been "accumulation" rather than "summation" or "aggregate".

I suspect the biggest savings will come if I can parallelize the outer loop I started with in my original post. That will require parts of the code to be restructured and will require more time than I have at the moment. But I am keen to get the parallel code working, so I will revisit the situation sometime in the next couple of months. Both Tim's and your comments have been very helpful and I now have a much better appreciation of what I need to do.

Thanks to all who have helped
David
0 Kudos
jimdempseyatthecove
Honored Contributor III
942 Views

David,

As a general rule, the further out you begin the parallelization the better the performance (memorize: parallel outer, vector inner). That said, as you discovered, there are ordering and interaction issues that must be addressed. As you get into parallel programming you will get accustomed to the issues.
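A minimal illustration of that rule (the names are placeholders):

!$OMP PARALLEL DO PRIVATE(J)
do I = 1, N                    ! parallel across the outer iterations
    do J = 1, M                ! inner loop left to the compiler's vectorizer
        A(J,I) = B(J,I) * C(J,I)
    end do
end do
!$OMP END PARALLEL DO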

The important concepts that you have just learned are:

If multiple threads update a location then an interlocking method is required.
Interlocking methods are not "free" (have overhead)
Sequence of execution may be important
and other issues addressed in this thread

To summarize

If the amount of computation time is significant as compared to the interlocking overhead then choose the simpler coding method containing ATOMIC or CRITICAL section.

If the amount of computation time is small compared to the interlocking overhead then choose a more complex coding method that avoids or reduces the interlocking overhead.

Get your application working first, then address the optimization issues later. This will give you a baseline performance and also provide the reference data (keep a copy of the original code in a separate project area so you can produce different test data as needed).

Good luck,

Jim Dempsey

0 Kudos
davidspurr
Beginner
942 Views
Jim

As per the new thread, I persevered a little more and at last made worthwhile progress.

Thanks,
David
0 Kudos