I am parallelizing a code using OpenMP, running on a Core 2 Extreme XC8600 and a Core 2 Quad QX6700. The code segment below works most of the time but occasionally gives a wrong result.
============CODE SEGMENT=============
!
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(tnumber,nst,pswk1) REDUCTION(+:rho,tau,currnt,sodens,spinden,kinvden,spincur)
!
tnumber = OMP_GET_THREAD_NUM()
nst = (npmin(iq) - 1) + (tnumber + 1) + (ido - 1)*maxthreads
if (nst > npsi(iq)) then
stop " processor index not in wavefunction index range"
end if
!
call densit (lpsi, ncolx, ncoly, ncolz, nst, iq, vocc, rho, tau, currnt,&
sodens, spinden, kinvden, spincur, der1x, der1y, der1z,&
psi(1,1,1,1,nst,iq), pswk1, itheta, itimrev)
!
!$OMP END PARALLEL
======================================
The arrays in the REDUCTION clause are all calculated in
the subroutine from private copies of the psi() array and
do not interact with each other.
I have tried ifort 9.1.037, and 9.1.040 on another similarly configured machine. I am using Fedora Core 6 with all updates applied.
Thanks
When you have a sum large enough to motivate a sum reduction, single precision typically isn't sufficient to get satisfactory results or to avoid visible differences associated with the number of threads. Many large applications take specific steps to maintain consistency.
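As a quick illustration (purely hypothetical numbers, not from the code above): a long single-precision sum stalls once the running total dwarfs the increments, and the result also depends on how the sum is split across threads, whereas double precision keeps the small contributions.
program precision_demo
   implicit none
   integer :: i
   real :: s4
   double precision :: s8
   s4 = 1.0e0
   s8 = 1.0d0
   do i = 1, 10000000
      s4 = s4 + 1.0e-8   ! below one ulp of 1.0 in single precision, so it is lost
      s8 = s8 + 1.0d-8   ! retained in double precision
   end do
   write(*,*) 's4 =', s4   ! prints about 1.0000000
   write(*,*) 's8 =', s8   ! prints about 1.1
end program precision_demo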
In addition, I am not talking about small deviations but
rather large ones. For example, the density rho normally
integrates to 64.0000 but suddenly becomes 63.0000,
and other quantities that should integrate to the 1e-26 level
become 1e-6. In some runs that execute the above
thousands of times everything works perfectly; on other
occasions (starting the same executable under identical
conditions) it suddenly goes wrong during one iteration,
almost as if it does not complete the sum.
Thanks
The number of threads available to a parallel region may be less than the maximum number of threads. Use OMP_GET_NUM_THREADS() issued inside the parallel region instead of maxthreads obtained outside of it. This may not be your bug this time, but it may be at a later time. It would, however, upset your striping of the array being processed by densit, so more work is required to handle the case where the number of threads available to the parallel region is less than the maximum.
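A minimal, self-contained sketch of that idea (the range 1..nstates is a hypothetical stand-in for npmin(iq)..npsi(iq)):
program team_size_demo
   use omp_lib
   implicit none
   integer, parameter :: nstates = 24
   integer :: nst, tnumber, nteam
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(tnumber, nteam, nst)
   nteam = OMP_GET_NUM_THREADS()         ! actual team size, queried inside the region
   tnumber = OMP_GET_THREAD_NUM()
   do nst = 1 + tnumber, nstates, nteam  ! stride by the real team size, not maxthreads
      write(*,*) 'state', nst, 'handled by thread', tnumber
   end do
   !$OMP END PARALLEL
end program team_size_demo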
Examine densit to verify the DEFAULT(SHARED) variables lpsi, ncolx, ncoly, ncolz, iq, vocc, der1x, der1y, der1z, psi(array), itheta, and itimrev won't present a problem when called in parallel.
Another thing that bites me now and again is assuming that what looks like stack-local storage is actually in subroutine static storage.
real :: vec(3)
is either on stack or in static storage depending on compiler options. If densit contains something like that then consider changing to
real, automatic :: vec(3)
Then you do not have to worry about the compiler switch that places vectors on stack.
Jim Dempsey
I will make a number of changes and post my findings in a
couple of days. First, I will define a new private pswk4
array and load it with each thread's segment of the psi()
array. All other variables are just shared and not changed
at all in the subroutine. Other than that, the subroutine
has only a few local scalar variables (no arrays). I could
use the AUTOMATIC attribute, but to my understanding using
-openmp already implies -automatic.
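A minimal, self-contained sketch of that plan (pswk4 and the array shapes here are hypothetical stand-ins; the point is copying each thread's slice into a PRIVATE work array):
program private_slice_demo
   use omp_lib
   implicit none
   integer, parameter :: nx = 4, nstates = 8
   real(8) :: psi(nx, nstates)    ! shared array, a stand-in for the real psi()
   real(8) :: pswk4(nx)           ! per-thread private work copy
   integer :: nst
   psi = 1.0d0
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(nst, pswk4)
   do nst = 1 + OMP_GET_THREAD_NUM(), nstates, OMP_GET_NUM_THREADS()
      pswk4 = psi(:, nst)         ! load this thread's slice into its private copy
      ! ... work on pswk4 instead of psi(:, nst) ...
   end do
   !$OMP END PARALLEL
end program private_slice_demo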
Happy New Year!
I came back to one of your suggestions. The problem was
that the environment variable OMP_DYNAMIC was set to TRUE.
Consequently, during the run the system occasionally changed
the number of available threads for performance reasons.
Setting OMP_DYNAMIC to FALSE solves the problem.
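For reference, the dynamic adjustment can also be turned off in the code itself before the first parallel region (a minimal sketch; the program name is made up):
program fixed_team_demo
   use omp_lib
   implicit none
   call OMP_SET_DYNAMIC(.FALSE.)   ! same effect as exporting OMP_DYNAMIC=FALSE
   !$OMP PARALLEL
   write(*,*) 'team size =', OMP_GET_NUM_THREADS()
   !$OMP END PARALLEL
end program fixed_team_demo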
Thank you all
Sammy
I caution you against assuming maxthreads threads are always available to a parallel region. Fix the code now while you are focused on this problem. That way, as you later make improvements to your program, the problem won't resurface and take you or your successor on an unnecessary debugging session.
Jim
The problem is that the logic of dividing up the states
among threads changes, which means I need to think about
this a little more.
Thanks
Look at using SCHEDULE with a chunk size. You can set the chunk size to the total size divided by maxthreads; this way the work is divided up consistently regardless of the number of available threads. If, for example, maxthreads = 4 but only 3 threads are available, then optimally the code still runs about 2x faster than serial (three chunks of 1/4 in parallel, plus one chunk of 1/4 in an additional pass). YMMV.
Your program need not stop if the number of available threads does not equal the number of expected threads.
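A minimal, self-contained sketch of that idea (the state count and loop body are hypothetical stand-ins):
program chunk_demo
   use omp_lib
   implicit none
   integer, parameter :: nstates = 24
   integer :: nst, maxthreads, chunk
   maxthreads = OMP_GET_MAX_THREADS()
   chunk = (nstates + maxthreads - 1) / maxthreads   ! ceiling of nstates/maxthreads
   !$OMP PARALLEL DO SCHEDULE(STATIC, chunk) DEFAULT(SHARED) PRIVATE(nst)
   do nst = 1, nstates
      ! ... per-state work (e.g. the densit call) would go here ...
      write(*,*) 'state', nst, 'on thread', OMP_GET_THREAD_NUM()
   end do
   !$OMP END PARALLEL DO
end program chunk_demo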
Jim
I have run into two problems:
1. The compiler does not seem to give an error message if an OMP line is
longer than the maximum line length (say 132). It may be compiling
such lines correctly, but I believe it should complain.
2. My long loops work fine; however, I have one loop of length just
2 that I would like to run on two processors/threads. The loop is:
!$OMP PARALLEL NUM_THREADS(2) DEFAULT(SHARED) PRIVATE(iq,tpsi)
!$OMP DO SCHEDULE(static)
do iq = 1, 2
call schmid (iq, lpsi, ncolx, ncoly, ncolz, npsi, npmin, spnorm, psi, tpsi, itimrev, wxyz)
end do
!$OMP END DO
!$OMP END PARALLEL
This always segfaults but this one always works:
nthreads = 2
call OMP_SET_NUM_THREADS(nthreads)
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(tnumber,iq,tpsi)
tnumber = OMP_GET_THREAD_NUM()
iq = tnumber + 1
if(iq > 2) stop "Wrong iq for schmid"
call schmid (iq, lpsi, ncolx, ncoly, ncolz, npsi, npmin, spnorm, psi, tpsi, itimrev, wxyz)
!$OMP END PARALLEL
Is the NUM_THREADS() clause implemented? I ask because this code runs
correctly if I use gfortran from gcc-4.1.1.
Thanks
Umar,
1. When I run into an overly long !$OMP statement, it is usually due to too many variables appearing in SHARED or PRIVATE. When the statement gets very long (e.g. wider than fits in the edit window of the IDE), it is a good time to clean up the code.
If SHARED is long because of too many references to stack-local variables, it is a good time to create a derived type as a container and use that to pass your arguments (see the sketch after item 2 below).
If SHARED is long because of too many references to global variables (in modules or COMMON), it is a good time to turn the code enclosed in the parallel section into a subroutine.
If PRIVATE is the one with too many references, it is likewise a good time to turn the code enclosed in the parallel section into a subroutine.
Note that with !$OMP DO and !$OMP PARALLEL, the Fortran DO and END DO are not part of the enclosed parallel section; instead they belong with the adjacent (and enclosing) !$OMP statements.
2. Try using the chunk parameter of SCHEDULE on the !$OMP DO.
A chunk of 1 has each thread take one index of the do loop at a time.
!$OMP PARALLEL NUM_THREADS(2) DEFAULT(SHARED) PRIVATE(iq,tpsi)
!$OMP DO SCHEDULE(static, 1)
do iq = 1, 2
or
nthreads = 2
call OMP_SET_NUM_THREADS(nthreads)
!$OMP PARALLEL DO SCHEDULE(static,1) DEFAULT(SHARED) PRIVATE(iq,tpsi)
do iq = 1, 2
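As a minimal, self-contained sketch of the derived-type container idea from item 1 (all names here are made up):
module args_mod
   implicit none
   type :: densit_args                  ! bundles what would otherwise be many SHARED items
      integer :: ncolx, ncoly, ncolz
      real(8), allocatable :: rho(:)
   end type densit_args
end module args_mod
program container_demo
   use args_mod
   implicit none
   type(densit_args) :: a
   integer :: i
   a%ncolx = 8; a%ncoly = 8; a%ncolz = 8
   allocate(a%rho(a%ncolx))
   a%rho = 0.0d0
   !$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)   ! only the container and the loop index to manage
   do i = 1, a%ncolx
      a%rho(i) = real(i, 8)
   end do
   !$OMP END PARALLEL DO
   write(*,*) a%rho
end program container_demo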
Jim Dempsey
I may have found a compiler bug, or I am missing something in my understanding.
I can reproduce this with a sample program, pasted below.
When you compile and run it, the "write(*,*) iq,nst" from the first
two parallel sections shows iq as 0, where it should clearly be 1 or 2.
On the other hand, if you go down to the last parallel section and change
all iq's to ix, then the top two parallel sections give the correct values for
iq. Here is the test program:
program test
implicit none
integer :: OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
integer :: OMP_GET_MAX_THREADS, OMP_GET_NUM_PROCS
integer :: nthreads, tnumber
integer :: nprocs, maxthreads
integer :: iq, nst, ix
!-----------------------------------------------------------------------
! determine/set the number of processors on each node for openmp
!-----------------------------------------------------------------------
call OMP_SET_DYNAMIC(.FALSE.)
nprocs = OMP_GET_NUM_PROCS()
call OMP_SET_NUM_THREADS(nprocs)
maxthreads = OMP_GET_MAX_THREADS()
if (nprocs /= maxthreads) then
write(*,*) "WARNING: Number of processors different than number of threads"
endif
!
iq = 1
!
!$OMP PARALLEL NUM_THREADS(maxthreads) DEFAULT(SHARED) PRIVATE(nst)
!$OMP DO SCHEDULE(static,1)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END DO
!$OMP END PARALLEL
!
iq = 2
!
!$OMP PARALLEL NUM_THREADS(maxthreads) DEFAULT(SHARED) PRIVATE(nst)
!$OMP DO SCHEDULE(static,1)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END DO
!$OMP END PARALLEL
!
!$OMP PARALLEL NUM_THREADS(2) DEFAULT(SHARED) PRIVATE(iq)
!$OMP DO SCHEDULE(static,1)
do iq = 1, 2
write(*,*) iq
end do
!$OMP END DO
!$OMP END PARALLEL
!
end program test
I haven't run your test but I think I see the problem.
The !$OMP PARALLEL ... DEFAULT(SHARED) declares a parallel block where iq is shared. However, !$OMP DO creates a new (nested) parallel block where iq becomes whatever the default is. So iq is "shared" for zero statements between the nested parallel sections, and it appears a private copy of iq is created for the DO. The !$OMP DO should have SHARED(iq) or DEFAULT(SHARED).
Jim Dempsey
I ran your test in W_FC_C_9.1.032 in Windows XP x32 and this does seem to be a code generation problem. I rewrote the code as
program test
implicit none
integer :: OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
integer :: OMP_GET_MAX_THREADS, OMP_GET_NUM_PROCS
integer :: nthreads, tnumber
integer :: nprocs, maxthreads
integer :: iq, nst, ix
!-----------------------------------------------------------------------
! determine/set the number of processors on each node for openmp
!-----------------------------------------------------------------------
call OMP_SET_DYNAMIC(.FALSE.)
nprocs = OMP_GET_NUM_PROCS()
call OMP_SET_NUM_THREADS(nprocs)
maxthreads = OMP_GET_MAX_THREADS()
if (nprocs /= maxthreads) then
write(*,*) "WARNING: Number of processors different than number of threads"
endif
!
iq = 1
!
!$OMP PARALLEL DO DEFAULT(SHARED) COPYIN(iq) PRIVATE(nst)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END PARALLEL DO
!
iq = 2
!
!$OMP PARALLEL DO DEFAULT(SHARED) COPYIN(iq) PRIVATE(nst)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END PARALLEL DO
!
!$OMP PARALLEL NUM_THREADS(2) DEFAULT(SHARED) PRIVATE(iq)
!$OMP DO SCHEDULE(static,1)
do iq = 1, 2
write(*,*) iq
end do
!$OMP END DO
!$OMP END PARALLEL
!
end program test
And the disassembly snippets:
iq = 1
0040110A C7 45 D0 01 00 00 00 mov dword ptr [ebp-30h],1
...
write(*,*) iq,nst
00401281 C7 85 38 FF FF FF 00 00 00 00 mov dword ptr [ebp+FFFFFF38h],0
0040128B 8B 45 D0 mov eax,dword ptr [ebp-30h]
The iq in both sections is referenced at [ebp-30h]. ebp is the stack frame pointer.
What is not shown here is that ebp differs between the main program level
and the inner OMP DO level (as it should). The offset (-30h) shouldn't be the same
unless by chance the variable placements were forced to the same offsets.
I reworked the program to the following:
program test
use omp_lib
implicit none
integer :: iq, nst, ix
!-----------------------------------------------------------------------
! determine/set the number of processors on each node for openmp
!-----------------------------------------------------------------------
!
iq = 1
!
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(nst)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END PARALLEL DO
!
call do1(iq)
iq = 2
!
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(nst)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END PARALLEL DO
!
!$OMP PARALLEL NUM_THREADS(2) DEFAULT(SHARED) PRIVATE(iq)
!$OMP DO SCHEDULE(static,1)
do iq = 1, 2
write(*,*) iq
end do
!$OMP END DO
!$OMP END PARALLEL
!
end program test
subroutine do1(iq)
integer :: iq, nst
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(nst)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END PARALLEL DO
end subroutine do1
The code of the 1st do loop was copied into a subroutine that is
called following the 1st loop. The subroutine performs the same
statements, except that iq is a dummy argument and nst is a
stack-local variable of the subroutine.
The subroutine produces the correct results, so it appears
the compiler is producing bad code.
I performed an additional test changing the !$OMP PARALLEL DO to
!
!$OMP PARALLEL DO SHARED(iq) COPYIN(iq) PRIVATE(nst)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END PARALLEL DO
!
call do1(iq)
Now both produce the correct result.
It seems like there is a problem with DEFAULT, possibly in
combination with COPYIN.
The following produces incorrect results:
!$OMP PARALLEL DO DEFAULT(SHARED) COPYIN(iq) PRIVATE(nst)
whereas the following produces the correct results:
!$OMP PARALLEL DO SHARED(iq) COPYIN(iq) PRIVATE(nst)
I suggest filing an issue with Premier Support.
I haven't tested this on 9.1.033; I have that installed on my notebook, so I will see if I can test it on 9.1.033 tonight.
***
Note the USE OMP_LIB.
You really need the USE, because the interface blocks it provides are required for some of the functions to work.
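For instance, with the module there is no need to declare the OpenMP routines by hand (a minimal sketch; the program name is made up):
program omp_lib_demo
   use omp_lib                     ! proper interfaces and kind parameters for the OpenMP routines
   implicit none
   call OMP_SET_DYNAMIC(.FALSE.)
   call OMP_SET_NUM_THREADS(OMP_GET_NUM_PROCS())
   write(*,*) 'max threads =', OMP_GET_MAX_THREADS()
end program omp_lib_demo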
Jim Dempsey
i386 and em64t versions.
1. The culprit is the use of iq as the do loop variable in the last parallel
section. If we remove the last parallel section, then the first two
sections print the correct value of iq.
2. The bad test cases give different answers at -O0 and at -On for n > 0.
So in a sense your sample program was using iq as both shared and private within the scope of the !$OMP PARALLEL.
I don't know if this is a bug or a violation of programming rules. I do think the compiler should at least issue a warning "Ambiguous use of SHARED or PRIVATE".
My personal opinion is that a private iq should have been created for the last loop.
Jim
SHARED should only apply to the parallel sections, and it mostly seems
to. The exception is this context, where it applies to a do loop index
whose name is also used as a regular variable elsewhere in the code.
I believe that if it were used as a do loop index again at the other
locations it would be OK.
Anyway, we'll see what the folks at Intel say.
Thank you for your help.