I am parallelizing a code using OpenMP, running on a Core 2 Extreme XC8600 and a Core 2 Quad QX6700. The code segment below works most of the time but occasionally gives a wrong result.
============CODE SEGMENT=============
!
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(tnumber,nst,pswk1) REDUCTION(+:rho,tau,currnt,sodens,spinden,kinvden,spincur)
!
tnumber = OMP_GET_THREAD_NUM()
nst = (npmin(iq) - 1) + (tnumber + 1) + (ido - 1)*maxthreads
if (nst > npsi(iq)) then
stop " processor index not in wavefunction index range"
end if
!
call densit (lpsi, ncolx, ncoly, ncolz, nst, iq, vocc, rho, tau, currnt,&
sodens, spinden, kinvden, spincur, der1x, der1y, der1z,&
psi(1,1,1,1,nst,iq), pswk1, itheta, itimrev)
!
!$OMP END PARALLEL
======================================
The arrays in the REDUCTION clause are all calculated in
the subroutine from private copies of the psi() array and
do not interact with each other.
I have tried ifort 9.1.037, and 9.1.040 on another similarly configured machine. I am using Fedora Core 6 with all updates applied.
Thanks
When you have a sum large enough to motivate a sum reduction, single precision typically isn't sufficient to get satisfactory results or to avoid visible differences associated with the number of threads. Many large applications take specific steps to maintain consistency.
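As a quick illustration (purely hypothetical numbers, not from the code above): a long single-precision sum stalls once the running total dwarfs the increments, and the result also depends on how the sum is split across threads, whereas double precision keeps the small contributions.
program precision_demo
   implicit none
   integer :: i
   real :: s4
   double precision :: s8
   s4 = 1.0e0
   s8 = 1.0d0
   do i = 1, 10000000
      s4 = s4 + 1.0e-8   ! below one ulp of 1.0 in single precision, so it is lost
      s8 = s8 + 1.0d-8   ! retained in double precision
   end do
   write(*,*) 's4 =', s4   ! prints about 1.0000000
   write(*,*) 's8 =', s8   ! prints about 1.1
end program precision_demo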
In addition, I am not talking about small deviations but
rather large ones. For example, the density rho normally
integrates to 64.0000 but suddenly becomes 63.0000,
and other quantities that should integrate to the 1e-26 level
become 1e-6. In some runs that execute the above
thousands of times everything works perfectly; on other
occasions (starting the same executable under identical
conditions) it suddenly goes wrong during one iteration,
almost as if it does not complete the sum.
Thanks
The number of threads available to a parallel region may be less than the maximum number of threads. Use OMP_GET_NUM_THREADS() issued inside the parallel region instead of maxthreads obtained outside of it. This may not be your bug this time, but it may be at a later time. It would, however, upset your striping of the array being processed by densit, so more work is required to handle the case where the number of threads available to the parallel region is less than the maximum.
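A minimal, self-contained sketch of that idea (the range 1..nstates is a hypothetical stand-in for npmin(iq)..npsi(iq)):
program team_size_demo
   use omp_lib
   implicit none
   integer, parameter :: nstates = 24
   integer :: nst, tnumber, nteam
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(tnumber, nteam, nst)
   nteam = OMP_GET_NUM_THREADS()         ! actual team size, queried inside the region
   tnumber = OMP_GET_THREAD_NUM()
   do nst = 1 + tnumber, nstates, nteam  ! stride by the real team size, not maxthreads
      write(*,*) 'state', nst, 'handled by thread', tnumber
   end do
   !$OMP END PARALLEL
end program team_size_demo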
Examine densit to verify the DEFAULT(SHARED) variables lpsi, ncolx, ncoly, ncolz, iq, vocc, der1x, der1y, der1z, psi(array), itheta, and itimrev won't present a problem when called in parallel.
Another thing that bites me now and again is assuming that what looks like stack-local storage is actually in subroutine static storage.
real :: vec(3)
is either on stack or in static storage depending on compiler options. If densit contains something like that then consider changing to
real, automatic :: vec(3)
Then you do not have to worry about the compiler switch that places vectors on stack.
Jim Dempsey
I will make a number of changes and post my findings in a
couple of days. First, I will define a new private pswk4
array and load it with each thread's segment of the psi()
array. All other variables are just shared and not changed
at all in the subroutine. Other than that, the subroutine
has only a few local scalar variables (no arrays). I could
use the AUTOMATIC attribute, but to my understanding using
-openmp already implies -automatic.
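A minimal, self-contained sketch of that plan (pswk4 and the array shapes here are hypothetical stand-ins; the point is copying each thread's slice into a PRIVATE work array):
program private_slice_demo
   use omp_lib
   implicit none
   integer, parameter :: nx = 4, nstates = 8
   real(8) :: psi(nx, nstates)    ! shared array, a stand-in for the real psi()
   real(8) :: pswk4(nx)           ! per-thread private work copy
   integer :: nst
   psi = 1.0d0
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(nst, pswk4)
   do nst = 1 + OMP_GET_THREAD_NUM(), nstates, OMP_GET_NUM_THREADS()
      pswk4 = psi(:, nst)         ! load this thread's slice into its private copy
      ! ... work on pswk4 instead of psi(:, nst) ...
   end do
   !$OMP END PARALLEL
end program private_slice_demo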
Happy New Year!
I came back to one of your suggestions. The problem was
that the environment variable OMP_DYNAMIC was set to TRUE.
Consequently, during the run the system occasionally changed
the number of available threads for performance reasons.
Setting OMP_DYNAMIC to FALSE solves the problem.
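For reference, the dynamic adjustment can also be turned off in the code itself before the first parallel region (a minimal sketch; the program name is made up):
program fixed_team_demo
   use omp_lib
   implicit none
   call OMP_SET_DYNAMIC(.FALSE.)   ! same effect as exporting OMP_DYNAMIC=FALSE
   !$OMP PARALLEL
   write(*,*) 'team size =', OMP_GET_NUM_THREADS()
   !$OMP END PARALLEL
end program fixed_team_demo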
Thank you all
Sammy
I caution you against assuming maxthreads threads are always available to a parallel region. Fix the code now while you are focused on this problem. That way, as you later make improvements to your program, the problem won't resurface and take you or your successor on an unnecessary debugging session.
Jim
The problem is that the logic of dividing up the states
among threads changes, which means I need to think about
this a little more.
Thanks
Look at using SCHEDULE with a chunk size. You can set the chunk size to the total size divided by maxthreads; this way the work is divided up consistently regardless of the number of available threads. If, for example, maxthreads = 4 but only 3 threads are available, then optimally the code still runs about 2x faster than serial (three chunks of 1/4 in parallel, plus one chunk of 1/4 in an additional pass). YMMV.
Your program need not stop if the number of available threads does not equal the number of expected threads.
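A minimal, self-contained sketch of that idea (the state count and loop body are hypothetical stand-ins):
program chunk_demo
   use omp_lib
   implicit none
   integer, parameter :: nstates = 24
   integer :: nst, maxthreads, chunk
   maxthreads = OMP_GET_MAX_THREADS()
   chunk = (nstates + maxthreads - 1) / maxthreads   ! ceiling of nstates/maxthreads
   !$OMP PARALLEL DO SCHEDULE(STATIC, chunk) DEFAULT(SHARED) PRIVATE(nst)
   do nst = 1, nstates
      ! ... per-state work (e.g. the densit call) would go here ...
      write(*,*) 'state', nst, 'on thread', OMP_GET_THREAD_NUM()
   end do
   !$OMP END PARALLEL DO
end program chunk_demo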
Jim
I have run into two problems:
1. The compiler does not seem to give an error message if an OMP line is
longer than the maximum line length (say 132). It may be compiling
such lines correctly, but I believe it should complain.
2. My long loops work fine; however, I have one loop of length just
2 that I would like to run on two processors/threads. The loop is:
!$OMP PARALLEL NUM_THREADS(2) DEFAULT(SHARED) PRIVATE(iq,tpsi)
!$OMP DO SCHEDULE(static)
do iq = 1, 2
call schmid (iq, lpsi, ncolx, ncoly, ncolz, npsi, npmin, spnorm, psi, tpsi, itimrev, wxyz)
end do
!$OMP END DO
!$OMP END PARALLEL
This always segfaults but this one always works:
nthreads = 2
call OMP_SET_NUM_THREADS(nthreads)
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(tnumber,iq,tpsi)
tnumber = OMP_GET_THREAD_NUM()
iq = tnumber + 1
if(iq > 2) stop "Wrong iq for schmid"
call schmid (iq, lpsi, ncolx, ncoly, ncolz, npsi, npmin, spnorm, psi, tpsi, itimrev, wxyz)
!$OMP END PARALLEL
Is the NUM_THREADS() clause implemented? I ask because this code runs
correctly if I use gfortran from gcc-4.1.1.
Thanks
Umar,
1. When I run into an overly long !$OMP statement, it is usually due to too many variables appearing in SHARED or PRIVATE. When the statement gets very long (e.g. wider than fits in the edit window of the IDE), it is a good time to clean up the code.
If SHARED is long because of too many references to stack-local variables, it is a good time to create a derived type as a container and use that to pass your arguments (see the sketch after item 2 below).
If SHARED is long because of too many references to global variables (in modules or COMMON), it is a good time to turn the code enclosed in the parallel section into a subroutine.
If PRIVATE is the one with too many references, it is likewise a good time to turn the code enclosed in the parallel section into a subroutine.
Note that with !$OMP DO and !$OMP PARALLEL, the Fortran DO and END DO are not part of the enclosed parallel section; instead they belong with the adjacent (and enclosing) !$OMP statements.
2. Try using the chunk parameter of SCHEDULE on the !$OMP DO.
A chunk of 1 has each thread take one index of the do loop at a time.
!$OMP PARALLEL NUM_THREADS(2) DEFAULT(SHARED) PRIVATE(iq,tpsi)
!$OMP DO SCHEDULE(static, 1)
do iq = 1, 2
or
nthreads = 2
call OMP_SET_NUM_THREADS(nthreads)
!$OMP PARALLEL DO SCHEDULE(static,1) DEFAULT(SHARED) PRIVATE(iq,tpsi)
do iq = 1, 2
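As a minimal, self-contained sketch of the derived-type container idea from item 1 (all names here are made up):
module args_mod
   implicit none
   type :: densit_args                  ! bundles what would otherwise be many SHARED items
      integer :: ncolx, ncoly, ncolz
      real(8), allocatable :: rho(:)
   end type densit_args
end module args_mod
program container_demo
   use args_mod
   implicit none
   type(densit_args) :: a
   integer :: i
   a%ncolx = 8; a%ncoly = 8; a%ncolz = 8
   allocate(a%rho(a%ncolx))
   a%rho = 0.0d0
   !$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)   ! only the container and the loop index to manage
   do i = 1, a%ncolx
      a%rho(i) = real(i, 8)
   end do
   !$OMP END PARALLEL DO
   write(*,*) a%rho
end program container_demo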
Jim Dempsey
I may have found a compiler bug, or I am missing something in my understanding.
I can reproduce this with a sample program, pasted below.
When you compile and run it, the "write(*,*) iq,nst" from the first
two parallel sections shows iq as 0, where it should clearly be 1 or 2.
On the other hand, if you go down to the last parallel section and change
all iq's to ix, then the top two parallel sections give the correct values for
iq. Here is the test program:
program test
implicit none
integer :: OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
integer :: OMP_GET_MAX_THREADS, OMP_GET_NUM_PROCS
integer :: nthreads, tnumber
integer :: nprocs, maxthreads
integer :: iq, nst, ix
!-----------------------------------------------------------------------
! determine/set the number of processors on each node for openmp
!-----------------------------------------------------------------------
call OMP_SET_DYNAMIC(.FALSE.)
nprocs = OMP_GET_NUM_PROCS()
call OMP_SET_NUM_THREADS(nprocs)
maxthreads = OMP_GET_MAX_THREADS()
if (nprocs /= maxthreads) then
write(*,*) "WARNING: Number of processors different than number of threads"
endif
!
iq = 1
!
!$OMP PARALLEL NUM_THREADS(maxthreads) DEFAULT(SHARED) PRIVATE(nst)
!$OMP DO SCHEDULE(static,1)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END DO
!$OMP END PARALLEL
!
iq = 2
!
!$OMP PARALLEL NUM_THREADS(maxthreads) DEFAULT(SHARED) PRIVATE(nst)
!$OMP DO SCHEDULE(static,1)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END DO
!$OMP END PARALLEL
!
!$OMP PARALLEL NUM_THREADS(2) DEFAULT(SHARED) PRIVATE(iq)
!$OMP DO SCHEDULE(static,1)
do iq = 1, 2
write(*,*) iq
end do
!$OMP END DO
!$OMP END PARALLEL
!
end program test
I haven't run your test but I think I see the problem.
The !$OMP PARALLEL ... DEFAULT(SHARED) declares a parallel block where iq is shared. However, !$OMP DO creates a new (nested) parallel block where iq becomes whatever the default is. So iq is "shared" for zero statements between the nested parallel sections, and it appears a private copy of iq is created for the DO. The !$OMP DO should have SHARED(iq) or DEFAULT(SHARED).
Jim Dempsey
I ran your test in W_FC_C_9.1.032 in Windows XP x32 and this does seem to be a code generation problem. I rewrote the code as
program test
implicit none
integer :: OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
integer :: OMP_GET_MAX_THREADS, OMP_GET_NUM_PROCS
integer :: nthreads, tnumber
integer :: nprocs, maxthreads
integer :: iq, nst, ix
!-----------------------------------------------------------------------
! determine/set the number of processors on each node for openmp
!-----------------------------------------------------------------------
call OMP_SET_DYNAMIC(.FALSE.)
nprocs = OMP_GET_NUM_PROCS()
call OMP_SET_NUM_THREADS(nprocs)
maxthreads = OMP_GET_MAX_THREADS()
if (nprocs /= maxthreads) then
write(*,*) "WARNING: Number of processors different than number of threads"
endif
!
iq = 1
!
!$OMP PARALLEL DO DEFAULT(SHARED) COPYIN(iq) PRIVATE(nst)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END PARALLEL DO
!
iq = 2
!
!$OMP PARALLEL DO DEFAULT(SHARED) COPYIN(iq) PRIVATE(nst)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END PARALLEL DO
!
!$OMP PARALLEL NUM_THREADS(2) DEFAULT(SHARED) PRIVATE(iq)
!$OMP DO SCHEDULE(static,1)
do iq = 1, 2
write(*,*) iq
end do
!$OMP END DO
!$OMP END PARALLEL
!
end program test
And the disassembly snippets:
iq = 1
0040110A C7 45 D0 01 00 00 00 mov dword ptr [ebp-30h],1
...
write(*,*) iq,nst
00401281 C7 85 38 FF FF FF 00 00 00 00 mov dword ptr [ebp+FFFFFF38h],0
0040128B 8B 45 D0 mov eax,dword ptr [ebp-30h]
The iq in both sections is referenced at [ebp-30h]. ebp is the stack frame pointer.
What is not shown here is that ebp differs between the main program level
and the inner OMP DO level (as it should). The offset (-30h) shouldn't be the same
unless by chance the variable placements were forced to the same offsets.
I reworked the program to the following:
program test
use omp_lib
implicit none
integer :: iq, nst, ix
!-----------------------------------------------------------------------
! determine/set the number of processors on each node for openmp
!-----------------------------------------------------------------------
!
iq = 1
!
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(nst)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END PARALLEL DO
!
call do1(iq)
iq = 2
!
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(nst)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END PARALLEL DO
!
!$OMP PARALLEL NUM_THREADS(2) DEFAULT(SHARED) PRIVATE(iq)
!$OMP DO SCHEDULE(static,1)
do iq = 1, 2
write(*,*) iq
end do
!$OMP END DO
!$OMP END PARALLEL
!
end program test
subroutine do1(iq)
integer :: iq, nst
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(nst)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END PARALLEL DO
end subroutine do1
The code of the 1st do loop was copied into a subroutine that is
called following the 1st loop. The subroutine performs the same
statements, except that iq is a dummy argument and nst is a
stack-local variable of the subroutine.
The subroutine produces the correct results, so it appears
the compiler is producing bad code.
I performed an additional test changing the !$OMP PARALLEL DO to
!
!$OMP PARALLEL DO SHARED(iq) COPYIN(iq) PRIVATE(nst)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END PARALLEL DO
!
call do1(iq)
Now both produce the correct result.
It seems like there is a problem with DEFAULT, possibly in
combination with COPYIN.
The following produces incorrect results:
!$OMP PARALLEL DO DEFAULT(SHARED) COPYIN(iq) PRIVATE(nst)
whereas the following produces the correct results:
!$OMP PARALLEL DO SHARED(iq) COPYIN(iq) PRIVATE(nst)
I suggest filing an issue with Premier Support.
I haven't tested this on 9.1.033; I have that installed on my notebook, so I will see if I can test it on 9.1.033 tonight.
***
Note the USE OMP_LIB.
You really need the USE, because the interface blocks it provides are required for some of the functions to work.
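For instance, with the module there is no need to declare the OpenMP routines by hand (a minimal sketch; the program name is made up):
program omp_lib_demo
   use omp_lib                     ! proper interfaces and kind parameters for the OpenMP routines
   implicit none
   call OMP_SET_DYNAMIC(.FALSE.)
   call OMP_SET_NUM_THREADS(OMP_GET_NUM_PROCS())
   write(*,*) 'max threads =', OMP_GET_MAX_THREADS()
end program omp_lib_demo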
Jim Dempsey
i386 and em64t versions.
1. The culprit is the use of iq as the do loop variable in the last parallel
section. If we remove the last parallel section, then the first two
sections print the correct value of iq.
2. The bad test cases give different answers at -O0 and at -On for n > 0.
So in a sense your sample program was using iq as both shared and private within the scope of the !$OMP PARALLEL.
I don't know if this is a bug or a violation of programming rules. I do think the compiler should at least issue a warning "Ambiguous use of SHARED or PRIVATE".
My personal opinion is that a private iq should have been created for the last loop.
Jim
SHARED should only apply to the parallel sections, and it mostly seems
to. The exception is this context, where it applies to a do loop index
whose name is also used as a regular variable elsewhere in the code.
I believe that if it were used as a do loop index again at the other
locations it would be OK.
Anyway, we'll see what the folks at Intel say.
Thank you for your help.