OpenMP bug in Intel 11.1.064?

stefanoz · ‎04-28-2011

Hi,

I ran into a problem (a bug?) with the OpenMP code attached below. I tested the code on two different architectures, a dual-socket quad-core Intel Xeon E5450 and a dual-socket quad-core AMD Opteron, and saw the same problem on both.

I tried to keep the test code as simple as possible. I would like to write code that guarantees I can safely enter the last loop (see code) without issuing OpenMP barriers. (...waiting for thread subteams in future versions of OpenMP...)

I use the Intel compiler with OMP_NUM_THREADS=8

[zorro@pordoi212 ~]$ ifort -V

Intel Fortran Intel 64 Compiler Professional for applications running on Intel 64, Version 11.1 Build 20100414 Package ID: l_cprof_p_11.1.072

When compiled with no flag other than -openmp, it doesn't give the right answer (*)

[zorro@pordoi212 ~]$ ifort -openmp testflush.F90 -o tfint

[zorro@pordoi212 ~]$ ./tfint

Test (must be 0): 1

If I add the -C flag (or -g), it works fine

[zorro@pordoi212 ~]$ ifort -openmp -g testflush.F90 -o tfint

[zorro@pordoi212 ~]$ ./tfint

Test (must be 0): 0

[zorro@pordoi212 ~]$ ifort -openmp -C testflush.F90 -o tfint

[zorro@pordoi212 ~]$ ./tfint

Test (must be 0): 0

When compiled with PGI 10.3, the code gives the right answer (same with gcc)

[zorro@pordoi212 ~]$ pgf90 -mp testflush.F90 -o tfpgi

[zorro@pordoi212 ~]$ ./tfpgi

Test (must be 0): 0

If you compile it in your environment, you can see that in the wrong case (*) the times listed in the file fort.102 show that thread 2 (and likewise threads 3 to 7, in fort.103-107) doesn't wait for thread 1 to complete (during the checking loop)

times (write,loop) : 2.348423004150391E-004 9.536743164062500E-007

times (total) : 2.357959747314453E-004

In the other cases, the reported times confirm the right behaviour (threads 2 to 7 wait for thread 1 to complete). An example:

times (write,loop) : 2.3508071899414062E-004 1.000094890594482

times (total) : 1.000329971313477

Some comments in the code add other details to the problem.

Regards,

Stefano

----------------Here is the code------------------------------------

program main

implicit none

include 'omp_lib.h'

logical checkflag

integer i,iam,arraysize,auxsl,nth,error,chunk,k

double precision t(3)

!logical, volatile, allocatable :: checkvarv(:) !same behaviour with volatile attribute

logical, allocatable :: checkvarv(:)

integer, allocatable :: array(:)

arraysize=1000

allocate(checkvarv(arraysize))

checkvarv(:)=.false.

allocate(array(arraysize))

array=0

error=0

!$omp parallel default(none) shared(error,arraysize,nth,checkvarv,array) private(iam,checkflag,auxsl,t,i,k)

iam=OMP_GET_THREAD_NUM()

auxsl=0

checkflag=.false.

!$omp single

nth=OMP_GET_NUM_THREADS()

!$omp end single

t(1)=OMP_GET_WTIME()

! Thread 0 performs some MPI operation....

if(iam.eq.0) call sleep(2)

! Misalign thread 1 with respect to other threads when writing to array

!$omp do schedule(dynamic,1)

do i=1,arraysize

if(iam.eq.1.and.auxsl.eq.0) then

auxsl=1

call sleep(1)

endif

array(i)=iam+1

checkvarv(i)=.true.

enddo

!$omp end do nowait

t(2)=OMP_GET_WTIME()

! Loops until previous write loop is finished

do while(.not.checkflag)

!!$omp flush(checkvarv) ! if you uncomment this line, you obtain the right answer with -openmp flag only.

checkflag=.true.

do k=1,arraysize

checkflag = (checkflag.and.checkvarv(k))

end do

end do

t(3)=OMP_GET_WTIME()

! Can we enter this loop safely?

!$omp do schedule(dynamic,1)

do i=1,arraysize

if(array(i).eq.0) error=1

enddo

!$omp end do nowait

write(100+iam,*) 'times (write,loop) : ',t(2)-t(1),t(3)-t(2)

write(100+iam,*) 'times (total) : ',t(3)-t(1)

!$omp end parallel

write(*,*) "Test (must be 0): ",error

write(200,*) array-1

deallocate(array)

deallocate(checkvarv)

end

--------------------------------------------------------------------------------

jimdempseyatthecove · ‎04-29-2011

Try inserting:

!$omp flush(array)

after

! Can we enter this loop safely

However, instead consider placing

!$omp flush(array)
!$omp flush(checkvarv)

immediately after the

!$omp end do nowait

that fills in the array.

N.B. I specified two separate flush(..) directives in the hope that the flushes occur in the order written. Placing both variables (arrays) in the same directive leaves the order ambiguous. IA32/Intel64 should preserve write ordering; however, compiler optimizations might resequence instructions and/or use streaming writes and/or write merging. To check for these conditions, examine the results of your

write(200,*) array-1

If you see any -1's then a write was not issued or was bunged up by an SSE merge (read, modify (with mask), write).
If you see only values in 0:nth-1 then the issue is with the flush.
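
For readers following along, the flush placement discussed above can be sketched in a smaller program. This is only an illustrative sketch, not the code from this thread: the names flush_sketch, payload, ready, seen and done are hypothetical, it assumes a compiler with OpenMP 3.1 support for atomic read/write (e.g. gfortran -fopenmp, or ifort -openmp on a recent release), and it assumes two threads are actually granted. The producer writes its data, flushes, then sets a flag; the consumer flushes on every pass of its wait loop and once more before reading the data.

```fortran
program flush_sketch
  use omp_lib
  implicit none
  integer :: payload, seen
  logical :: ready, done

  payload = 0
  seen = -1
  ready = .false.

  ! Assumption: the runtime really provides two threads here.
  !$omp parallel num_threads(2) default(none) shared(payload, ready, seen) private(done)
  if (omp_get_thread_num() == 0) then
     payload = 42          ! producer: write the data first
     !$omp flush
     !$omp atomic write
     ready = .true.        ! then publish the flag
     !$omp flush
  else
     done = .false.
     do while (.not. done)
        !$omp flush        ! request a fresh view of memory on each pass
        !$omp atomic read
        done = ready
     end do
     !$omp flush           ! order the data read after the flag read
     seen = payload
  end if
  !$omp end parallel

  write(*,*) 'seen (must be 42): ', seen
end program flush_sketch
```

The point of the sketch is that both sides flush: a flush only on the writing side does not oblige the compiler to re-read the flag inside the wait loop, which matches the observation in the first post that the flush inside the checking loop is what changes the result.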

Jim Dempsey

stefanoz · ‎04-30-2011

Your suggestions don't work. I also put them together in the code, but it still gives me the wrong output. I think the issue is with flush.

jimdempseyatthecove · ‎04-30-2011

On IA32, Intel64 and AMD, FLUSH(var) acts as a compiler directive:

if var is registered then
if register of var modified then
write to memory
end if
disassociate var from register
end if

The write ordering should be preserved.

Since every checkvarv element was set and verified as .true. (indicated by all threads passing the test), all threads must have passed through their slice of

array(i) = iam+1
checkvarv(i) = .true.

and thus this is indicative of one or more of

1) cache coherency system not correctly performing write combining
(I seriously doubt this would be true)

2) the allocatable array isn't aligned to at least integer granularity
(I seriously doubt this would be true)

3) compiler optimizations are batching up writes into a larger GP register or SSE register and then performing a merge (read/modify/write) at the start and/or end of the slice

4) bad code

Can you produce an assembler listing and attach it to a reply in this forum?

Jim Dempsey

jimdempseyatthecove · ‎05-02-2011

Stefano,

Thanks for the assembly listings. I haven't investigated thoroughly (I'm not from Intel), but I see some flow control problems in your testflush1.s listing. Assuming this was generated from the code in your first message, we see:

[cpp]..B1.56:                        # Preds ..B1.44 ..B1.50
        xorl      %eax, %eax                                    #52.12
        call      omp_get_wtime_                                #52.12
                                # LOE r13 r12d r14d xmm0
..B1.131:                       # Preds ..B1.56
        movsd     %xmm0, -128(%rbp)                             #52.12
                                # LOE r13 r12d r14d
..B1.57:                        # Preds ..B1.131
        xorl      %eax, %eax                                    #61.12
        call      omp_get_wtime_                                #61.12
                                # LOE r13 r12d r14d xmm0
..B1.132:                       # Preds ..B1.57
        movsd     %xmm0, -104(%rbp)                             #61.12
[/cpp]

Two of the omp_get_wtime_ function calls are made back-to-back with no intervening code.
This means the code section did not follow the sequence of operations (or an equivalent of the sequence of operations) in the original source code.

The write loop that fills in the array and checkvarv values was sequenced properly.

If the code flow around (between) your loops is incorrect, then it is possible that the test for array-fill completion is sequenced incorrectly.

Again, I wish to state I did not examine the code flow further than the first error.

Additional note:

The two omp_get_wtime_ function calls above store into

-128(%rbp) and -104(%rbp)

the third call stores into -112(%rbp)

These three stores should correlate to

double precision t(3)

The above array should encompass 3 * real(8)

If -128(%rbp) is t(1), then t(2) should be at -120(%rbp), t(3) at -112(%rbp), and t(4) at -104(%rbp).
IOW the array addresses for t are messed up. Now, the compiler optimizations could determine that array t is only used within this program and that each cell referenced is unique (i.e. t(1:3) is not passed outside the program), and then take the liberty of moving the data around; however, the debugger would then have a problem displaying the discontiguous array t.
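
The offset arithmetic above can be reproduced with a short sketch. It assumes, as in the paragraph above, that t(1) sits at -128(%rbp) and that the elements are stored contiguously. The program name t_layout and the variable elem_bytes are hypothetical, and storage_size is a Fortran 2008 intrinsic, so an older compiler may not accept it.

```fortran
program t_layout
  implicit none
  double precision :: t(3)
  integer :: i, elem_bytes

  ! storage_size returns the element size in bits; divide by 8 for bytes.
  elem_bytes = storage_size(t) / 8

  ! Assumption: t(1) is at -128(%rbp), the first store seen in the listing.
  do i = 1, 3
     write(*,'(a,i0,a,i0,a)') 't(', i, ') expected at ', &
          -128 + (i-1)*elem_bytes, '(%rbp)'
  end do
end program t_layout
```

With 8-byte elements, none of the three contiguous slots lands on -104(%rbp), which is the mismatch pointed out above.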

Jim Dempsey