
Thread lost in OMP Critical area in Intel ifort v12

a_zhaogtisoft_com
Hi,

I have been scratching my head over a very strange problem with Intel ifort v12.0: it seems that the OMP CRITICAL region occasionally misses a thread.

Here is my F90 code (using the SPMD model):

[fortran]program main
  call omp_set_dynamic( .false. )
  call omp_set_nested( .false. )
!$omp parallel
  call compute_pi
!$omp end parallel
  stop
end

subroutine compute_pi
  double precision :: psum, x, w      ! threadprivate
  integer :: me, nimg, i              ! threadprivate
  double precision :: pi, mypi
  integer :: n, j
  common /pin/ pi, n
!*omp threadshared(/pin/)
  integer omp_get_num_threads, omp_get_thread_num

  nimg = omp_get_num_threads()
  me = omp_get_thread_num() + 1
!$omp master
  write(6,*) 'Enter number of intervals'
  read(5,*) n
  write(6,*) 'number of intervals = ', n
!$omp end master
!$omp barrier
  pi = 0.d0
  w = 1.d0/n
  psum = 0.d0

  do i = me, n, nimg
     x = w * (i - 0.5d0)
     psum = psum + 4.d0/(1.d0 + x*x)
  enddo
!$omp barrier
!$omp critical
  pi = pi + (w * psum)
!$omp end critical
!$omp master
  write(6,*) 'computed pi = ', pi
!$omp end master
!$omp barrier
end[/fortran]

When run, the above code is supposed to compute pi. I tried it on my 2-core CPU; most of the time it does the job, but occasionally, instead of 3.1417..., it prints 1.5901...

So I ran it under the debugger and found that, occasionally, at the critical section

!$omp critical
pi = pi + (w * psum)
!$omp end critical

which is supposed to sum up the partial pi values from the different threads, one thread is lost (and therefore half of the pi value goes missing on my dual-core PC).

Is this an Intel compiler bug, or a bug in my code?

BTW, the code is from "A Comparison of Co-Array Fortran and OpenMP Fortran for SPMD Programming", with slight modifications (http://www7300.nrlssc.navy.mil/global_n ... t-2002.pdf).
a_zhaogtisoft_com
It turns out a missing !$omp barrier is causing the issue: without it, the master thread can print pi before the other thread has finished adding its contribution in the critical section. The correct code should be:

[fortran]!$omp barrier
!$omp critical
pi = pi + (w * psum)
!$omp end critical
!$omp barrier
!$omp master
write(6,*) 'computed pi = ',pi
!$omp end master[/fortran]
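
For completeness, here is a sketch of the whole subroutine with that extra barrier in place (I have dropped the unused mypi and j declarations here):

[fortran]subroutine compute_pi
  double precision :: psum, x, w            ! private to each thread
  integer :: me, nimg, i                    ! private to each thread
  double precision :: pi
  integer :: n
  common /pin/ pi, n                        ! shared among the threads
  integer omp_get_num_threads, omp_get_thread_num

  nimg = omp_get_num_threads()
  me   = omp_get_thread_num() + 1

!$omp master
  write(6,*) 'Enter number of intervals'
  read(5,*) n
  write(6,*) 'number of intervals = ', n
!$omp end master
!$omp barrier

  pi   = 0.d0      ! every thread writes the same zero; tidier would be to do this in the master section
  w    = 1.d0/n
  psum = 0.d0
  do i = me, n, nimg
     x    = w * (i - 0.5d0)
     psum = psum + 4.d0/(1.d0 + x*x)
  enddo

!$omp barrier
!$omp critical
  pi = pi + (w * psum)
!$omp end critical
!$omp barrier      ! the barrier that was missing: wait until every thread has added its part
!$omp master
  write(6,*) 'computed pi = ', pi
!$omp end master
end subroutine compute_pi[/fortran]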

robert-reed
Well, I can't assign blame for the failure here to any particular component, but I must say, military source or no, this is pretty ratty Fortran, and it might stress the runtime system. It looks like it was cut and pasted, with some loose ends in the modifications, and it is pretty primitive OpenMP to boot (and it looks like both PSUM and MYPI were intended for the same purpose but only one got used).

Why else would the code enter the parallel section before calling the Pi-computing subroutine (where it has to jump through the "master" hoop to avoid the consequences of that previous action)? And declaring pi in the shared common, then having all the threads individually zero it is probably not the most efficient construction. But why all this critical section stuff when it's so much easier to use a reduction as I did recently in writing this version of OMP Fortran PI:

[fortran]REAL(8) PI,STEP,X
PI = 0.0_8
STEP = 1.0_8 / FLOAT(STEPS)
!$OMP PARALLEL SHARED(STEP)
    !$OMP DO PRIVATE(I,X) REDUCTION(+:PI)
    DO I = 0, STEPS
        X = (FLOAT(I) + 0.5_8) * STEP
        PI = PI + 4.0_8 / (1.0_8 + X * X)
    END DO
    !$OMP END DO
!$OMP END PARALLEL

DPI = PI * STEP[/fortran]
Despite all the loose ends in the code you shared, I haven't looked at it closely enough to determine whether it contains any errors. I have tested the code I share above, though, and it does seem to return the right answer every time I've tried it.
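
In case anyone wants to compile and run that fragment as-is, one way to wrap it in a self-contained program is sketched below; the program name, the STEPS parameter value, the REAL(...,8) conversions, and the 1..STEPS loop bounds of the midpoint rule are choices made for this sketch rather than part of the snippet above.

[fortran]program pi_reduction
  implicit none
  integer, parameter :: steps = 1000000      ! number of intervals (arbitrary choice)
  integer            :: i
  real(8)            :: pi, step, x

  pi   = 0.0_8
  step = 1.0_8 / real(steps, 8)

!$omp parallel shared(step)
!$omp do private(i,x) reduction(+:pi)
  do i = 1, steps
     x  = (real(i,8) - 0.5_8) * step         ! midpoint of the i-th interval
     pi = pi + 4.0_8 / (1.0_8 + x * x)       ! each thread accumulates into its private copy of pi
  end do
!$omp end do
!$omp end parallel

  pi = pi * step
  write(*,*) 'computed pi = ', pi
end program pi_reduction[/fortran]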
a_zhaogtisoft_com
Hi, Robert,

Thanks for the code. The extra declarations of mypi and j are leftovers of mine (from debugging and benchmarking), sorry. But the rest of the code is copied/pasted from the PDF.

The original code from the article was meant to compare Co-Array Fortran and OpenMP for SPMD programming, and it was basically a one-to-one translation of the co-array code. Obviously, every "SYNC ALL" becomes an "!$OMP BARRIER", and so on.

Of course, I agree with you that your loop-level parallel code is much, much better. But at the moment it does not seem feasible for our company's code to move to loop-level parallelization, because we use tens of thousands of module variables (globals), and nobody knows how to make all of these globals work with loop-level parallelism. Because of those difficulties, we have looked into using co-arrays, where most of the data are private by default. Then I came across this paper...
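
For reference, OpenMP's THREADPRIVATE directive can be applied to common blocks and to module variables so that each thread keeps its own copy, with the COPYIN clause seeding those copies from the master thread's values. The sketch below is only a toy illustration, with a made-up module and array rather than anything from a real code base:

[fortran]module globals
  implicit none
  double precision :: work(100)     ! a stand-in for one of the many module globals
  !$omp threadprivate(work)         ! each thread gets its own persistent copy
end module globals

program demo
  use globals
  use omp_lib
  implicit none
  integer :: i

  work = 0.d0                       ! initialise the master thread's copy

!$omp parallel do copyin(work)      ! copyin fills every thread's copy from the master's
  do i = 1, 8
     work(1) = dble(omp_get_thread_num())   ! each thread touches only its own work array
  end do
!$omp end parallel do
end program demo[/fortran]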
