Problems with openMP 4.0

Mark_K_1 · ‎11-22-2016

I recently upgraded from an old version of the intel fortran compiler (11.1), to a newer version (16.0.3). Unfortunately, my software will not run correctly when compiled in the new ifort using openMP. Everything is fine when compiling in serial, and the two versions of the compiler produce exactly the same result. An example of a parallel loop that seems to cause problems:

!$omp PARALLEL DO
            DO 2700 I=1,IZBINS-1
            DO 2700 J=1,IRBINS-1
C
C       RATE OF CHANGE OF MOMENTUM
C
C       radial momentum
C
              VTR(J,I,K)=VTRO(J,I,K)+DT*(
C
C       positive momentum gain from the left (0 at j=1)
C
     +          AOVFLUX2(J)*
     +          (ABS(VR(J-1,I,K))+VR(J-1,I,K))
     +          *VTRO(J-1,I,K)
C
C       negative momentum gain from right
C
     +          +AOVFLUX1(J)*
     +          (ABS(VR(J+1,I,K))-VR(J+1,I,K))
     +           *VTRO(J+1,I,K)
C
C       loss of positive momentum to the right
C
     +          -AOVFLUX1(J)*
     +          (ABS(VR(J,I,K))+VR(J,I,K))
     +           *VTRO(J,I,K)
C
C       loss of negative momentum to the left (0 at j=1)
C
     +          -AOVFLUX2(J)*
     +          (ABS(VR(J,I,K))-VR(J,I,K))
     +           *VTRO(J,I,K)
C
C       gain of positive or negative momentum from the bottom
C
     +          +1./DBLE(DZICP(I))*(
     +          (ABS(VTZN(J,I-1,K))+VTZN(J,I-1,K))
     +           *VTRO(J,I-1,K))
C
C       gain of positive or negative momentum from the top
C
     +          +1./DBLE(DZICP(I))*(
     +          (ABS(VTZN(J,I,K))-VTZN(J,I,K))
     +          *VTRO(J,I+1,K))
C
C       loss of positive momentum from the cell center through the top
C
     +          -1./DBLE(DZICP(I))*(
     +          (ABS(VTZN(J,I,K))+VTZN(J,I,K))
     +           *VTRO(J,I,K))
C
C       loss of positive momentum from the cell center through the bottom
C
     +          -1./DBLE(DZICP(I))*(
     +          (ABS(VTZN(J,I-1,K))-VTZN(J,I-1,K))
     +           *VTRO(J,I,K))
     +          )*0.5D0
C
C       pressure term: -grad(nkt/m)=D*colf*grad(n)
C
C     +       -8.2635E7/XMASS(K)*DT*
C     +       (TMPION(J+1,I,K)*(XN(J+1,I,K)+XNO(J+1,I,K))*0.5D0 -
C     +        TMPION(J,I,K)*(XN(J,I,K)+XNO(J,I,K))*0.5D0)/DBLE(DRICP(J))
     +       -8.2635E7/XMASS(K)*DT*0.5D0*
     +        ((XNPRES(J+1,I,K)+XNPRESO(J+1,I,K)) -
     +        (XNPRES(J,I,K)+XNPRESO(J,I,K)) )/DBLE(DRICP(J))
C
C       momentum imparted by field
C
     +          +FLOAT(ICHG(K))*DT*ER(J,I)*XNR(J,I,K)/XMASS(K)*9.5E11
C
 2700       CONTINUE
!$omp END PARALLEL DO

Essentially, it is just one single calculation for each mesh point. I don't see any data dependency, or any reason at all why openMP would have trouble with this loop. I am compiling with the options:

-qopenmp -c -O3 -mcmodel=large -shared-intel -save -vms -cpp -w -fpp

And linking with the options:

-qopenmp -o -mcmodel=large -shared-intel -ldl  \
-liomp5 -lpthread -lm -p -fpp

This loop is only one example. Many of the loops are causing problems. If I comment out all of the omp statements, and compile with openMP flags, things work fine. But when I start adding openMP back to loop, several cause problems.

Can anybody else see why the older version of the compiler would result in different behavior than the new compiler for this code?

Thanks,

Chad

jimdempseyatthecove · ‎11-22-2016

Use: !$OMP PARALLEL DO PRIVATE(J)

Jim Dempsey

jimdempseyatthecove · ‎11-22-2016

You may also want to experiment with

!$OMP PARALLEL DO COLLAPSE(2)

Jim Dempsey

jimdempseyatthecove · ‎11-22-2016

Also, avoid using single precision literals in double precision expressions

change +1./DBLE(DZICP(I))*(

to +1.0D0/DBLE(DZICP(I))*(

-8.2635E7 to -8.2635D7

It is unknown as to if VTR and other arrays are single or double. Using mixed precision hinders optimal vectorization.

Jim Dempsey

Mark_K_1 · ‎11-22-2016

Thanks for the suggestion, but unfortunately, that didn't do it. I have also tried to be even more explicit:

!$OMP parallel do default(shared) private(I,J)

But I still does not give the same results as the previous compiler (or when compiled in serial).

Mark_K_1 · ‎11-22-2016

Is there any difference in the behavior of mixing single and double precision when compiling with openMP as opposed to without? Could that possibly explain the difference? I don't expect my test case to be very stiff, and when I turn on openMP the answer isn't just not-quite-right, it won't even converge.

jimdempseyatthecove · ‎11-22-2016

>>But I still does not give the same results as the previous compiler (or when compiled in serial).

Multi-threaded code will not necessarily produce the same output as serial code (or between runs, or amongst different versions of the compiler). This is due to different subsets of the calculated data accumulating different sets (and order) of roundoff errors.

The above loop though should not show a difference in accumulation of roundoff errors as each element is individually calculated. The results could differ depending upon the "old" code using the 8087 FPU instructions and the newer compiled code using SSE/AVX instructions. Also there are switches that will effect if the floating point flushes to zero or to sub-normals.

I suspect you may also have a parallel DO that performs a reduction and that is susceptible to differing accumulation of roundoffs depending on partitioning for multi-threading.

Lack of convergence is typically due to constant values determined under one circumstance (CPU, FPU, compiler) and run under a different circumstance. The better route is to use some small multiplier of a calculated deviation.

Assume you want to converge when the difference in a result is less than ~2.5 least significant bits.

delta = A - B
if(ABS(delta) .LT. (SPACING(MAX(ABS(A), ABS(B)) * 5)) EXIT

Jim Dempsey