I recently upgraded from an old version of the Intel Fortran compiler (11.1) to a newer version (16.0.3). Unfortunately, my software does not run correctly when compiled with the new ifort using OpenMP. Everything is fine when compiling in serial, and in serial the two compiler versions produce exactly the same results. An example of a parallel loop that seems to cause problems:
!$omp PARALLEL DO
      DO 2700 I=1,IZBINS-1
      DO 2700 J=1,IRBINS-1
C
C RATE OF CHANGE OF MOMENTUM
C
C radial momentum
C
      VTR(J,I,K)=VTRO(J,I,K)+DT*(
C
C positive momentum gain from the left (0 at j=1)
C
     + AOVFLUX2(J)*
     + (ABS(VR(J-1,I,K))+VR(J-1,I,K))
     + *VTRO(J-1,I,K)
C
C negative momentum gain from right
C
     + +AOVFLUX1(J)*
     + (ABS(VR(J+1,I,K))-VR(J+1,I,K))
     + *VTRO(J+1,I,K)
C
C loss of positive momentum to the right
C
     + -AOVFLUX1(J)*
     + (ABS(VR(J,I,K))+VR(J,I,K))
     + *VTRO(J,I,K)
C
C loss of negative momentum to the left (0 at j=1)
C
     + -AOVFLUX2(J)*
     + (ABS(VR(J,I,K))-VR(J,I,K))
     + *VTRO(J,I,K)
C
C gain of positive or negative momentum from the bottom
C
     + +1./DBLE(DZICP(I))*(
     + (ABS(VTZN(J,I-1,K))+VTZN(J,I-1,K))
     + *VTRO(J,I-1,K))
C
C gain of positive or negative momentum from the top
C
     + +1./DBLE(DZICP(I))*(
     + (ABS(VTZN(J,I,K))-VTZN(J,I,K))
     + *VTRO(J,I+1,K))
C
C loss of positive momentum from the cell center through the top
C
     + -1./DBLE(DZICP(I))*(
     + (ABS(VTZN(J,I,K))+VTZN(J,I,K))
     + *VTRO(J,I,K))
C
C loss of positive momentum from the cell center through the bottom
C
     + -1./DBLE(DZICP(I))*(
     + (ABS(VTZN(J,I-1,K))-VTZN(J,I-1,K))
     + *VTRO(J,I,K))
     + )*0.5D0
C
C pressure term: -grad(nkt/m)=D*colf*grad(n)
C
C    + -8.2635E7/XMASS(K)*DT*
C    + (TMPION(J+1,I,K)*(XN(J+1,I,K)+XNO(J+1,I,K))*0.5D0 -
C    + TMPION(J,I,K)*(XN(J,I,K)+XNO(J,I,K))*0.5D0)/DBLE(DRICP(J))
     + -8.2635E7/XMASS(K)*DT*0.5D0*
     + ((XNPRES(J+1,I,K)+XNPRESO(J+1,I,K)) -
     + (XNPRES(J,I,K)+XNPRESO(J,I,K)) )/DBLE(DRICP(J))
C
C momentum imparted by field
C
     + +FLOAT(ICHG(K))*DT*ER(J,I)*XNR(J,I,K)/XMASS(K)*9.5E11
C
 2700 CONTINUE
!$omp END PARALLEL DO
Essentially, it is just one independent calculation for each mesh point. I don't see any data dependency, or any reason at all why OpenMP would have trouble with this loop. I am compiling with the options:
-qopenmp -c -O3 -mcmodel=large -shared-intel -save -vms -cpp -w -fpp
And linking with the options:
-qopenmp -o -mcmodel=large -shared-intel -ldl \
  -liomp5 -lpthread -lm -p -fpp
This loop is only one example; many of the loops are causing problems. If I comment out all of the OMP directives and compile with the OpenMP flags, things work fine. But when I start adding OpenMP back to the loops, several of them cause problems.
Can anybody else see why the older version of the compiler would result in different behavior than the new compiler for this code?
Also, avoid using single-precision literals in double-precision expressions; for example, change

-8.2635E7 to -8.2635D7

It is unknown whether VTR and the other arrays are single or double precision. Mixed precision also hinders optimal vectorization.
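The effect is easy to demonstrate in isolation (a minimal standalone sketch, not from the code above; 0.1 is used because it is not exactly representable in binary floating point):

```fortran
      PROGRAM LITERALS
      IMPLICIT NONE
      DOUBLE PRECISION XS, XD
C A single-precision literal is rounded to ~7 significant digits
C first, then promoted to double precision in the assignment.
      XS = 0.1E0
C A D exponent makes the literal double precision from the start.
      XD = 0.1D0
      WRITE(*,'(A,ES25.17)') ' single literal: ', XS
      WRITE(*,'(A,ES25.17)') ' double literal: ', XD
      WRITE(*,'(A,ES12.4)')  ' difference:     ', XS-XD
      END
```

The printed difference is nonzero: the low bits of XS carry the single-precision rounding error through every subsequent double-precision operation.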
Thanks for the suggestion, but unfortunately, that didn't do it. I have also tried to be even more explicit:
!$OMP parallel do default(shared) private(I,J)
But it still does not give the same results as the previous compiler (or as the serial build).
Is there any difference in the behavior of mixed single and double precision when compiling with OpenMP as opposed to without? Could that possibly explain the difference? I don't expect my test case to be very stiff, and when I turn on OpenMP the answer isn't just slightly off; it won't even converge.
>>But I still does not give the same results as the previous compiler (or when compiled in serial).
Multi-threaded code will not necessarily produce the same output as serial code (or between runs, or amongst different versions of the compiler). This is due to different subsets of the calculated data accumulating different sets (and order) of roundoff errors.
The above loop, though, should not show a difference in accumulation of roundoff errors, as each element is calculated independently. The results could still differ if the "old" code used the x87 (8087-style) FPU instructions while the newer compiler generates SSE/AVX instructions. There are also switches that affect whether floating-point underflow flushes to zero or produces subnormals.
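For example, these ifort options constrain those behaviors (flag names as documented for recent ifort releases; verify against your version before relying on them):

```
# value-safe floating-point optimizations, reproducible results
-fp-model precise
# do not flush subnormals (denormals) to zero
-no-ftz
# consistent math-library results across processor types
-fimf-arch-consistency=true
```

Building both the serial and OpenMP versions with these flags removes one class of differences before you go hunting for true threading bugs.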
I suspect you may also have a parallel DO that performs a reduction and that is susceptible to differing accumulation of roundoffs depending on partitioning for multi-threading.
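For example, a sum reduction like the sketch below (hypothetical array X and count N, not taken from your code) gives each thread a private partial sum that is combined at the end, so the order of additions, and hence the roundoff, changes with the number of threads and the partitioning:

```fortran
      DOUBLE PRECISION TOTAL, X(1000)
      INTEGER I, N
      N = 1000
      DO 100 I = 1, N
         X(I) = 1.0D0/DBLE(I)
  100 CONTINUE
      TOTAL = 0.0D0
C Each thread accumulates into a private copy of TOTAL over its
C chunk of the iteration space; the partials are combined at the end.
!$omp PARALLEL DO REDUCTION(+:TOTAL)
      DO 200 I = 1, N
         TOTAL = TOTAL + X(I)
  200 CONTINUE
!$omp END PARALLEL DO
```

Run with 1, 2, and 4 threads, TOTAL can differ in the last few bits even though every run is "correct" to within roundoff.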
Lack of convergence is typically due to constant tolerance values determined under one circumstance (CPU, FPU, compiler) being used under a different circumstance. The better route is to test against a small multiplier of a calculated deviation.
Assume you want to converge when the difference in a result is less than ~2.5 least significant bits.
delta = A - B
IF (ABS(delta) .LT. SPACING(MAX(ABS(A), ABS(B)) * 5)) EXIT
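Put together, a convergence loop using that test might look like the sketch below (the fixed-point iteration x = COS(x) is a stand-in for whatever quantity the real solver is converging):

```fortran
      PROGRAM CONVTEST
      IMPLICIT NONE
      DOUBLE PRECISION A, B, DELTA
      INTEGER ITER
      A = 1.0D0
      DO ITER = 1, 1000
         B = A
C Stand-in solver step: the fixed-point iteration x = COS(x)
         A = COS(B)
         DELTA = A - B
C Converged when the change is within a few least significant
C bits of the operands, not a fixed absolute tolerance, so the
C test adapts to whatever precision the hardware delivers.
         IF (ABS(DELTA) .LT. SPACING(MAX(ABS(A), ABS(B)) * 5)) EXIT
      END DO
      WRITE(*,*) 'converged after', ITER, 'iterations; x =', A
      END
```

Because SPACING scales with the magnitude of the operands, the same test works unchanged whether the build rounds via x87, SSE, or AVX paths.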