The loop in the attached reproducer has no loop-carried dependencies and no sum reductions, so it should give identical results regardless of the number of threads. But it doesn't.
I tried ifort 17, 18 and 21 and get the same behaviour in all of them. The deviations are small, but they cause my test matrix to fail, and in any case I can't see any good reason for such differences. Can anybody explain, or is this a bug?
I too see a difference, using Intel Parallel Studio XE 2020 in a release build on a Core i7 2600K. There is nothing wrong with your code from what I can see.
I suspect the generated parallel code is either hitting a cache-line issue due to unaligned loads and stores, or applying different floating-point optimizations in different places.
Changing the floating-point optimization options from fast to precise or strict gives consistent results.
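For example, taking the compile line from the reproducer below and only swapping the fp model (both forms are documented ifort options):
ifort -O2 -fp-model precise -xAVX -g -qopenmp t.f90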
I also suspect it may have to do with the order of execution. To rule that out and use the same code for both parallel regions, factor the loop into a contained procedure:
PROGRAM REPRODUCER
  ! Build with: ifort -O2 -fp-model fast=1 -xAVX -g -qopenmp t.f90
  USE OMP_LIB
  IMPLICIT NONE
  INTEGER, PARAMETER :: MHNOD = 100
  REAL, DIMENSION(MHNOD) :: A, SS, ALFA, ROF, ROG, VV0
  REAL, DIMENSION(MHNOD) :: CAUX, CAUXT, DAUX, DAUXT
  REAL    :: TERM1, TERM2
  INTEGER :: K

  ! Initialize
  DO K = 1, MHNOD
    TERM1   = 0.001 * K
    ALFA(K) = 0.5 + TERM1
    A(K)    = 0.2
    SS(K)   = 1.2 + TERM1
    ROF(K)  = 0.7 - TERM1
    ROG(K)  = ROF(K) * 0.001
    VV0(K)  = 1.0
  ENDDO

  ! Run 8 threads
  call OMP_SET_NUM_THREADS(8)
  !!$omp parallel do private(K,TERM1,TERM2)
  !  DO K=2,78
  !    TERM1=SS(K)*ALFA(K)/(1.0+ALFA(K)*(SS(K)-1.0))
  !    TERM2=ROF(K)-ROG(K)
  !    CAUXT(K)=(ROF(K)-TERM1*TERM2)/A(K)
  !    DAUXT(K)=VV0(K)*ALFA(K)*TERM2*(1.0-TERM1)
  !  ENDDO
  call foo(CAUXT, DAUXT)

  ! Run 1 thread
  call OMP_SET_NUM_THREADS(1)
  !!$omp parallel do private(K,TERM1,TERM2)
  !  DO K=2,78
  !    TERM1=SS(K)*ALFA(K)/(1.0+ALFA(K)*(SS(K)-1.0))
  !    TERM2=ROF(K)-ROG(K)
  !    CAUX(K)=(ROF(K)-TERM1*TERM2)/A(K)
  !    DAUX(K)=VV0(K)*ALFA(K)*TERM2*(1.0-TERM1)
  !  ENDDO
  call foo(CAUX, DAUX)

  ! Check results
  DO K = 2, 78
    print '(a,i6,2f14.8)', 'CAUX: ', K, CAUX(K), CAUXT(K)
    IF (CAUX(K) /= CAUXT(K)) THEN
      print *, 'CAUX DIFF: ', K
    ENDIF
  ENDDO

contains

  subroutine foo(Cout, Dout)
    implicit none
    real :: Cout(:), Dout(:)
    !$omp parallel do private(K,TERM1,TERM2)
    DO K = 2, 78
      TERM1   = SS(K)*ALFA(K)/(1.0+ALFA(K)*(SS(K)-1.0))
      TERM2   = ROF(K)-ROG(K)
      Cout(K) = (ROF(K)-TERM1*TERM2)/A(K)
      Dout(K) = VV0(K)*ALFA(K)*TERM2*(1.0-TERM1)
    ENDDO
  end subroutine foo

END PROGRAM
Jim Dempsey
I looked at t.f90 a bit, too.
The results match when I change -O2 to -O0 or -O1.
With -O2, automatic vectorization occurs, and that is when the miscompare appears. When I add -no-vec to that compile, the answers are the same again.
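Putting those observations together as compile lines (same t.f90, varying only the options mentioned above):
ifort -O0 -fp-model fast=1 -xAVX -g -qopenmp t.f90          # results match
ifort -O1 -fp-model fast=1 -xAVX -g -qopenmp t.f90          # results match
ifort -O2 -no-vec -fp-model fast=1 -xAVX -g -qopenmp t.f90  # results match
ifort -O2 -fp-model fast=1 -xAVX -g -qopenmp t.f90          # results differ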
Thanks to both of you for looking into this. It sounds like whenever parallel reproducibility is required, vectorization or aggressive optimization must be disabled. However, I can't see the reason: the loop iterations are perfectly independent of each other, so the order of execution should have no impact.
After experimenting with the number of threads (changing 8 to 2), it appears that the deviations are always at indices that match the prolog or epilog of the vectorized loop. Apparently the scalar instructions in the prolog/epilog compute at somewhat different precision than the vectorized loop body. This doesn't make sense to me. Why should scalar and vectorized instructions not yield exactly the same results?
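To make the prolog/body/epilog picture concrete, here is a minimal sketch in plain Fortran of how DO K=2,78 could be partitioned, assuming an 8-wide AVX vector and a hypothetical alignment peel (the compiler's actual split may differ; the indices are illustrative only):
PROGRAM VEC_SPLIT
  IMPLICIT NONE
  INTEGER, PARAMETER :: VL = 8              ! assumed vector length (8 singles per AVX register)
  INTEGER, PARAMETER :: KSTART = 2, KEND = 78
  INTEGER :: KBODY0, KBODY1
  ! Hypothetical alignment peel: start the vector body at the next index
  ! of the form N*VL + 1 (the real compiler peels based on addresses).
  KBODY0 = ((KSTART - 1 + VL - 1) / VL) * VL + 1
  ! The body covers as many full VL-wide chunks as fit before KEND.
  KBODY1 = KBODY0 + ((KEND - KBODY0 + 1) / VL) * VL - 1
  print *, 'scalar prolog:', KSTART, 'to', KBODY0 - 1
  print *, 'vector body  :', KBODY0, 'to', KBODY1
  print *, 'scalar epilog:', KBODY1 + 1, 'to', KEND
END PROGRAM
With such a split, the scalar prolog (here K=2..8) and epilog (here K=73..78) execute different instruction sequences than the body, which is exactly where the deviations show up.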
Got it. It's the divides. If the divides are replaced by multiplies, there are no deviations.
Looking at the assembly for the divides: the loop body multiplies by a reciprocal, which it computes with the VRCPPS instruction, while the epilog uses VDIVSS. That explains the deviations.
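As an illustration, here is a minimal sketch in plain Fortran (not the compiler's actual code) of why multiplying by a VRCPPS-style reciprocal estimate differs from a true divide; the injected error merely stands in for the hardware's ~12-bit estimate:
PROGRAM RCP_DEMO
  IMPLICIT NONE
  REAL :: x, y, r, q_fast, q_exact
  x = 0.7
  y = 0.2
  ! Stand-in for VRCPPS: an estimate of 1/y with ~2**-12 relative error.
  ! (The compiler may refine it with a Newton-Raphson step, r = r*(2.0-y*r),
  ! but even then the product is not guaranteed to be correctly rounded.)
  r = (1.0 / y) * (1.0 - 2.0**(-12))
  q_fast  = x * r    ! what the vectorized loop body computes
  q_exact = x / y    ! what VDIVSS in the epilog computes
  print *, q_fast, q_exact, q_fast - q_exact
END PROGRAM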
In my opinion this is a shortcoming of the compiler. The loop epilog should use the scalar version of the same divide sequence (VRCPSS, which does exist), just as it does for the adds and multiplies.
>>The loop epilog should use the scalar version of the same divide sequence (VRCPSS), just as it does for the adds and multiplies.
That would be a matter of debate. The -fp-model fast option requests faster code (at the sacrifice of precision). In this case, the few additional VRCPSS's in the epilog may well have saved a few clocks (again at the sacrifice of precision).
Being consistent with the fp model across prolog, body and epilog would reduce the amount of unnecessary bug searching. (Even then you still have the reduction-order issue of horizontal add/sub vs. scalar.)
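To illustrate that reduction-order point with a minimal sketch (values chosen only to make the effect obvious): floating-point addition is not associative, so horizontal/partial-lane sums can legitimately differ from a strict left-to-right scalar sum:
PROGRAM SUM_ORDER
  IMPLICIT NONE
  REAL :: a, b, c
  a =  1.0E8
  b = -1.0E8
  c =  1.0
  print *, (a + b) + c   ! scalar order: 0.0 + 1.0 = 1.0
  print *, a + (b + c)   ! regrouped: c is absorbed into b first, giving 0.0
END PROGRAM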
Jim Dempsey
The issue goes away when compiling with -prec-div, which makes the compiler use VDIV instructions instead of the VRCP reciprocal sequence.
The issue can be reproduced in the same way with square roots, which is no surprise, as they are the same class of FP operation; that variant goes away with -prec-sqrt.
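For reference, adding both options to the compile line from the reproducer (same t.f90) should give matching 1-thread/8-thread results:
ifort -O2 -fp-model fast=1 -prec-div -prec-sqrt -xAVX -g -qopenmp t.f90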
In any case, I expect the compiler to apply the FP model consistently to the loop prolog, body and epilog. Currently it does not, and that can be considered a deficiency.
