The loop in the attached reproducer has no loop-carried dependencies and no sum reductions, so it should give identical results regardless of the number of threads. But it doesn't.
I tried ifort 17, 18 and 21 and get the same behaviour in all of them. The deviations are small, but they cause my test matrix to fail, and in any case I can't see any good reason for such differences. Can anybody explain, or is this a bug?
I too see a difference, using Intel Parallel Studio XE 2020 in a release build on a Core i7 2600K. There is nothing wrong with your code from what I can see.
I suspect the generated parallel code is either hitting a cache-line issue due to unaligned loads and stores, or applying different floating-point optimizations in different places.
Changing the floating-point optimization options from fast to precise or strict gives consistent results.
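For example, taking the compile line from the reproducer below and only swapping the fp model (both forms are documented ifort options):
ifort -O2 -fp-model precise -xAVX -g -qopenmp t.f90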
I also suspect it may have to do with the order of execution. To rule that out and use the same code for both parallel regions, factor the loop into a contained procedure:
PROGRAM REPRODUCER
  ! Build with: ifort -O2 -fp-model fast=1 -xAVX -g -qopenmp t.f90
  USE OMP_LIB
  IMPLICIT NONE
  INTEGER, PARAMETER :: MHNOD = 100
  REAL, DIMENSION(MHNOD) :: A, SS, ALFA, ROF, ROG, VV0
  REAL, DIMENSION(MHNOD) :: CAUX, CAUXT, DAUX, DAUXT
  REAL    :: TERM1, TERM2
  INTEGER :: K

  ! Initialize
  DO K = 1, MHNOD
    TERM1   = 0.001 * K
    ALFA(K) = 0.5 + TERM1
    A(K)    = 0.2
    SS(K)   = 1.2 + TERM1
    ROF(K)  = 0.7 - TERM1
    ROG(K)  = ROF(K) * 0.001
    VV0(K)  = 1.0
  ENDDO

  ! Run 8 threads
  call OMP_SET_NUM_THREADS(8)
  !!$omp parallel do private(K,TERM1,TERM2)
  !  DO K=2,78
  !    TERM1=SS(K)*ALFA(K)/(1.0+ALFA(K)*(SS(K)-1.0))
  !    TERM2=ROF(K)-ROG(K)
  !    CAUXT(K)=(ROF(K)-TERM1*TERM2)/A(K)
  !    DAUXT(K)=VV0(K)*ALFA(K)*TERM2*(1.0-TERM1)
  !  ENDDO
  call foo(CAUXT, DAUXT)

  ! Run 1 thread
  call OMP_SET_NUM_THREADS(1)
  !!$omp parallel do private(K,TERM1,TERM2)
  !  DO K=2,78
  !    TERM1=SS(K)*ALFA(K)/(1.0+ALFA(K)*(SS(K)-1.0))
  !    TERM2=ROF(K)-ROG(K)
  !    CAUX(K)=(ROF(K)-TERM1*TERM2)/A(K)
  !    DAUX(K)=VV0(K)*ALFA(K)*TERM2*(1.0-TERM1)
  !  ENDDO
  call foo(CAUX, DAUX)

  ! Check results
  DO K = 2, 78
    print '(a,i6,2f14.8)', 'CAUX: ', K, CAUX(K), CAUXT(K)
    IF (CAUX(K) /= CAUXT(K)) THEN
      print *, 'CAUX DIFF: ', K
    ENDIF
  ENDDO

contains

  subroutine foo(Cout, Dout)
    implicit none
    real :: Cout(:), Dout(:)
    !$omp parallel do private(K,TERM1,TERM2)
    DO K = 2, 78
      TERM1   = SS(K)*ALFA(K)/(1.0+ALFA(K)*(SS(K)-1.0))
      TERM2   = ROF(K)-ROG(K)
      Cout(K) = (ROF(K)-TERM1*TERM2)/A(K)
      Dout(K) = VV0(K)*ALFA(K)*TERM2*(1.0-TERM1)
    ENDDO
  end subroutine foo

END PROGRAM
Jim Dempsey
I looked at t.f90 a bit, too.
The results match when I change -O2 to -O0 or -O1.
With -O2, automatic vectorization occurs, and that is when the miscompare appears. When I add -no-vec to that compile, the answers are the same again.
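Putting those observations together as compile lines (same t.f90, varying only the options mentioned above):
ifort -O0 -fp-model fast=1 -xAVX -g -qopenmp t.f90          # results match
ifort -O1 -fp-model fast=1 -xAVX -g -qopenmp t.f90          # results match
ifort -O2 -no-vec -fp-model fast=1 -xAVX -g -qopenmp t.f90  # results match
ifort -O2 -fp-model fast=1 -xAVX -g -qopenmp t.f90          # results differ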
Thanks to both of you for looking into this. It sounds like whenever parallel reproducibility is required, vectorization or aggressive optimization must be disabled. However, I can't see the reason: the loop iterations are perfectly independent of each other, so the order of execution should have no impact.
After experimenting with the number of threads (changing 8 to 2), it appears that the deviations are always at indices that match the prolog or epilog of the vectorized loop. Apparently the scalar instructions in the prolog/epilog compute at somewhat different precision than the vectorized loop body. This doesn't make sense to me. Why should scalar and vectorized instructions not yield exactly the same results?
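To make the prolog/body/epilog picture concrete, here is a minimal sketch in plain Fortran of how DO K=2,78 could be partitioned, assuming an 8-wide AVX vector and a hypothetical alignment peel (the compiler's actual split may differ; the indices are illustrative only):
PROGRAM VEC_SPLIT
  IMPLICIT NONE
  INTEGER, PARAMETER :: VL = 8              ! assumed vector length (8 singles per AVX register)
  INTEGER, PARAMETER :: KSTART = 2, KEND = 78
  INTEGER :: KBODY0, KBODY1
  ! Hypothetical alignment peel: start the vector body at the next index
  ! of the form N*VL + 1 (the real compiler peels based on addresses).
  KBODY0 = ((KSTART - 1 + VL - 1) / VL) * VL + 1
  ! The body covers as many full VL-wide chunks as fit before KEND.
  KBODY1 = KBODY0 + ((KEND - KBODY0 + 1) / VL) * VL - 1
  print *, 'scalar prolog:', KSTART, 'to', KBODY0 - 1
  print *, 'vector body  :', KBODY0, 'to', KBODY1
  print *, 'scalar epilog:', KBODY1 + 1, 'to', KEND
END PROGRAM
With such a split, the scalar prolog (here K=2..8) and epilog (here K=73..78) execute different instruction sequences than the body, which is exactly where the deviations show up.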
Got it. It's the divides. If the divides are replaced by multiplies, there are no deviations.
Looking at the assembly for the divides: the loop body multiplies by a reciprocal, which it computes with the VRCPPS instruction, while the epilog uses VDIVSS. That explains the deviations.
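As an illustration, here is a minimal sketch in plain Fortran (not the compiler's actual code) of why multiplying by a VRCPPS-style reciprocal estimate differs from a true divide; the injected error merely stands in for the hardware's ~12-bit estimate:
PROGRAM RCP_DEMO
  IMPLICIT NONE
  REAL :: x, y, r, q_fast, q_exact
  x = 0.7
  y = 0.2
  ! Stand-in for VRCPPS: an estimate of 1/y with ~2**-12 relative error.
  ! (The compiler may refine it with a Newton-Raphson step, r = r*(2.0-y*r),
  ! but even then the product is not guaranteed to be correctly rounded.)
  r = (1.0 / y) * (1.0 - 2.0**(-12))
  q_fast  = x * r    ! what the vectorized loop body computes
  q_exact = x / y    ! what VDIVSS in the epilog computes
  print *, q_fast, q_exact, q_fast - q_exact
END PROGRAM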
In my opinion this is a shortcoming of the compiler. The loop epilog should use the scalar version of the same divide sequence (VRCPSS, which does exist), just as it does for the adds and multiplies.
>>The loop epilog should use the scalar version of the same divide sequence (VRCPSS), just as it does for the adds and multiplies.
That would be a matter of debate. The -fp-model fast option requests faster code (at the sacrifice of precision). In this case, the few additional VRCPSS's in the epilog may well have saved a few clocks (again at the sacrifice of precision).
Being consistent with the fp model across prolog, body and epilog would reduce the amount of unnecessary bug searching. (Even then you still have the reduction-order issue of horizontal add/sub vs. scalar.)
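To illustrate that reduction-order point with a minimal sketch (values chosen only to make the effect obvious): floating-point addition is not associative, so horizontal/partial-lane sums can legitimately differ from a strict left-to-right scalar sum:
PROGRAM SUM_ORDER
  IMPLICIT NONE
  REAL :: a, b, c
  a =  1.0E8
  b = -1.0E8
  c =  1.0
  print *, (a + b) + c   ! scalar order: 0.0 + 1.0 = 1.0
  print *, a + (b + c)   ! regrouped: c is absorbed into b first, giving 0.0
END PROGRAM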
Jim Dempsey
The issue goes away when compiling with -prec-div, which makes the compiler use VDIV instructions instead of the VRCP reciprocal sequence.
The issue can be reproduced in the same way with square roots, which is no surprise, as they are the same class of FP operation; that variant goes away with -prec-sqrt.
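For reference, adding both options to the compile line from the reproducer (same t.f90) should give matching 1-thread/8-thread results:
ifort -O2 -fp-model fast=1 -prec-div -prec-sqrt -xAVX -g -qopenmp t.f90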
In any case, I expect the compiler to apply the FP model consistently to the loop prolog, body and epilog. Currently it does not, and that can be considered a deficiency.
