Surprising rounding behavior with FLOOR() with code optimized for AVX2

Ryan_S_3 · ‎04-06-2017

We recently upgraded to ifort version 17.0.2, and a piece of code that had worked for decades suddenly broke. I traced the problem to a call to FLOOR() with a REAL*4, and was surprised at what I found.

FLOOR(), by definition, should return the greatest integer less than or equal to its argument (http://software.intel.com/en-us/node/679278). Thus, (x-FLOOR(x)) should never be negative. But I have an example that violates this. Here is a very simple subroutine:

SUBROUTINE floortst(x,n)
  REAL*4 :: x, y
  INTEGER*4 :: n

  y = x*n - FLOOR(x*n)
  IF (y < 0.e0) THEN
     PRINT '("v-FLOOR(v) is negative!")'
  ELSE IF (y > 1.e0) THEN
     PRINT '("v-FLOOR(v) is >1!")'
  ELSE
     PRINT '("v-FLOOR(v) lies within [0,1]")'
  END IF
  RETURN

END SUBROUTINE floortst

It should always print "v-FLOOR(v) lies within [0,1]". However, when I compile this code with "-O1 -xAVX2" using ifort 17.0.2, and call the routine with arguments n=400000 and x=0.3496874869e0, it prints "v-FLOOR(v) is negative!". I get the expected behavior if I remove -xAVX2 or change -O1 to -O0, so the issue appears to be an AVX2-specific optimization. (I am running the code on an i7-7700.)

To be clear, I understand that the optimizer can make some modifications that impact floating point accuracy, but an optimization that causes x-FLOOR(x) to be negative seems spectacularly unsafe. Am I missing something, or is this really the expected behavior?

By the way, the above code works as expected for all versions of ifort and GCC that I have access to, except for ifort 17.0.2.

Xiaoping_D_Intel · ‎04-06-2017

I can't reproduce the problem using your code. It always printed out "v-FLOOR(v) lies within [0,1]". Can you provide a complete case to show how the subroutine is called?

BTW, for AVX2 target the option is "-xCORE-AVX2".

Thanks,

Xiaoping Duan

Intel Customer Support

jimdempseyatthecove · ‎04-07-2017

Out of curiosity, print out the HEX code of Y (as well as the F15.12 value).

Are you aware that there is s -0.0?

Jim

Johannes_Rieke · ‎04-07-2017

I can reproduce the unexpected results with PSXE 2017u2 (on Windows with Xeon X5-1650v3):

module mod_floortst
  implicit none
  
  private
  public :: floortst, floortst2
  
  contains
  
  SUBROUTINE floortst(x,n)
      implicit none
	  REAL(4)   , intent(in) :: x
	  INTEGER(4), intent(in) :: n
      real(4)                :: y
      
	  y = x*n - FLOOR(x*n)
	  IF (y < 0.e0) THEN
	     PRINT '("v-FLOOR(v) is negative!")'
	  ELSE IF (y > 1.e0) THEN
	     PRINT '("v-FLOOR(v) is >1!")'
	  ELSE
	     PRINT '("v-FLOOR(v) lies within [0,1]")'
	  END IF
	  RETURN
  END SUBROUTINE floortst
  
  
  SUBROUTINE floortst2(x,n)
      implicit none
	  REAL(4)   , intent(in) :: x
	  INTEGER(4), intent(in) :: n
      real(4)                :: y
      
	  y = x*real(n,4) - FLOOR(x*real(n,4))
	  IF (y < 0.e0) THEN
	     PRINT '("v-FLOOR(v) is negative!")'
	  ELSE IF (y > 1.e0) THEN
	     PRINT '("v-FLOOR(v) is >1!")'
	  ELSE
	     PRINT '("v-FLOOR(v) lies within [0,1]")'
	  END IF
	  RETURN
  END SUBROUTINE floortst2
  
end module mod_floortst

program floor_01
  use mod_floortst, only : floortst, floortst2
  implicit none

  ! Variables
  real(4)    :: x
  integer(4) :: n

  ! Body of floor_01
  x = 0.3496874869e0
  n = 400000
  write(*,'("x = ",1f14.10," kind of x = ",i4)') x, kind(x)
  write(*,'("n = ",1i14   )') n
  
  write(*,'("floortest original:")')
  call floortst(x,n)
  
  
  write(*,'("floortest explicit type conversion:")')
  call floortst2(x,n)

end program floor_01

Build log:

Compiling with Intel(R) Visual Fortran Compiler 17.0.2.187 [IA-32]...
ifort /nologo /O3 /QxCORE-AVX2 /module:"Release\\" /object:"Release\\" /Fd"Release\vc140.pdb" /libs:dll /threads /c /Qlocation,link,"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\\bin" /Qm32 "D:\02_Fortran\99_test\floor_01\mod_floortst.f90"
ifort /nologo /O3 /QxCORE-AVX2 /module:"Release\\" /object:"Release\\" /Fd"Release\vc140.pdb" /libs:dll /threads /c /Qlocation,link,"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\\bin" /Qm32 "D:\02_Fortran\99_test\floor_01\floor_01.f90"
Linking...
Link /OUT:"Release\floor_01.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"Release\floor_01.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /SUBSYSTEM:CONSOLE /IMPLIB:"D:\02_Fortran\99_test\floor_01\Release\floor_01.lib" -qm32 "Release\mod_floortst.obj" "Release\floor_01.obj"
Embedding manifest...
mt.exe /nologo /outputresource:"D:\02_Fortran\99_test\floor_01\Release\floor_01.exe;#1" /manifest "Release\floor_01.exe.intermediate.manifest"

floor_01 - 0 error(s), 0 warning(s)

The results:

x =   0.3496874869 kind of x =    4
n =         400000
floortest original:
v-FLOOR(v) is negative!
floortest explicit type conversion:
v-FLOOR(v) is negative!

Johannes_Rieke · ‎04-07-2017

Sorry, forgot to say. Works in Debug mode (-Od).

edit: -O3 -xCORE-AVX2 with print output of y in case of failure:

x =   0.3496874869 kind of x =    4
n =         400000
floortest original:
v-FLOOR(v) is negative!
y =  -0.0052452087 kind of y =    4
floortest explicit type conversion:
v-FLOOR(v) is negative!

So, no -0.0 case, Jim, if I understood you correctly.

regards, Johannes

Johannes_Rieke · ‎04-07-2017

using -fp:source and -fp-speculation=safe avoids the issue. Debug and O3/AVX2 deliver the same - expected - results. Vectorizing is deactivated by this, isn't it?

Is this expected to happen, if you use fast math and fast floating point speculation?

Ryan_S_3 · ‎04-07-2017

Mr. Duan: Johannes beat me to it, but here is a complete working example.

Info on my machine and compiler:

src$ uname -a
Linux orpheus 4.4.0-53-generic #74-Ubuntu SMP Fri Dec 2 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
src$ 
src$ head -19 /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 158
model name	: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
stepping	: 9
microcode	: 0x48
cpu MHz		: 800.000
cache size	: 8192 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
src$ 
src$ ifort -v
ifort version 17.0.2

Source code:

src$ cat floortst.f90
SUBROUTINE floortst(x,n)
  REAL*4 :: x, y
  INTEGER*4 :: n, iy

  y = x*n
  iy = FLOOR(y)
  y = y - iy
  IF (y < 0.e0) THEN
     PRINT '("v-FLOOR(v) is negative! y=",E17.10)', y
  ELSE IF (y > 1.e0) THEN
     PRINT '("v-FLOOR(v) is >1! y=",E17.10)', y
  ELSE
     PRINT '("v-FLOOR(v) lies within [0,1], y=",E17.10)', y
  END IF
  RETURN

END SUBROUTINE floortst

PROGRAM drv
  INTEGER*4 :: n
  REAL*4    :: x
  n = 400000
  x = 0.3496874869e0
  CALL floortst(x,n)
END PROGRAM drv

Compile and run:

src$ ifort -O1 -xCORE-AVX2 -o floortst floortst.f90
src$ ./floortst 
v-FLOOR(v) is negative! y=-0.5245208740E-02

Mr. Dempsey: As you can see, y is not -0.e0. Even if it was, I am not sure why that would be relevant, since IEEE754 states that -0.e0 and +0.e0 should compare as equal. But perhaps I am missing your point?

TimP · ‎04-07-2017

On my ifort installation, when you build with fma enabled, the first product is rounded, while the second one is not, as it is internal to the fma. I believe the preferred gfortran style option to disable fma is -mno-fma (Intel Windows option -Qfma-). Evidently, that is implied by several other options you tried. The use of that option is more common with gfortran as there are more bad effects of fma in that compiler.

You don't have any vectorization in your example, unless you may have in-lined it into code you didn't show. fma applies to both vector and serial code; in fact, it may have more effect on performance in the absence of vectorization.

TimP · ‎04-07-2017

I believe the preferred gfortran style option to disable fma is -mno-fma. Evidently, it is implied by some other options you tried. With fma enabled, my ifort is rounding the first product and subtracting the second product without rounding.

fma applies to both non-vector code such as you show here and to vector code. It may have more effect on performance in the absence of vectorization.

TimP · ‎04-07-2017

I think the preferred gfortran style option to disable fma is -mno-fma. With fma enabled, as it normally is for AVX2, my ifort rounds the first product and subtracts the second un-rounded product. Evidently, some other options you tried are disabling fma as well.

I suppose speculation settings might affect whether the compiler chooses to recognize the 2 products as common subexpressions and suppress the fma even with optimization on. It does seem in this case that the simple product could just as well be used throughout rather than using such convoluted code.

This would apply to non-vectorizable code such as you show as well as vectorizable code. There may be more effect on performance in the non-vector case.

Ryan_S_3 · ‎04-07-2017

Tim P.: See the complete working example I posted on 4/7/17. In that case, it is clear that there is only one product being computed, yet the problem still occurs. Here is the relevant excerpt:

  y = x*n
  iy = FLOOR(y)
  y = y - iy

I am not sure how I can be any more explicit in expressing my intent to the compiler. Whatever rounding mode ifort wants to use when computing y is fine with me, as long as it is consistent.

Xiaoping_D_Intel · ‎04-09-2017

Ryan S. wrote:

Tim P.: See the complete working example I posted on 4/7/17. In that case, it is clear that there is only one product being computed, yet the problem still occurs. Here is the relevant excerpt:
  y = x*n
  iy = FLOOR(y)
  y = y - iy
I am not sure how I can be any more explicit in expressing my intent to the compiler. Whatever rounding mode ifort wants to use when computing y is fine with me, as long as it is consistent.

In this case FMA is still generated by compiler at O1 as below:

y = x*n- iy

For O2 and above the compiler will inline the function call to compute the product at compile time and not generate FMA.

Thanks,

Xiaoping Duan

Intel Customer Support

Johannes_Rieke · ‎04-10-2017

Hi Ryan,

I'm just a curious person. With -inline-level=0 (/Ob0 or /Ob1) you can force the compiler not to inline your expression, which is done for -O2/3 as Xiaoping stated. Alternatively you can print/write intermediate products y an iy, which is not desired for production code for sure.

I tested to deactivate inlining with /Ob0, but that has no effect! Also, I could not find any clue in the optimization report, that inlining is done. Isn't FMA or IPO causing the difference? Can I trust the report (-opt-report:5)? Only by using -fp:source or -fp:strict the results are OK.

Nevertheless, it is no good idea to work with single in this case, where your multipliers can get up to approx. 1e5, because you have only six significant digits and the last one, I would not trust. However, the deviation between iy and y is -0.0052452087 and bigger than 1e-6?

Johannes_Rieke · ‎04-10-2017

It's just different rounding at different parts in machine code, I think. If you look at the products x*n it's easy to see:

x=0.3496874869e0, n=400000
x*n = 139874.9947547913 (real64)
x*n = 139875.0000000000 (real32)
floor(x*n)=139874 (real64)
floor(x*n)=139875 (real32)
x*n-floor(x*n)=-0.0052452087 if x*n is rounded before given to floor function

With -fsource-asm one can check the assembly code. This shows up inlining as Xiaoping said. With -O3 no call is seen.

@Ryan, your solution seems to be non-save. Maybe you can adopt mod for your purpose.

jimdempseyatthecove · ‎04-10-2017

To get the expected result

y = x*n
y = y - floor(y) ! should have same rounding

(you may have to play with -fp:source)

Jim Dempsey

Ryan_S_3 · ‎04-10-2017

Thank you all for the helpful and interesting replies.

Jim Dempsey suggested:

y = x*n
y = y - floor(y) ! should have same round

I agree with Jim: this should have the same round. But it does not. Lots of smart people have replied to this post, and the emperical evidence here clearly demonstrates that no reasonable programmer would anticipate what ifort is doing.

Here is something else that does not work, but seems like it should:

  REAL*4 :: y,z
  INTEGER*4 :: iy
  z  = x*n
  iy = z
  y  = z - iy

There are many other obvious variations on this theme, and they all fail unless I start playing with compiler flags. (I tried all these things before posting!) In our case, introducing compiler flags other than (-xHost -O1) is not really an option for this particular piece of code.

This works:

  REAL*4 :: y,z
  INTEGER*4 :: iy
  VOLATILE z
  z  = x*n
  iy = z
  y  = z - iy

and so does

  z  = x*n
  iy = FLOOR(z)
  y  = z - iy
  IF (y<0.e0) THEN
     y=y+1.e0
     iy=iy-1
  ELSE IF (y>1.e0) THEN
     y=y-1.e0
     iy=iy+1
  END IF

Johannes argues that REAL*4 is not safe, and that I should switch to REAL*8. For performance reasons, REAL*8 is not an option. Also, I disagree with the conclusion that REAL*4 is unsafe, for the following reason: We are quite aware of how floating point arithmetic works, and work hard to make our code robust. I don't care if n*x gets rounded to 139875.00 (producing y=0.00) or 139874.99 (producing y=0.99) or to something else, as long as the rounding is done consistently.

I simply cannot see how anyone could anticipate that expressions like "y=x*n \n y=y-FLOOR(y)" or "x*n-FLOOR(x*n)" might produce a negative result. I encourage Intel to think carefully about whether this is really how their compiler should work with -O1.

In the meantime, I have a few million lines to code to go through, looking for cases of integer truncation that have been fine for decades, but are no longer safe with our new version of ifort...

Ryan_S_3 · ‎04-10-2017

Johannes suggested that it may be safer to use MODULO than FLOOR. It was a good suggestion, but I find the result even more disappointing.

Here is what Intel says MODULO does http://software.intel.com/en-us/node/679578 (emphasis added):

result = MODULO (a,p) .... result has a value such that a = q * p + result where q is an integer .... If p > 0, then 0 <= result < p.

Here is example code:

SUBROUTINE floortst(x,n)
  REAL*4    :: x, y, z
  INTEGER*4 :: n

  z = x*n
  y = MODULO(z,1.e0)
  
  IF (y < 0.e0) THEN
     PRINT '("MODULO(v,1.e0) is negative! y=",E17.10)', y
  ELSE IF (y > 1.e0) THEN
     PRINT '("MODULO(v,1.e0) is >1! y=",E17.10)', y
  ELSE
     PRINT '("MODULO(v,1.e0) lies within [0,1], y=",E17.10)', y
  END IF
  RETURN

END SUBROUTINE floortst

PROGRAM drv
  INTEGER*4 :: n
  REAL*4    :: x
  n = 400000
  x = 0.3496874869e0
  CALL floortst(x,n)
END PROGRAM drv

Compiling with "-O1 -xCORE-AVX2" produces

MODULO(v,1.e0) is negative! y=-0.5245208740E-02

So, MODULO(z,1.e0) returns a negative number, even when the second argument is positive! This directly contradicts Intel's own documentation. Surely this can't be an acceptable result.

(A quick look at the assembly shows that ifort is replacing MODULO with FLOOR, so this is really just another example of the same problem, but in this case one that directly violates Intel's own documentation.)

Johannes_Rieke · ‎04-11-2017

Hi Ryan,

very interesting that modulo is replaced by floor. I have observed, that the issue is only present if -xCORE-AVX2 is present. Simple -O3 is OK. Further, I think that the numbers you have chosen is an extreme case of numbers. Intel is free to do rounding it one or another way, aren't they? For AVX2 they have chosen to round different. By the way, PSXE 2016 behaves the same for me.

Maybe someone from Intel can comment this?

In any case setting -fp:source should fix this even for AVX2. If you have a single routine, which is affected, you could change the compiler option only for this one. I wish you luck.

Johannes

Martyn_C_Intel · ‎04-12-2017

Hi Everyone,

Various people above have correctly identified FMA instructions as the source of the observed behavior. The function FLOOR(x*n) returns correct results; however, you are testing on the result of an expression, x*n - FLOOR(x*n), not the value of FLOOR(x*n) itself. The differences result from the optimization of this expression using an FMA (here, actually a fused multiply and subtract). The use of MODULO intrinsic function disguises the presence of the FMA somewhat, but as was already pointed out, it generates inline code that is essentially the same and is subsequently optimized in the same way as part of the entire program unit. The original generated inline code before optimization conforms to the documentation.

FMA's break the symmetry of expressions, it has nothing to do with the presence of intrinsic functions such as FLOOR or MODULO. Here is one of my favorite examples that contains no function calls:

real function test_fmsub(a, b, c, d)
  real, intent(in ) :: a, b, c, d
  test_fmsub = a*b - c*d
end

program drive_test_fmsub
  real :: a, b, c, d, result, test_fmsub
  a=0.3496874869e0
  b=40000
  c=a
  d=b
  result = test_fmsub(a,b,c,d)
  print *, result
end program drive_test_fmsub

Function and driver are two separate source files, so that I don't have to worry about inlining and the compiler evaluating stuff at compile time

$ ifort -xcore-avx2 test_fmsub.f90 drive_test_fmsub.f90; ./a.out
4.5204163E-04

If I disable FMA generation, either by compiling with -no-fma, by using a !DIR$ NOFMA directive (which has function-wide scope), by replacing -xcore-avx2 by -xavx or by compiling with -fp-model consistent, then the symmetry of the expression is preserved and there is an exact cancellation:

$ ifort -xcore-avx2 -no-fma test_fmsub.f90 drive_test_fmsub.f90; ./a.out
0.0000000E+00

There is some discussion of this (and other floating-point behavior) in my article attached as PDF at https://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler/

But fundamentally, this is just a consequence of the finite accuracy of floating-point calculations. There isn't a "right" and "wrong" answer; both are equally valid and within the expected uncertainty of the calculation. In general, an FMA instruction is slightly more accurate than a separate multiply and add, but it is different. You could qually well say that the non-zero result comes from the more limited accuracy of the separate multiply and add instructions. FMA generation is an optimization; the language standards do not specify how expressions should be broken down into FMAs, so (a*b-(c*d)) is as good as ((a*b)-c*d), but the result may be different. As was noted, splitting the calculation onto multiple source lines or using parentheses does not necessarily prevent the compiler optimization that generates FMAs. A rather ugly way to prevent it would be to compute each product separately and store in a temporary variable that is declared VOLATILE.

The best way to prevent these effects, if you need to, is to disable FMAs entirely using one of the methods above. -fp-model strict (/fp:strict on Windows) also disables FMAs, but does a bunch of other things too that you might not want and that have a bigger impact on performance. With the version 17 compiler, you can use -fp-model consistent to get the strictest language and IEEE conformance and results that should be the same at different optimization levels and on different processors of the same architecture. See the above article for more details.

One final remark: it's best not to rely on floating-point computations to give exact results. Consider

result = a*b - c*d
if(result.eq.0.0) then
...

This has given rise to bug reports in the past, when result is not exactly zero when expected, for the same reasons as above. The right way to do this is

if(abs(result) .lt. some_tolerance) then 
...

where the tolerance will depend on the expected magnitude of a*b.

jimdempseyatthecove · ‎04-12-2017

>>A rather ugly way to prevent it would be to compute each product separately and store in a temporary variable that is declared VOLATILE.

I that this is better than the alternatives. Ryan would have to assess the impact on performance.

This might be a good example to consider an intrinsic function that performs the equivalent to a write&read of a volatile without using memory.

y = AFFIX(x*n) - FLOOR(AFFIX(x*n)) ! AFFIX is hypothetical intrinsic

Jim Dempsey

Martyn_C_Intel · ‎04-14-2017

Intel C/C++ has memory fence intrinsics _mm_mfence, _mm_lfence and _mm_sfence. Intel Fortran does not, but the C intrinsics can be called from Fortran using the C-Fortran interoperability features of Fortran2003. You could use this to achieve the same thing as VOLATILE. You would declare an interface, e.g.

interface 
  subroutine ftn_mfence() bind (C, name = "_mm_mfence") 
     !DIR$ attributes known_intrinsic, default :: ftn_mfence
  end subroutine ftn_mfence
end interface

t1 = x*N
t2 = FLOOR(t1)
call ftn_mfence()
y = t1 - t2

Caveat - I haven't actually tested this. Printing the values of t1 and t2 before computing y would quite likely achieve the same thing.