Commutativity of Floating Point Arithmetic

gongo · ‎03-22-2011

That FP arithmetic can be tricky in multi-threaded environment is well known. However, I am getting strange results depending on whether I use -O2 optimization or not, even though the code itself is serial. In particular, in the enclosed sample code, four complex arrays, z1, z2, z3, and z4, are initialized with random numbers so that their value ranges are pairwise at different scales. Then the operation z1*z2 + z3*z4 performed elementwise and stored in an array z. Finally sum(z) is computed. If I compile with -O2 flag, the result depends on whether the operands are in the above order, or like: z2*z1 + z4*z3 (controlled at runtime by a command line argument). If I compile with -g, or with -fp-model precise, the two results are identical. Can anyone explain why this could be happening?

Thanks,
George

mecej4 · ‎03-22-2011

A one-liner answer is: "Ahem, random numbers are random!"

According to the Fortran standard, the sequence of random numbers you generate is processor dependent.
The Intel Fortran Reference manual states in the entry for RANDOM_NUMBER:

"If RANDOM_SEED is not used, the processor sets the seed for RANDOM_NUMBER to a processor-dependent value."

Since you are computing SUM( z1.z2 + z3.z4) and SUM(z2.z1 + z4.z3) in separate runs, you may be using random sequences generated from different seeds. Thus, you end up comparing two numbers (the sums) which are also random numbers, albeit with much smaller standard deviations.

If you want to make a valid comparison test for commutativity, you should explicitly do CALL RANDOM_SEED, and with the same initial set of seeds.

Or, to avoid having to set the seeds, in your program declare za and zb instead of z, and compute and print as follows.

za = z1*z2 + z3*z4
zb = z2*z1 + z4*z3

print *, sum(za)
print *, sum(zb)

Tim_Gallagher · ‎03-23-2011

I disagree about the random numbers. If you do not call RANDOM_SEED, the seed is set at compile time and will always generate the same random numbers in the same order every time you call the program. If you recompile, then the random numbers change. For example:

[fortran]PROGRAM RandTest
   IMPLICIT NONE

   REAL :: randNum
   INTEGER :: I


   DO I = 1, 10
      CALL RANDOM_NUMBER(randNum)
      PRINT *, randNum
   END DO
END PROGRAM RandTest
[/fortran]

And then I run it:

[bash]|tgallagher@harpy:test|> ./a.out 
  3.9208680E-07
  2.5480442E-02
  0.3525161    
  0.6669145    
  0.9630555    
  0.8382882    
  0.3353550    
  0.9153272    
  0.7958636    
  0.8326931    
|tgallagher@harpy:test|> ./a.out 
  3.9208680E-07
  2.5480442E-02
  0.3525161    
  0.6669145    
  0.9630555    
  0.8382882    
  0.3353550    
  0.9153272    
  0.7958636    
  0.8326931    
|tgallagher@harpy:test|> ./a.out 
  3.9208680E-07
  2.5480442E-02
  0.3525161    
  0.6669145    
  0.9630555    
  0.8382882    
  0.3353550    
  0.9153272    
  0.7958636    
  0.8326931    
[/bash]

As you can see, all the numbers are the same and in the same order because the seed is set at compile time. So, provided the code isn't recompiled, the differences in the sum's is due to floating point math consistency and not due to different random numbers.

This is discussed here:
http://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler/

Tim

mecej4 · ‎03-23-2011

Well, I have to see the magnitudes of the differences that the OP alluded to. I stand by my position that one cannot count on how a program will behave when the Fortran Standard says it is processor-dependent.

Steven_L_Intel1 · ‎03-23-2011

The way our implementation works is that if you do not call RANDOM_SEED, the seed chosen is a fixed value. If you call RANDOM_SEED with no arguments, a seed based on the time-of-day clock is used.

That said, the order of operations, and vectorization, can slightly change values. Floating point arithmetic operations are often not computationally communtative.

TimP · ‎03-23-2011

You may be interested in checking whether both loops are vectorized, and whether all arrays are aligned, and then treated by the compiler as aligned. In principle, the recent compilers should take care to produce same results regardless of alignment on this kind of loop.

mecej4 · ‎03-23-2011

Here are the results from one run, in which the same z1 and z2 arrays were used to compute za and zb using vector assignments rather than loops, as in #1:

za: (1.786555654004733E-007,1.688695763736762E-007)
zb: (1.786555654004733E-007,1.688695763736758E-007)

There is a discrepancy of about 2 _m in the imaginary part, and none in the real part.

Considering that 2.6 million complex numbers were added, the evaluation of each of which involved 2 X ( 4 multiplications + 3 additions ), I think that the discrepancy is negligibly small, and that fact is not surprising either; because the random numbers used are uniformly generated between (-a,a) and (-b, b), one need not fear accumulation of round-off error.

I have deliberately omitted mention of the compiler version, options, etc. I think that I need to be convinced that there is a problem to be solved.

I also ask whether we should not be more worried about the value itself (about 2.10^-7) being unacceptably large, given that the expected value is zero and the computed value is over 1000 times the range of magnitudes of za, zb.

mecej4 · ‎03-23-2011

Steve, I have two related comments/questions. One is that the documentation of RANDOM_SEED at

http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/fortran/lin/lref_for/source_files/rfrandms.htm

says, in the entry for RANDOM_SEED that "...If no argument is specified, a random number based on the date and time is assigned to the seed." This does not quite agree with the first line of output from the program

[fortran]program getseed
  integer :: n,seed(10)

  call random_seed(size=n)
  if(n.gt.10)stop 'seed array too small'

  call random_seed(get=seed)
  write(*,10)'Before PUT : ',seed(1:n)

  seed(1:n)=seed(1:n)*2
  call random_seed(put=seed)
  call random_seed(get=seed)
  write(*,10)'After  PUT : ',seed(1:n)

10 format(A,10(2x,Z11))

end program getseed
[/fortran]

The first line did not change even after I changed the PC clock to a different date.

[bash]Before PUT :      7FFFFFAA     7FFFFF06
After  PUT :      7FFFFFAA     7FFFFF06
[/bash]

The second comment is that I did not expect the second line to be the same as the first. With other compilers, I do see that the second line has values twice those of the first (with possible overflow ignored)

Tim_Gallagher · ‎03-23-2011

So I can confirm that your example gives the same seeds/random numbers. But we cooked up an example that generates a difference:

[fortran]MODULE Random_m
   IMPLICIT NONE
   PRIVATE
   PUBLIC :: InitializeRandomSeed

CONTAINS
   SUBROUTINE InitializeRandomSeed(seed_input)
      INTEGER, INTENT(IN) :: seed_input
      INTEGER :: i, n, clock
      INTEGER, DIMENSION(:), ALLOCATABLE :: seed
     
      CALL RANDOM_SEED(SIZE=n)
      ALLOCATE(seed(n))

      IF (seed_input > 0) THEN
         seed = seed_input
      ELSE
         CALL SYSTEM_CLOCK(COUNT=clock)
         seed = clock + 37 * (/ (i - 1, i = 1, n) /)   
      END IF

      CALL RANDOM_SEED(PUT=seed)
      CALL RANDOM_SEED(GET=seed)
      PRINT *, seed
      DEALLOCATE(seed)
   END SUBROUTINE InitializeRandomSeed
END MODULE Random_m

PROGRAM TestRandom
   USE Random_m
   INTEGER :: i, n, rsInput
   REAL :: x
   n = 10
   rsInput = -1

   CALL InitializeRandomSeed(rsInput)
   DO i = 1, n
      CALL RANDOM_NUMBER(x)
      WRITE(*,*) 'x = ', x
   END DO

   WRITE(*,*) 'AGAIN'
   CALL InitializeRandomSeed(rsInput)
   DO i = 1, n
      CALL RANDOM_NUMBER(x)
      WRITE(*,*) 'x = ', x
   END DO
END PROGRAM TestRandom
[/fortran]

Ironically, the random numbers/seeds are different on Intel but when we compile and run with gfortran, the seeds and random numbers are always the same before and after the second call to IntializeRandomSeed.

Tim

mecej4 · ‎03-23-2011

This is a shot in the dark: perhaps the Intel RNG needs to be primed by having it generate a few output numbers before it actually uses a new seed given to it with a PUT request.

GFortran (I have V 4.5) uses an 8-element seed vector. Your example in #8 sends it a single value, which is spread to the whole vector by the statement

[fortran]  IF (seed_input > 0) THEN  
          seed = seed_input
[/fortran]

It would not be surprising if GFortran ignored a PUT request with an incomplete/degenerate seed vector.

Steven_L_Intel1 · ‎03-23-2011

mecej4, I don't see that your program called RANDOM_SEED with no arguments. But I agree that the behavior is unexpected and I will look into that. I modified the program to call RANDOM_SEED with no arguments and was puzzled to see that I got the same seed back each time - didn't expect that either. I'll have a word or three with the library developers tomorrow about that.

mecej4 · ‎03-23-2011

Steve,

> I don't see that your program called RANDOM_SEED with no arguments.

Quite so. I assumed that if RANDOM_NUMBER was called with no preceding call to RANDOM_SEED in the current process, the state of the RNG would be the same as if an implicit call to RANDOM_NUMBER had been made with no arguments.

Thanks for looking into this.

Tim_Gallagher · ‎03-23-2011

So now I'm more confused.

The example calls the Initialize routine with -1 always, so it goes into the ELSE clause that uses SYSTEM_CLOCK.

So I modified it so that the seed is no longer degenerate by:

[fortran]      IF (seed_input > 0) THEN
         seed = seed_input
      ELSE
         DO I = 1, n
            CALL SYSTEM_CLOCK(COUNT=clock)
            seed(I) = clock*(I-1)
         END DO
      END IF[/fortran]

With that change, gfortran generates different random numbers but now this example with Intel generates the same random numbers and the same seed after the PUT.

Tim

jimdempseyatthecove · ‎03-24-2011

George,

The problem you are observing in apparent non-commutativity is your additions are saturating the mantissa of the floating point numbers. This means you are experiencing rounding errors at different points in the calculations dependent on the sequence of operations.

Jim Dempsey

jimdempseyatthecove · ‎03-24-2011

>>If I compile with -g, or with -fp-model precise, the two results are identical

With those options, the generated code uses FPU instructions as opposed to SSE instructions. FPU internal TEMP (FPU stack) are 80-bit (greater precision than double). Therefore you will see less round off error (no apparent error in many more cases). Rounding errors are a fact of life with some sequences of arithmatic.

Jim Dempsey

TimP · ‎03-24-2011

The "feature" of changing from SSE to X87 at -O0 (implied by -g, when no -O level is set) is present only in 11.1 and earlier compilers. As I understand it, the 12.0 compilers generate X87 code only when explicitly requested (by the 32-bit -mia32 option).
-fp-model precise (same as -fp-model source, for ifort, but not for icc) by itself doesn't change from SSE to X87 instructions. It does restore gradual underflow and strict compliance in the C as well as Fortran sense with the order of operations implied in your source code.
In the 12.0 compilers, the new option -standard-semantics is available to comply more fully with Fortran standards, apparently without setting gradual underflow. I haven't checked whether it disables optimization of sum and dot product reductions, as -fp-model precise|source do.

Steven_L_Intel1 · ‎03-24-2011

I've got some interesting info regarding RANDOM_SEED. It seems that if you do a PUT with a seed that has negative values, it ignores the PUT. At least this is the way it looks to me. More investigation is needed.

Calling RANDOM_SEED with no arguments does use a time-based seed and PUT does work otherwise. I have attached a program which demonstrates this.

Steven_L_Intel1 · ‎03-24-2011

Ok, mystery solved. I was clued in by the following text in the Fortran standard:

The value returned by GET
need not be the same as the value specied by PUT in an immediately preceding reference to RANDOM SEED.

For example, following execution of the statements

CALL RANDOM_SEED (PUT=SEED1)
CALL RANDOM_SEED (GET=SEED2)

SEED2 need not equal SEED1. When the values dier, the use of either value as the PUT argument in a
subsequent call to RANDOM SEED shall result in the same sequence of pseudorandom numbers being generated.

For example, after execution of the statements

CALL RANDOM_SEED (PUT=SEED1)
CALL RANDOM_SEED (GET=SEED2)
CALL RANDOM_NUMBER (X1)
CALL RANDOM_SEED (PUT=SEED2)
CALL RANDOM_NUMBER (X2)

X2 equals X1.

In the test programs here, the static initial seed was retrieved, multiplied by 2, and then PUT, but when GET was done, the "old value" was retrieved. I found that BOTH values in fact delivered the same RANDOM_NUMBER harvest. So RANDOM_SEED is doing some "filtering" of the incoming seed value to make sure it is within proper limits for the algorithm. No bug.

gongo · ‎03-24-2011

Thanks to Jim Dempsey and TimP for explaining the effect of fp-model and -O2 compiler options on the swtich to/from SSE instruction set. It looks like the only thing that makes a difference in this case is the fact that the FPU uses an 80-bit register for intermediate results. What is still puzlling is why, as it turns out, the difference appears only in the imaginary part?

The discussion on the random number generation is very interesting, but irrelevant for this problem (perhaps move to a new thread?). One can see the result of the non-commutative operations by using, for instance, the following (very deterministic) constants:

a = dcmplx(1.102323407845943d-10, -2.132323407845943d-10)
b = dcmplx(-2.392085985049895d-3, 3.3989085985049895d-3)
c = dcmplx(6.293823948579384d-10, 8.291823948579384d-10)
d = dcmplx(8.129834587968582d-3, 1.1298342879685822d-3)

The code output for a*b + c*d is:
(4.641008061166196E-012,8.336953268281271E-012)
while the output for b*a + d*c is
(4.641008061166196E-012,8.336953268281273E-012)

The reason I used randomly generated arrays was to save myself the trouble of finding specific quadrupules of complex numbers with this non-commutative property (it's not that easy to do manually). The sum of all elements is simply used as a scalar diagnostic, although it most likely contributes further to the discrepancy because of differences in the order of addition between SSE and FPU. In a modified version of the test code, I've done pairwise comparison of the elements of the two z arrays and it turns out that about 2/3 are different.

--George

mecej4 · ‎03-24-2011

> What is still puzlling is why, as it turns out, the difference appears only in the imaginary part.

It is easy to see that the perceived difference is an artifact of binary->decimal conversion, at least for the scalar example that you showed in #18.

If you are going to compare 64-bit reals to the last bit, please print using hexadecimal notation.

[fxfortran]      program cm
      double complex a,b,c,d,z1,z2
      integer iz1(2 4),iz2(2 4)
      equivalence (iz1,z1),(iz2,z2)
      a = dcmplx(1.102323407845943d-10, -2.132323407845943d-10)
      b = dcmplx(-2.392085985049895d-3, 3.3989085985049895d-3)
      c = dcmplx(6.293823948579384d-10, 8.291823948579384d-10)
      d = dcmplx(8.129834587968582d-3, 1.1298342879685822d-3)
      z1=a*b+c*d
      z2=b*a+d*c
      write(*,*)z1
      write(*,*)z2
      write(*,20)iz1(2),iz1(1),iz1(4),iz1(3) (underlined part added later)
      write(*,20)iz2(2),iz2(1),iz2(4),iz2(3)
   20 format(1x,2Z8,2x,2Z8)
      end
[/fxfortran]

gives the result

[bash] (4.641008061166196E-012,8.336953268281273E-012)
 (4.641008061166196E-012,8.336953268281271E-012)
 3D94694F7FCB1DC7  (this is the real part; see below)
 3D94694F7FCB1DC7
[/bash]

GFortran seems to do the formatting slightly better:

[bash] ( 4.64100806116619663E-012, 8.33695326828127092E-012)
 ( 4.64100806116619663E-012, 8.33695326828127092E-012)
 3D94694F7FCB1DC8
 3D94694F7FCB1DC8
[/bash]

You may wish to explore whether the same explanation applies to your original test with big arrays of random elements.

Note, added subsequently:

The above program was in error, and I offer my apologies. Here is a corrected version.

[fxfortran]      program cm
      double complex a,b,c,d,z1,z2
      integer iz1(4),iz2(4)
      equivalence (iz1,z1),(iz2,z2)
      a = dcmplx(1.102323407845943d-10, -2.132323407845943d-10)
      b = dcmplx(-2.392085985049895d-3, 3.3989085985049895d-3)
      c = dcmplx(6.293823948579384d-10, 8.291823948579384d-10)
      d = dcmplx(8.129834587968582d-3, 1.1298342879685822d-3)
      z1=a*b+c*d
      z2=b*a+d*c
      write(*,*)z1
      write(*,*)z2
      write(*,20)iz1(2),iz1(1),iz1(4),iz1(3)
 write(*,20)iz2(2),iz2(1),iz2(4),iz2(3)
 20 format(1x,2Z8,2x,2Z8)

      end[/fxfortran]

The results from the corrected program, compiled using the 12.0.2 compiler, with /Od, show that all bits of z1 and z2 agree. With /O2, there is a difference in the least significant bit of the imaginary part.

[bash] (4.641008061166196E-012,8.336953268281273E-012)
 (4.641008061166196E-012,8.336953268281271E-012)
 3D94694F7FCB1DC7  3DA255499696C399
 3D94694F7FCB1DC7  3DA255499696C398



[/bash]

Since a*b is about a tenth of c*d, it is not surprising that adding them causes a loss of 1 bit of precision or that the compiler optimization level has a small but noticeable effect.

TimP · ‎03-24-2011

The strictest criteria offered by IEEE754 for WRITE conversions apply only down to 1e-10, but it's a little surprising to see this non-repeatability come in at this point, appearing to indicate an off-by-1-ULP in the first case produced by ifort.