(Update: There is an informative discussion on Fortran Discourse about this bug)
Consider the following code.
```
! test.f90
program test
implicit none
! Why these strange numbers? Because they come from a real project.
real, parameter :: A(25) = &
& [1.2409463E+22, 2.8192031E+20, -4.4971432E+21, 1.1558917E+22, -4.7725394E+21, &
& 2.8192031E+20, 2.8192031E+20, -3.2276437E+21, -3.3936284E+21, 4.1412172E+21, &
& -4.4971432E+21, -3.2276437E+21, -7.6997325E+20, -4.1627641E+21, -3.4729614E+20,&
& 1.1558917E+22, -3.3936284E+21, -4.1627641E+21, 2.5757004E+21, -2.4898961E+21, &
& -4.7725394E+21, 4.1412172E+21, -3.4729614E+20, -2.4898961E+21, 3.5042103E+21]
real :: S(5, 5)
S = reshape(A, [5, 5])
! Is S symmetric?
write (*, *) all(abs(S - transpose(S)) <= 0)
! The numbers in S are horrible. Let's scale it.
S = S / maxval(abs(S))
write (*, *) S
! Is S symmetric?
write (*, *) all(abs(S - transpose(S)) <= 0)
end program test
```
Here is what happens on my computer.
```
$ uname -a && ifort --version
Linux 5.15.0-53-generic #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
ifort (IFORT) 2021.7.1 20221019
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.
$ ifort test.f90 && ./a.out
 T
  0.9999999      2.2718171E-02 -0.3623963      0.9314599     -0.3845887
  2.2718171E-02  2.2718171E-02 -0.2600954     -0.2734710      0.3337145
 -0.3623963     -0.2600954     -6.2047265E-02 -0.3354508     -2.7986396E-02
  0.9314599     -0.2734710     -0.3354508      0.2075594     -0.2006450
 -0.3845887      0.3337145     -2.7986394E-02 -0.2006450      0.2823821
 F
```
Note that S(3,5) /= S(5,3) after the scaling. This happens when ifort is invoked with -O, -O2, -O3, -O4, or -O5, but not -g, -O0, -O1, or -Ofast.
Is this expected? It seems like a bug to me.
Posted this to Fortran Discourse https://fortran-lang.discourse.group/t/strange-behavior-of-ifort/4846/47
I have some insight into this issue. The mathematics is actually reproducible. The insight came from replacing the data in the 5x5 matrix with a single repeated value, simplifying the scaling to a plain division, and then using -qopt-report=5 and -S to examine the data in both hex and default float formats.
Here is my modified test
```
program test
implicit none
integer :: row , col
! Why these strange numbers? Because they come from a real project.
real, parameter :: A(25) = &
& [-3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &
& -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &
& -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20,&
& -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &
& -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20]
real :: S(5, 5)
S = reshape(A, [5, 5])
! The numbers in S are horrible. Let's scale it.
!S = S / maxval(abs(S))
do col=1,5
do row=1,5
S(row,col) = S(row,col) / 1.2409463E+22
end do
end do
write (*,*) "S after"
do row=1,5
write(*,'(5(Z8,3x))') S(row,:)
end do
print *, ""
write (*, *) S
! Is S symmetric?
write (*, *)'Is S symmetric?', all(abs(S - transpose(S)) <= 0)
end program test
```
Now just run a simple compilation
```
]$ ifort test3.f90
$ ./a.out
S after
BCE543B9 BCE543B9 BCE543B9 BCE543B9 BCE543B9
BCE543B9 BCE543B9 BCE543B9 BCE543B9 BCE543B9
BCE543B9 BCE543B9 BCE543B9 BCE543B9 BCE543B9
BCE543B9 BCE543B9 BCE543B9 BCE543B9 BCE543B9
BCE543BA BCE543BA BCE543BA BCE543BA BCE543BA
-2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
-2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
-2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
-2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
-2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
Is S symmetric? F
```
Do you see the pattern? It is systematic, not random.
Now, with some inside knowledge: the default, with no -O option, is -O2, which vectorizes for SSE2 with its 128-bit registers. You can verify this by using -dryrun:
```
$ ifort test3.f90 -dryrun |& grep -i sse
-D__SSE__ \
-D__SSE_MATH__ \
-D__SSE2__ \
-D__SSE2_MATH__ \
```
Here we are telling the backend it MAY use SSE2. This is a suggestion, not a directive; that is all the define does. You will see the same defines in the -prec-div case, but the backend will also see -prec-div and decide for itself how to process the loop nest.
How many 32-bit data items fit in an XMM register? Four. Go down each column, in memory order, and you see the pattern: under optimization, we pack 4 elements into an XMM register and do vectorized math. The 5th element of each column is a remainder that is not handled in a packed vector register with vector instructions. This repeats for every column, and you can see it in the OPT REPORT.
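To make the effect of a low-precision packed reciprocal concrete, here is a small NumPy sketch (Python, purely illustrative; the mantissa-truncation stand-in for rcpps and the single Newton-Raphson refinement are assumptions about what a code generator typically emits, not a dump of ifort's actual output):

```python
import numpy as np

# Crude stand-in for rcpps: the hardware returns a reciprocal estimate
# good to roughly 12 bits; we fake that by truncating the mantissa.
def rcp_estimate(y):
    r = np.float32(1.0) / y
    bits = r.view(np.uint32) & np.uint32(0xFFFFF000)  # keep ~11 explicit mantissa bits
    return bits.view(np.float32)

# Estimate refined by one Newton-Raphson step, roughly what an optimizer
# emits for the packed lanes instead of a true divide.
def vector_style_div(x, y):
    r = rcp_estimate(y)
    r = r * (np.float32(2.0) - y * r)  # relative error drops from ~2^-12 to ~2^-24
    return x * r

x = np.float32(-3.4729614e20)    # values from the test program
y = np.float32(1.2409463e22)
exact = x / y                    # what a scalar divss computes (correctly rounded)
approx = vector_style_div(x, y)  # may disagree with exact in the last ulp
print(exact, approx, exact == approx)
```

The refined reciprocal path is usually correct to within an ulp or two of the true quotient, which is exactly the size of the discrepancy seen in the hex dump above.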
Here's our loop of interest:
```
19 !S = S / maxval(abs(S))
20 do col=1,5
21 do row=1,5
22 S(row,col) = S(row,col) / 1.2409463E+22
23 end do
24 end do
```
and the opt-report on that loop
```
LOOP BEGIN at test3.f90(20,1)
remark #15542: loop was not vectorized: inner loop was already vectorized
LOOP BEGIN at test3.f90(21,3)
remark #15389: vectorization support: reference s(row,col) has unaligned access [ test3.f90(22,4) ]
remark #15389: vectorization support: reference s(row,col) has unaligned access [ test3.f90(22,4) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 4
remark #15309: vectorization support: normalized vectorization overhead 0.158
remark #15300: LOOP WAS VECTORIZED
remark #15450: unmasked unaligned unit stride loads: 1
remark #15451: unmasked unaligned unit stride stores: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 30
remark #15477: vector cost: 9.500
remark #15478: estimated potential speedup: 2.020
remark #15486: divides: 1
remark #15488: --- end vector cost summary ---
remark #25015: Estimate of max trip count of loop=1
LOOP END
LOOP BEGIN at test3.f90(21,3)
<Remainder loop for vectorization>
remark #25436: completely unrolled by 1
LOOP END
LOOP END
```
Reading this, there is a lot of good information. The compiler vectorizes the inner loop, the 5 elements of a column, but an XMM register holds 128 bits, i.e. 4 single-precision elements. So there is a leftover element, which is handled differently from the other 4. Note also that we did not use -align array32byte and tag the loop as aligned, which might improve efficiency here; that is an additional manual optimization, not necessary for this discussion.
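The packed-versus-remainder split described above can be sketched in a few lines (Python, illustrative only; a tiny model of which elements of a 5x5 column-major array land in the 4-wide packed lanes versus the scalar remainder):

```python
# For a 5x5 Fortran (column-major) array with a 4-wide float32 vector,
# the vectorized inner loop over row packs rows 1..4 of each column into
# one XMM register; row 5 of each column falls into the remainder loop.
def lane(row, col, n_lanes=4):
    return "packed" if row <= n_lanes else "remainder"

print(lane(3, 5), lane(5, 3))  # packed remainder
```

This is why a pair of symmetric entries can be computed by two different instruction sequences.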
In the original example, element (3,5) is computed in vector mode; element (5,3) is a remainder, computed in scalar mode. Comparing the assembly generated with and without -prec-div shows the difference. Without -prec-div, the packed part of the loop uses
```
rcpps %xmm2, %xmm0 #22.4
```
which is a reciprocal vector instruction of lesser precision; it handles the first 4 elements. The remainder is done with a true scalar divide:
```
divss %xmm1, %xmm3 #22.4
```
whereas the code compiled with -prec-div uses a divide everywhere, including the packed part:
```
divps %xmm1, %xmm2 #22.4
```
So the results are predictable and reproducible. Turning on -prec-div, or alternatively -no-vec, has the same corrective effect.
Dear Intel Team members,
Here is a comment regarding the -fp:fast compiler option with IFORT versus other options at the Fortran Discourse site that may be of interest here.
Intel Forum Admin: is this thread better suited for the Fortran forum since it appears a matter with IFORT product in HPC toolkit? If so, can you please move it to
https://community.intel.com/t5/Intel-Fortran-Compiler/bd-p/fortran-compiler
I'm reading the discussion on Fortran Discourse. I will investigate.
ron
This is an interesting little example for many reasons. Where to begin? Let's start with some simple things:
1. Intel compilers invoked without an explicit -O option default to -O2, and Intel's -O2 is aggressive.
2. On most compilers, optimization comes at the expense of precision. You have to balance your need for the highest FP precision against performance.
3. Division, among other mathematical expressions and intrinsics, may be done differently depending on the optimization level.
4. Every compiler does optimization differently. That does not mean one is 'right' and another 'wrong'; they just make different choices in balancing precision and speed.
Skip to the solution: for IFORT, use the
-prec-div
option to force precise divisions even under optimization. Read up on this option [HERE](https://www.intel.com/content/www/us/en/develop/documentation/fortran-compiler-oneapi-dev-guide-and-reference/top/compiler-reference/compiler-options/floating-point-options/prec-div-qprec-div.html). -fp-model is a macro option that sets a number of different FP behaviors; -fp-model precise triggers -prec-div among other behaviors.
You can read on if you want more details.
I'll start with #4. As noted, IFX and IFORT behave differently with this example. The optimization in our LLVM-based IFX is radically different from that in our legacy proprietary IFORT compiler, so you SHOULD expect possible FP differences between these two compilers under optimization. Note in the documentation that -prec-div is an option for IFORT only. Why? LLVM's FP behavior is different from that of our Classic proprietary compiler.
On point #1 above: when you encounter questions about 'what is the compiler doing?', there are easier ways than dumping assembly and reading it. First, use the
-dryrun
option. This will show you all internal defines and options passed to the compiler. It's quite helpful.
Next, with IFORT use the
-qopt-report=5
option and examine the .optrpt output.
On point #3: comparing opt reports generated with and without -prec-div at -O2, you can see that the code generator behaves differently; there is a difference in the number of temporary values ("Locals"). This was a clear indicator that the strategy the code generator uses for the division is quite different. Again, read about the -O2 and -prec-div options in the [Developer Guide](https://www.intel.com/content/www/us/en/develop/documentation/fortran-compiler-oneapi-dev-guide-and-reference/top/compiler-reference/compiler-options/alphabetical-option-list.html).
On point #1, the -O2 default: out of the box, Intel prioritizes performance over numerical precision when you do not add any -O options. This philosophy can be, and has been, argued in the community. Some ASSUME that without -O you would get -O0. IFORT has been around for many years and is widely adopted, and where possible we try to make IFX behave out of the box as closely to IFORT as possible, so there are no surprises for existing customers; hence -O2 is the default. But if you are coming from Flang or gfortran, this default may surprise you.
Thanks for showing us this example. It was quite interesting.
ron
#IAmIntel
>>Note in the documentation on -prec-div that this is only an option for IFORT.
Does this imply ifx uses the equivalent precision as ifort with -prec-div? .OR. not?
If not, is there a similar functioning option for ifx as -prec-div for ifort?
Jim Dempsey
Thank you very much for this wonderful explanation. Everything makes sense now. So the behavior is a side effect of optimization, not a bug.
The combined packed FP operation plus remainder FP operation was my expectation of what was happening; it is good to see that confirmed now. I guess the main point of note is that comparing real numbers for exact equality (here, testing a difference against zero) is not a robust way of coding and can bite back!
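A tolerance-based check avoids this trap. Here is a minimal NumPy sketch (Python for illustration; the 1e-6 relative tolerance is an arbitrary choice you would tune per application):

```python
import numpy as np

def is_symmetric(S, rel_tol=1e-6):
    # Compare against a tolerance scaled by the matrix magnitude instead
    # of demanding bit-exact agreement between S and its transpose.
    scale = np.max(np.abs(S))
    return bool(np.all(np.abs(S - S.T) <= rel_tol * scale))

# A matrix that is symmetric up to one float32 ulp, like the ifort output.
S = np.array([[1.0, 2.0], [2.0 + 2e-7, 1.0]], dtype=np.float32)
print(bool(np.all(np.abs(S - S.T) <= 0)), is_symmetric(S))  # False True
```

The exact test (`abs(S - transpose(S)) <= 0`) rejects the one-ulp asymmetry, while the scaled tolerance accepts it as symmetric for practical purposes.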