Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.
28597 Discussions

A bug of ifort (IFORT) 2021.7.1 20221019

Zaikun
New Contributor I
1,258 Views

(Update: There is an informative discussion on Fortran Discourse about this bug)

Consider the following code.

 

! test.f90                                                                                                                           
program test                                                                                                                         
                                                                                                                                     
implicit none                                                                                                                                                                                                        
                                                                                                                                     
! Why these strange numbers? Because they come from a real project.                                                                  
real, parameter :: A(25) = &                                                                                                         
    & [1.2409463E+22, 2.8192031E+20, -4.4971432E+21, 1.1558917E+22, -4.7725394E+21, &                                                
    & 2.8192031E+20, 2.8192031E+20, -3.2276437E+21, -3.3936284E+21, 4.1412172E+21, &                                                 
    & -4.4971432E+21, -3.2276437E+21, -7.6997325E+20, -4.1627641E+21, -3.4729614E+20,&                                               
    & 1.1558917E+22, -3.3936284E+21, -4.1627641E+21, 2.5757004E+21, -2.4898961E+21, &                                                
    & -4.7725394E+21, 4.1412172E+21, -3.4729614E+20, -2.4898961E+21, 3.5042103E+21]                                                  
                                                                                                                                     
real :: S(5, 5)                                                                                                                      
                                                                                                                                     
S = reshape(A, [5, 5])                                                                                                               
                                                                                                                                     
! Is S symmetric?                                                                                                                    
write (*, *) all(abs(S - transpose(S)) <= 0)                                                                                         
                                                                                                                                     
! The numbers in S are horrible. Let's scale it.                                                                                     
S = S / maxval(abs(S))                                                                                                               
                                                                                                                                     
write (*, *) S                                                                                                                       
                                                                                                                                     
! Is S symmetric?                                                                                                                    
write (*, *) all(abs(S - transpose(S)) <= 0)                                                                                         
                                                                                                                                     
end program test

 

Here is what happens on my computer.

 

$ uname -a && ifort --version 

Linux 5.15.0-53-generic #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

ifort (IFORT) 2021.7.1 20221019
Copyright (C) 1985-2022 Intel Corporation.  All rights reserved.

$ ifort test.f90 && ./a.out 
 
T

  0.9999999      2.2718171E-02 -0.3623963      0.9314599     -0.3845887    
  2.2718171E-02  2.2718171E-02 -0.2600954     -0.2734710      0.3337145    
 -0.3623963     -0.2600954     -6.2047265E-02 -0.3354508     -2.7986396E-02
  0.9314599     -0.2734710     -0.3354508      0.2075594     -0.2006450    
 -0.3845887      0.3337145     -2.7986394E-02 -0.2006450      0.2823821    

 F

 

Note that S(3,5) /= S(5,3) after the scaling. This happens when ifort is invoked with -O, -O2, -O3, -O4, or -O5, but not -g, -O0, -O1, or -Ofast. 

 

Is this expected? It seems like a bug to me. 

0 Kudos
1 Solution
Ron_Green
Moderator
1,065 Views

Posted this to Fortran Discourse https://fortran-lang.discourse.group/t/strange-behavior-of-ifort/4846/47



I have some insight into this issue. The mathematics is actually reproducible. The insight came from changing the data in the 5x5, and simplifying the scaling to a simple division. Then using -qopt-report=5 and -S and examining the data in both hex and default float format

Here is my modified test

```

program test                                                             

                                                                   

implicit none                                                                                                     

integer :: row , col                                                                   

! Why these strange numbers? Because they come from a real project.                                  

real, parameter :: A(25) = &                                                     

  & [-3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &                         

  & -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &                         

  & -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20,&                        

  & -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &                         

  & -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20]                          

                                                                   

real :: S(5, 5)                                                            

                                                                   

S = reshape(A, [5, 5])                                                        

                                                                   

! The numbers in S are horrible. Let's scale it.                                           

!S = S / maxval(abs(S))                                                        

do col=1,5

 do row=1,5

  S(row,col) = S(row,col) / 1.2409463E+22

 end do

end do

write (*,*) "S after"

do row=1,5

 write(*,'(5(Z8,3x))') S(row,:)

end do

print *, ""

                                                                   

write (*, *) S                                                            

                                                                   

! Is S symmetric?                                                           

write (*, *)'Is S symmetric?', all(abs(S - transpose(S)) <= 0)                                             


end program test


```

Now just run a simple compilation

```

]$ ifort test3.f90

$ ./a.out

 S after

BCE543B9  BCE543B9  BCE543B9  BCE543B9  BCE543B9

BCE543B9  BCE543B9  BCE543B9  BCE543B9  BCE543B9

BCE543B9  BCE543B9  BCE543B9  BCE543B9  BCE543B9

BCE543B9  BCE543B9  BCE543B9  BCE543B9  BCE543B9

BCE543BA  BCE543BA  BCE543BA  BCE543BA  BCE543BA

 

 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02

 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02

 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02

 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02

 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02

 Is S symmetric? F


```

Do you see the pattern? It is stochastic. Not random.

Now, with inside knowledge I know that default, no -O option, gives us -O2 which vectorizes with SSE2, 128 bit registers. You can also verify this by using -dryrun:

```

$ ifort test3.f90 -dryrun |& grep -i sse

  -D__SSE__ \

  -D__SSE_MATH__ \

  -D__SSE2__ \

  -D__SSE2_MATH__ \

```

Here we are telling the backend it MAY use SSE2. Not demanding, but suggesting to the backend. You'll see the same define in the -prec-div case - but remember, it's a suggestion not a directive. It may use SSE2, that is all the define is doing. the backend will see -prec-div and decide how to process the loop nest.


 How many 32bit data items in the XMM registers? 4.  Go down the row, in memory order - see the pattern? at optimization, we're packing 4 elements into an XMM register and doing vectorized math. For the 5th element in each row we have a remainder that is not done in a packed vector register with vector instructions. Repeat this for each column. You can see this in the OPT REPORT.  

Here's our loop of interest:

```

   19 !S = S / maxval(abs(S))

   20 do col=1,5

   21  do row=1,5

   22  S(row,col) = S(row,col) / 1.2409463E+22

   23  end do

   24 end do


```

and the opt-report on that loop

```

LOOP BEGIN at test3.f90(20,1)

  remark #15542: loop was not vectorized: inner loop was already vectorized


  LOOP BEGIN at test3.f90(21,3)

   remark #15389: vectorization support: reference s(row,col) has unaligned access  [ test3.f90(22,4) ]

   remark #15389: vectorization support: reference s(row,col) has unaligned access  [ test3.f90(22,4) ]

   remark #15381: vectorization support: unaligned access used inside loop body

   remark #15305: vectorization support: vector length 4

   remark #15309: vectorization support: normalized vectorization overhead 0.158

   remark #15300: LOOP WAS VECTORIZED

   remark #15450: unmasked unaligned unit stride loads: 1

   remark #15451: unmasked unaligned unit stride stores: 1

   remark #15475: --- begin vector cost summary ---

   remark #15476: scalar cost: 30

   remark #15477: vector cost: 9.500

   remark #15478: estimated potential speedup: 2.020

   remark #15486: divides: 1

   remark #15488: --- end vector cost summary ---

   remark #25015: Estimate of max trip count of loop=1

  LOOP END


  LOOP BEGIN at test3.f90(21,3)

  <Remainder loop for vectorization>

   remark #25436: completely unrolled by 1

  LOOP END

LOOP END

```

Reading this, lots of good stuff. It's vectorizing the inner loop, the 5 elements in a column. But of course the XMM registers process 128 bits or 4 elements. So there is a leftover, which is done in differently that the other 4. Note also that we didn't use -align array32byte and tag the loop as aligned, which might help efficiency here. but that is just an additional optimization we could do manually, not necessary for this discussion.

 

In the original example, element (3,5) is in a vector mode. Element (5,3) is a remainder, not in a vector mode. Examining the assembly with both cases with and without -prec-div you can see the vector mode, specifically this:

<     rcpps   %xmm2, %xmm0                 #22.4

which is a reciprocal vector instruction of lesser precision. This is done for the first 4 elements. The remainder is done with a div. 

<     divss   %xmm1, %xmm3                 #22.4


whereas the code with -prec-div uses div everywhere

>     divps   %xmm1, %xmm2                 #22.4


So - predictable, reproducible results. 


Turning on prec-div OR you can also use -no-vec, same effect.



View solution in original post

8 Replies
FortranFan
Honored Contributor III
1,201 Views

Dear Intel Team members,

Here is a comment re: -fp:fast compiler option with IFORT versus other options at the Fortran DIscourse site that may be of interest here.

 

0 Kudos
FortranFan
Honored Contributor III
1,199 Views

Intel Forum Admin: is this thread better suited for the Fortran forum since it appears a matter with IFORT product in HPC toolkit?  If so, can you please move it to

https://community.intel.com/t5/Intel-Fortran-Compiler/bd-p/fortran-compiler

0 Kudos
Ron_Green
Moderator
1,132 Views

I'm reading the discussion on Fortran Discourse.   I will investigate.

 

ron

0 Kudos
Ron_Green
Moderator
1,117 Views

This is an interesting little example for many reasons. Where to begin. Let's start with some simple things


1. Intel compilers when invoked without an explicit -O option defaults to O2. And Intel's O2 is aggressive.

2. Optimizations on most compilers come at the expense of precision. You have to balance your needs for highest FP precision and performance.

3. Division, amongst other mathematical expressions and intrinsics, may be done differently depending on optimization level.

4. Every compiler does optimization differently. Does not mean one is 'right' and the other 'wrong'. Just different choices in balancing precision and speed.


Skip to the solution: For IFORT use 


-prec-div 


option to force precise divisions even under optimzation. Read up on this option [HERE](https://www.intel.com/content/www/us/en/develop/documentation/fortran-compiler-oneapi-dev-guide-and-reference/top/compiler-reference/compiler-options/floating-point-options/prec-div-qprec-div.html). -fp-model is a macro option that sets a number of different FP behaviors. PRECISE triggers -prec-div amongst other behaviors.


You can read on if you want more details.


I'll start with #4. As noted, IFX and IFORT behave differently with this example. Our LLVM-based IFX optimization is radically different than that employed with our legacy proprietary IFORT compiler. You SHOULD expect possible FP differences between these 2 compilers under optimizations. Note in the documentation on -prec-div that this is only an option for IFORT. Why? LLVM FP behavior is different than on our Classic proprietary compiler. 


On point #1 above: When you encounter questions about 'what is the compiler doing?' there are easier ways than dumping assembly and reading that. First, use


-dryrun 


option. This will show you all internal defines and options passed to the compiler. It's quite helpful.


Next, with IFORT use the 


-qopt-report=5 


option and examine the .optrpt output.  

On Point # 3: Now doing this, and comparing the differences between -prec-div and not at O2 you can see the "Code gen" or code generator is different - difference in number of temporary values ("Locals"). This was a clear indicator that strategy used by the code generator for the division is quite different. Again, read the -O2 and -prec-div options in the [Developer Guide](https://www.intel.com/content/www/us/en/develop/documentation/fortran-compiler-oneapi-dev-guide-and-reference/top/compiler-reference/compiler-options/alphabetical-option-list.html


On point #1, O2 default. Intel prioritizes performance by default "out of box", favoring that over numerical precision when you do not bother to add additional compiler options (if you don't give the compiler any -O options). This philosophy can and has been argued in the community. Some ASSUME that without -O that you would get O0. IFORT has been around for a lot of years and is widely adopted. Where possible, we try to make IFX "out of the box" as close to behavior of IFORT as possible. So no surprises for existing customers, O2 is default. But if you are coming from FLang or gfortran then this default of O2 may be a surprise to you.


Thanks for showing us this example. It was quite interesting.


ron

#IAmIntel



0 Kudos
jimdempseyatthecove
Honored Contributor III
1,077 Views

>>Note in the documentation on -prec-div that this is only an option for IFORT.

Does this imply ifx uses the equivalent precision as ifort with -prec-div? .OR. not?

If not, is there a similar functioning option for ifx as -prec-div for ifort?

 

Jim Dempsey

0 Kudos
Ron_Green
Moderator
1,066 Views

Posted this to Fortran Discourse https://fortran-lang.discourse.group/t/strange-behavior-of-ifort/4846/47



I have some insight into this issue. The mathematics is actually reproducible. The insight came from changing the data in the 5x5, and simplifying the scaling to a simple division. Then using -qopt-report=5 and -S and examining the data in both hex and default float format

Here is my modified test

```

program test                                                             

                                                                   

implicit none                                                                                                     

integer :: row , col                                                                   

! Why these strange numbers? Because they come from a real project.                                  

real, parameter :: A(25) = &                                                     

  & [-3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &                         

  & -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &                         

  & -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20,&                        

  & -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &                         

  & -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20]                          

                                                                   

real :: S(5, 5)                                                            

                                                                   

S = reshape(A, [5, 5])                                                        

                                                                   

! The numbers in S are horrible. Let's scale it.                                           

!S = S / maxval(abs(S))                                                        

do col=1,5

 do row=1,5

  S(row,col) = S(row,col) / 1.2409463E+22

 end do

end do

write (*,*) "S after"

do row=1,5

 write(*,'(5(Z8,3x))') S(row,:)

end do

print *, ""

                                                                   

write (*, *) S                                                            

                                                                   

! Is S symmetric?                                                           

write (*, *)'Is S symmetric?', all(abs(S - transpose(S)) <= 0)                                             


end program test


```

Now just run a simple compilation

```

]$ ifort test3.f90

$ ./a.out

 S after

BCE543B9  BCE543B9  BCE543B9  BCE543B9  BCE543B9

BCE543B9  BCE543B9  BCE543B9  BCE543B9  BCE543B9

BCE543B9  BCE543B9  BCE543B9  BCE543B9  BCE543B9

BCE543B9  BCE543B9  BCE543B9  BCE543B9  BCE543B9

BCE543BA  BCE543BA  BCE543BA  BCE543BA  BCE543BA

 

 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02

 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02

 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02

 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02

 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02

 Is S symmetric? F


```

Do you see the pattern? It is stochastic. Not random.

Now, with inside knowledge I know that default, no -O option, gives us -O2 which vectorizes with SSE2, 128 bit registers. You can also verify this by using -dryrun:

```

$ ifort test3.f90 -dryrun |& grep -i sse

  -D__SSE__ \

  -D__SSE_MATH__ \

  -D__SSE2__ \

  -D__SSE2_MATH__ \

```

Here we are telling the backend it MAY use SSE2. Not demanding, but suggesting to the backend. You'll see the same define in the -prec-div case - but remember, it's a suggestion not a directive. It may use SSE2, that is all the define is doing. the backend will see -prec-div and decide how to process the loop nest.


 How many 32bit data items in the XMM registers? 4.  Go down the row, in memory order - see the pattern? at optimization, we're packing 4 elements into an XMM register and doing vectorized math. For the 5th element in each row we have a remainder that is not done in a packed vector register with vector instructions. Repeat this for each column. You can see this in the OPT REPORT.  

Here's our loop of interest:

```

   19 !S = S / maxval(abs(S))

   20 do col=1,5

   21  do row=1,5

   22  S(row,col) = S(row,col) / 1.2409463E+22

   23  end do

   24 end do


```

and the opt-report on that loop

```

LOOP BEGIN at test3.f90(20,1)

  remark #15542: loop was not vectorized: inner loop was already vectorized


  LOOP BEGIN at test3.f90(21,3)

   remark #15389: vectorization support: reference s(row,col) has unaligned access  [ test3.f90(22,4) ]

   remark #15389: vectorization support: reference s(row,col) has unaligned access  [ test3.f90(22,4) ]

   remark #15381: vectorization support: unaligned access used inside loop body

   remark #15305: vectorization support: vector length 4

   remark #15309: vectorization support: normalized vectorization overhead 0.158

   remark #15300: LOOP WAS VECTORIZED

   remark #15450: unmasked unaligned unit stride loads: 1

   remark #15451: unmasked unaligned unit stride stores: 1

   remark #15475: --- begin vector cost summary ---

   remark #15476: scalar cost: 30

   remark #15477: vector cost: 9.500

   remark #15478: estimated potential speedup: 2.020

   remark #15486: divides: 1

   remark #15488: --- end vector cost summary ---

   remark #25015: Estimate of max trip count of loop=1

  LOOP END


  LOOP BEGIN at test3.f90(21,3)

  <Remainder loop for vectorization>

   remark #25436: completely unrolled by 1

  LOOP END

LOOP END

```

Reading this, lots of good stuff. It's vectorizing the inner loop, the 5 elements in a column. But of course the XMM registers process 128 bits or 4 elements. So there is a leftover, which is done in differently that the other 4. Note also that we didn't use -align array32byte and tag the loop as aligned, which might help efficiency here. but that is just an additional optimization we could do manually, not necessary for this discussion.

 

In the original example, element (3,5) is in a vector mode. Element (5,3) is a remainder, not in a vector mode. Examining the assembly with both cases with and without -prec-div you can see the vector mode, specifically this:

<     rcpps   %xmm2, %xmm0                 #22.4

which is a reciprocal vector instruction of lesser precision. This is done for the first 4 elements. The remainder is done with a div. 

<     divss   %xmm1, %xmm3                 #22.4


whereas the code with -prec-div uses div everywhere

>     divps   %xmm1, %xmm2                 #22.4


So - predictable, reproducible results. 


Turning on prec-div OR you can also use -no-vec, same effect.



Zaikun
New Contributor I
1,041 Views

Thank you very much for this wonderful explanation. Everything makes sense now. So the behavior is a side effect of optimization, not a bug.

0 Kudos
andrew_4619
Honored Contributor III
1,021 Views

The combined FP operation and remainder FP operation was my expectation of what was happening, it is good to see that is confirmed now.  I guess the main point of note is that comparisons of real numbers to zero is not a robust way of coding and can bite back!

0 Kudos
Reply