(Update: There is an informative discussion on Fortran Discourse about this bug)
Consider the following code.
```
! test.f90
program test
implicit none
! Why these strange numbers? Because they come from a real project.
real, parameter :: A(25) = &
& [1.2409463E+22, 2.8192031E+20, -4.4971432E+21, 1.1558917E+22, -4.7725394E+21, &
& 2.8192031E+20, 2.8192031E+20, -3.2276437E+21, -3.3936284E+21, 4.1412172E+21, &
& -4.4971432E+21, -3.2276437E+21, -7.6997325E+20, -4.1627641E+21, -3.4729614E+20,&
& 1.1558917E+22, -3.3936284E+21, -4.1627641E+21, 2.5757004E+21, -2.4898961E+21, &
& -4.7725394E+21, 4.1412172E+21, -3.4729614E+20, -2.4898961E+21, 3.5042103E+21]
real :: S(5, 5)
S = reshape(A, [5, 5])
! Is S symmetric?
write (*, *) all(abs(S - transpose(S)) <= 0)
! The numbers in S are horrible. Let's scale it.
S = S / maxval(abs(S))
write (*, *) S
! Is S symmetric?
write (*, *) all(abs(S - transpose(S)) <= 0)
end program test
```
Here is what happens on my computer.
```
$ uname -a && ifort --version
Linux 5.15.0-53-generic #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
ifort (IFORT) 2021.7.1 20221019
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.
$ ifort test.f90 && ./a.out
 T
  0.9999999      2.2718171E-02 -0.3623963      0.9314599     -0.3845887
  2.2718171E-02  2.2718171E-02 -0.2600954     -0.2734710      0.3337145
 -0.3623963     -0.2600954     -6.2047265E-02 -0.3354508     -2.7986396E-02
  0.9314599     -0.2734710     -0.3354508      0.2075594     -0.2006450
 -0.3845887      0.3337145     -2.7986394E-02 -0.2006450      0.2823821
 F
```
Note that S(3,5) /= S(5,3) after the scaling. This happens when ifort is invoked with -O, -O2, -O3, -O4, or -O5, but not -g, -O0, -O1, or -Ofast.
Is this expected? It seems like a bug to me.
Posted this to Fortran Discourse https://fortran-lang.discourse.group/t/strange-behavior-of-ifort/4846/47
I have some insight into this issue. The mathematics is actually reproducible. The insight came from replacing the data in the 5x5 matrix with a single repeated value, simplifying the scaling to a plain division, and then using -qopt-report=5 and -S to examine the data in both hex and default float formats.
Here is my modified test
```
program test
implicit none
integer :: row , col
! Why these strange numbers? Because they come from a real project.
real, parameter :: A(25) = &
& [-3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &
& -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &
& -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20,&
& -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &
& -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20]
real :: S(5, 5)
S = reshape(A, [5, 5])
! The numbers in S are horrible. Let's scale it.
!S = S / maxval(abs(S))
do col=1,5
do row=1,5
S(row,col) = S(row,col) / 1.2409463E+22
end do
end do
write (*,*) "S after"
do row=1,5
write(*,'(5(Z8,3x))') S(row,:)
end do
print *, ""
write (*, *) S
! Is S symmetric?
write (*, *)'Is S symmetric?', all(abs(S - transpose(S)) <= 0)
end program test
```
Now just run a simple compilation
```
]$ ifort test3.f90
$ ./a.out
S after
BCE543B9 BCE543B9 BCE543B9 BCE543B9 BCE543B9
BCE543B9 BCE543B9 BCE543B9 BCE543B9 BCE543B9
BCE543B9 BCE543B9 BCE543B9 BCE543B9 BCE543B9
BCE543B9 BCE543B9 BCE543B9 BCE543B9 BCE543B9
BCE543BA BCE543BA BCE543BA BCE543BA BCE543BA
-2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
-2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
-2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
-2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
-2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
Is S symmetric? F
```
Do you see the pattern? It is systematic, not random.
Now, with some inside knowledge: the default, with no -O option, is -O2, which vectorizes for SSE2 with its 128-bit registers. You can verify this by using -dryrun:
```
$ ifort test3.f90 -dryrun |& grep -i sse
-D__SSE__ \
-D__SSE_MATH__ \
-D__SSE2__ \
-D__SSE2_MATH__ \
```
Here we are telling the backend it MAY use SSE2. This is a suggestion, not a directive; that is all the define does. You will see the same defines in the -prec-div case, but the backend will also see -prec-div and decide for itself how to process the loop nest.
How many 32-bit data items fit in an XMM register? Four. Go down each column, in memory order, and you see the pattern: under optimization, we pack 4 elements into an XMM register and do vectorized math. The 5th element of each column is a remainder that is not handled in a packed vector register with vector instructions. This repeats for every column, and you can see it in the OPT REPORT.
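To make the effect of a low-precision packed reciprocal concrete, here is a small NumPy sketch (Python, purely illustrative; the mantissa-truncation stand-in for rcpps and the single Newton-Raphson refinement are assumptions about what a code generator typically emits, not a dump of ifort's actual output):

```python
import numpy as np

# Crude stand-in for rcpps: the hardware returns a reciprocal estimate
# good to roughly 12 bits; we fake that by truncating the mantissa.
def rcp_estimate(y):
    r = np.float32(1.0) / y
    bits = r.view(np.uint32) & np.uint32(0xFFFFF000)  # keep ~11 explicit mantissa bits
    return bits.view(np.float32)

# Estimate refined by one Newton-Raphson step, roughly what an optimizer
# emits for the packed lanes instead of a true divide.
def vector_style_div(x, y):
    r = rcp_estimate(y)
    r = r * (np.float32(2.0) - y * r)  # relative error drops from ~2^-12 to ~2^-24
    return x * r

x = np.float32(-3.4729614e20)    # values from the test program
y = np.float32(1.2409463e22)
exact = x / y                    # what a scalar divss computes (correctly rounded)
approx = vector_style_div(x, y)  # may disagree with exact in the last ulp
print(exact, approx, exact == approx)
```

The refined reciprocal path is usually correct to within an ulp or two of the true quotient, which is exactly the size of the discrepancy seen in the hex dump above.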
Here's our loop of interest:
```
19 !S = S / maxval(abs(S))
20 do col=1,5
21 do row=1,5
22 S(row,col) = S(row,col) / 1.2409463E+22
23 end do
24 end do
```
and the opt-report on that loop
```
LOOP BEGIN at test3.f90(20,1)
remark #15542: loop was not vectorized: inner loop was already vectorized
LOOP BEGIN at test3.f90(21,3)
remark #15389: vectorization support: reference s(row,col) has unaligned access [ test3.f90(22,4) ]
remark #15389: vectorization support: reference s(row,col) has unaligned access [ test3.f90(22,4) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 4
remark #15309: vectorization support: normalized vectorization overhead 0.158
remark #15300: LOOP WAS VECTORIZED
remark #15450: unmasked unaligned unit stride loads: 1
remark #15451: unmasked unaligned unit stride stores: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 30
remark #15477: vector cost: 9.500
remark #15478: estimated potential speedup: 2.020
remark #15486: divides: 1
remark #15488: --- end vector cost summary ---
remark #25015: Estimate of max trip count of loop=1
LOOP END
LOOP BEGIN at test3.f90(21,3)
<Remainder loop for vectorization>
remark #25436: completely unrolled by 1
LOOP END
LOOP END
```
Reading this, there is a lot of good information. The compiler vectorizes the inner loop, the 5 elements of a column, but an XMM register holds 128 bits, i.e. 4 single-precision elements. So there is a leftover element, which is handled differently from the other 4. Note also that we did not use -align array32byte and tag the loop as aligned, which might improve efficiency here; that is an additional manual optimization, not necessary for this discussion.
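The packed-versus-remainder split described above can be sketched in a few lines (Python, illustrative only; a tiny model of which elements of a 5x5 column-major array land in the 4-wide packed lanes versus the scalar remainder):

```python
# For a 5x5 Fortran (column-major) array with a 4-wide float32 vector,
# the vectorized inner loop over row packs rows 1..4 of each column into
# one XMM register; row 5 of each column falls into the remainder loop.
def lane(row, col, n_lanes=4):
    return "packed" if row <= n_lanes else "remainder"

print(lane(3, 5), lane(5, 3))  # packed remainder
```

This is why a pair of symmetric entries can be computed by two different instruction sequences.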
In the original example, element (3,5) is computed in vector mode; element (5,3) is a remainder, computed in scalar mode. Comparing the assembly generated with and without -prec-div shows the difference. Without -prec-div, the packed part of the loop uses
```
rcpps %xmm2, %xmm0 #22.4
```
which is a reciprocal vector instruction of lesser precision; it handles the first 4 elements. The remainder is done with a true scalar divide:
```
divss %xmm1, %xmm3 #22.4
```
whereas the code compiled with -prec-div uses a divide everywhere, including the packed part:
```
divps %xmm1, %xmm2 #22.4
```
So the results are predictable and reproducible. Turning on -prec-div, or alternatively -no-vec, has the same corrective effect.
Dear Intel Team members,
Here is a comment regarding the -fp:fast compiler option with IFORT versus other options at the Fortran Discourse site that may be of interest here.
Intel Forum Admin: is this thread better suited for the Fortran forum since it appears a matter with IFORT product in HPC toolkit? If so, can you please move it to
https://community.intel.com/t5/Intel-Fortran-Compiler/bd-p/fortran-compiler
I'm reading the discussion on Fortran Discourse. I will investigate.
ron
This is an interesting little example for many reasons. Where to begin? Let's start with some simple things:
1. Intel compilers invoked without an explicit -O option default to -O2, and Intel's -O2 is aggressive.
2. On most compilers, optimization comes at the expense of precision. You have to balance your need for the highest FP precision against performance.
3. Division, among other mathematical expressions and intrinsics, may be done differently depending on the optimization level.
4. Every compiler does optimization differently. That does not mean one is 'right' and another 'wrong'; they just make different choices in balancing precision and speed.
Skip to the solution: for IFORT, use the
-prec-div
option to force precise divisions even under optimization. Read up on this option [HERE](https://www.intel.com/content/www/us/en/develop/documentation/fortran-compiler-oneapi-dev-guide-and-reference/top/compiler-reference/compiler-options/floating-point-options/prec-div-qprec-div.html). -fp-model is a macro option that sets a number of different FP behaviors; -fp-model precise triggers -prec-div among other behaviors.
You can read on if you want more details.
I'll start with #4. As noted, IFX and IFORT behave differently with this example. The optimization in our LLVM-based IFX is radically different from that in our legacy proprietary IFORT compiler, so you SHOULD expect possible FP differences between these two compilers under optimization. Note in the documentation that -prec-div is an option for IFORT only. Why? LLVM's FP behavior is different from that of our Classic proprietary compiler.
On point #1 above: when you encounter questions about 'what is the compiler doing?', there are easier ways than dumping assembly and reading it. First, use the
-dryrun
option. This will show you all internal defines and options passed to the compiler. It's quite helpful.
Next, with IFORT use the
-qopt-report=5
option and examine the .optrpt output.
On point #3: comparing opt reports generated with and without -prec-div at -O2, you can see that the code generator behaves differently; there is a difference in the number of temporary values ("Locals"). This was a clear indicator that the strategy the code generator uses for the division is quite different. Again, read about the -O2 and -prec-div options in the [Developer Guide](https://www.intel.com/content/www/us/en/develop/documentation/fortran-compiler-oneapi-dev-guide-and-reference/top/compiler-reference/compiler-options/alphabetical-option-list.html).
On point #1, the -O2 default: out of the box, Intel prioritizes performance over numerical precision when you do not add any -O options. This philosophy can be, and has been, argued in the community. Some ASSUME that without -O you would get -O0. IFORT has been around for many years and is widely adopted, and where possible we try to make IFX behave out of the box as closely to IFORT as possible, so there are no surprises for existing customers; hence -O2 is the default. But if you are coming from Flang or gfortran, this default may surprise you.
Thanks for showing us this example. It was quite interesting.
ron
#IAmIntel
>>Note in the documentation on -prec-div that this is only an option for IFORT.
Does this imply ifx uses the equivalent precision as ifort with -prec-div? .OR. not?
If not, is there a similar functioning option for ifx as -prec-div for ifort?
Jim Dempsey
Thank you very much for this wonderful explanation. Everything makes sense now. So the behavior is a side effect of optimization, not a bug.
The combined packed FP operation plus remainder FP operation was my expectation of what was happening; it is good to see that confirmed now. I guess the main point of note is that comparing real numbers for exact equality (here, testing a difference against zero) is not a robust way of coding and can bite back!
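A tolerance-based check avoids this trap. Here is a minimal NumPy sketch (Python for illustration; the 1e-6 relative tolerance is an arbitrary choice you would tune per application):

```python
import numpy as np

def is_symmetric(S, rel_tol=1e-6):
    # Compare against a tolerance scaled by the matrix magnitude instead
    # of demanding bit-exact agreement between S and its transpose.
    scale = np.max(np.abs(S))
    return bool(np.all(np.abs(S - S.T) <= rel_tol * scale))

# A matrix that is symmetric up to one float32 ulp, like the ifort output.
S = np.array([[1.0, 2.0], [2.0 + 2e-7, 1.0]], dtype=np.float32)
print(bool(np.all(np.abs(S - S.T) <= 0)), is_symmetric(S))  # False True
```

The exact test (`abs(S - transpose(S)) <= 0`) rejects the one-ulp asymmetry, while the scaled tolerance accepts it as symmetric for practical purposes.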