Intel® Fortran Compiler

Problem with optimisation flag O3 and ifort

amcc1996

Dear community,

 

I am testing a small program with ifort 2021.6.0 that behaves strangely when compiled with -O3.


The code essentially performs matrix multiplication with explicit loops and accumulates the result in another matrix, starting from two random matrices.

 

When I compile with ifort -O3 main.f90 and run the program, the final matrix c is all zeros, which is wrong. Compiling with -O2 gives the expected result.

 

The terminal output I get on my machine is:

a
3.920868194323862E-007 2.548044275764261E-002 0.352516161261067
0.666914481524251 0.963055531894656 0.838288203465982
0.335355043646496 0.915327203368213 0.795863676652503
b
0.832693143644796 0.345042693116063 0.871183932316783
8.991835668825542E-002 0.888283839684037 0.700978902440147
0.734552583860683 0.300175817923128 4.971772349719251E-002
c
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000

 

Oddly, if I change the loop order from jki to any other permutation, the program runs correctly.
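
To make "another permutation" concrete, here is a minimal sketch that puts the jki routine from the program in this post next to an ijk variant. The name multiply_add_ijk_loop and the small integer-valued test matrices are illustrative assumptions, not taken from the original program. It only shows that both orders compute the same update; it is not claimed to trigger the -O3 problem by itself.

```fortran
program loop_order_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 3
  real(dp) :: a(n, n), b(n, n), c_jki(n, n), c_ijk(n, n)
  integer :: i, j

  ! Small integer-valued inputs (an assumption of this sketch) so that every
  ! product and sum is exact and the results can be compared with ==.
  do j = 1, n
    do i = 1, n
      a(i, j) = real(i + n * (j - 1), dp)
      b(i, j) = real(10 - (i + n * (j - 1)), dp)
    end do
  end do

  c_jki = 0.0_dp
  c_ijk = 0.0_dp
  call multiply_add_jki_loop(a, b, c_jki)
  call multiply_add_ijk_loop(a, b, c_ijk)

  if (any(c_jki /= matmul(a, b)) .or. any(c_ijk /= matmul(a, b))) then
    error stop "loop orders disagree with matmul"
  end if
  write(*, '(a)') "loop orders agree"

contains

  ! Loop order from the original post: j outermost, i innermost.
  subroutine multiply_add_jki_loop(a, b, c)
    real(dp), dimension(:,:), intent(in) :: a, b
    real(dp), dimension(:,:), intent(inout) :: c
    integer :: i, j, k
    do j = 1, size(c, 2)
      do k = 1, size(a, 2)
        do i = 1, size(c, 1)
          c(i, j) = c(i, j) + a(i, k) * b(k, j)
        end do
      end do
    end do
  end subroutine multiply_add_jki_loop

  ! Hypothetical ijk variant: same arithmetic, i outermost, k innermost.
  subroutine multiply_add_ijk_loop(a, b, c)
    real(dp), dimension(:,:), intent(in) :: a, b
    real(dp), dimension(:,:), intent(inout) :: c
    integer :: i, j, k
    do i = 1, size(c, 1)
      do j = 1, size(c, 2)
        do k = 1, size(a, 2)
          c(i, j) = c(i, j) + a(i, k) * b(k, j)
        end do
      end do
    end do
  end subroutine multiply_add_ijk_loop

end program loop_order_sketch
```

The jki order keeps the innermost loop running down a column of c and a, which matches Fortran's column-major storage; the ijk order strides across rows instead.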

 

Lastly, I also tried ifx 2025.2.0, and the program runs normally with it.

 

Does anyone have an idea what is going on?

 

program main

implicit none

integer :: nrows, ncols, nintermediate, nrepeat, i
real(8), allocatable, dimension(:,:) :: a, b, c

nrows = 3
ncols = 3
nintermediate = 3
nrepeat = 1000000

allocate(a(nrows, nintermediate))
allocate(b(nintermediate, ncols))
allocate(c(nrows, ncols), source = 0.0d0)

call random_number(a)
call random_number(b)

c = 0.0d0
do i = 1, nrepeat
  call multiply_add_jki_loop(a, b, c)
end do
write(*,*) "a"
write(*,*) a
write(*,*) "b"
write(*,*) b
write(*,*) "c"
write(*,*) c

contains

  subroutine multiply_add_jki_loop(a, b, c)
    real(8), dimension(:,:), intent(in) :: a, b
    real(8), dimension(:,:), intent(inout) :: c
    integer :: i, j, k
    do j = 1, size(c, 2)
      do k = 1, size(a, 2)
        do i = 1, size(c, 1)
          c(i, j) = c(i, j) + a(i, k) * b(k, j)
        end do
      end do
    end do
  end subroutine multiply_add_jki_loop

end program main
1 Solution
Mentzer__Stuart

Good catch! This looks very much like -O3 is "optimizing" out the whole loop:

  • If you print the c matrix at the end of every iteration of the repeat loop, you get the correct result.
  • The -O3 run time is unaffected by the value of nrepeat.
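
The two observations above can be checked without printing inside the loop. The following is a rough sketch under some assumptions: the program name, the tolerance of 1.0e-6, and the choice of two nrepeat values are illustrative, and the reference value nrepeat * matmul(a, b) only agrees with the accumulated sum up to rounding. If the loop is dropped, c stays at zero, the relative error is 1, and the reported time does not grow with nrepeat.

```fortran
program repeat_loop_check
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 3
  real(dp), allocatable :: a(:,:), b(:,:), c(:,:), ref(:,:)
  real(dp) :: rel_err
  integer :: i, trial, nrepeat
  integer(int64) :: t0, t1, rate
  logical :: ok

  allocate(a(n, n), b(n, n), c(n, n), ref(n, n))
  call random_number(a)
  call random_number(b)

  ok = .true.
  nrepeat = 1000000
  do trial = 1, 2
    c = 0.0_dp
    call system_clock(t0, rate)
    do i = 1, nrepeat
      call multiply_add_jki_loop(a, b, c)
    end do
    call system_clock(t1)

    ! Reference: each call adds matmul(a, b) once, up to rounding.
    ref = real(nrepeat, dp) * matmul(a, b)
    rel_err = maxval(abs(c - ref)) / maxval(abs(ref))
    write(*, '(a,i0,a,f0.3,a,es10.3)') "nrepeat=", nrepeat, &
      " seconds=", real(t1 - t0, dp) / real(rate, dp), " rel_err=", rel_err
    if (rel_err > 1.0e-6_dp) ok = .false.

    ! Ten times more work: the time should grow if the loop really runs.
    nrepeat = nrepeat * 10
  end do

  if (.not. ok) error stop "c does not match nrepeat * matmul(a, b)"
  write(*, '(a)') "c matches the reference"

contains

  ! Same kernel as in the original post.
  subroutine multiply_add_jki_loop(a, b, c)
    real(dp), dimension(:,:), intent(in) :: a, b
    real(dp), dimension(:,:), intent(inout) :: c
    integer :: i, j, k
    do j = 1, size(c, 2)
      do k = 1, size(a, 2)
        do i = 1, size(c, 1)
          c(i, j) = c(i, j) + a(i, k) * b(k, j)
        end do
      end do
    end do
  end subroutine multiply_add_jki_loop

end program repeat_loop_check
```

Because this sketch restructures the driver, it may or may not reproduce the behaviour seen with the original main.f90; it is meant as a template for the check, not as a confirmed reproducer.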

Unfortunately, we aren't going to get any more ifort fixes/releases.

And some of us need to keep using ifort due to ifx bugs or for 32-bit builds.

amcc1996

I understand. In any case, I wanted to leave this documented for future reference, in case anyone needs it.

Thank you for the answer.

amcc1996

I suspected it would be something along those lines, but I was far from sure.

Thank you for the answer and recommendation.

Mentzer__Stuart

Disabling vectorization (-no-vec or /Qvec-) isn't a work-around, but I was surprised to find that using any floating-point model other than fast is one. I naively assumed that the logic that decides whether a loop can be omitted would be independent of the floating-point model. I guess it is more complex than that under the hood, and the code path used for the fast floating-point model is separate and has this bug. Fun stuff!
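
When comparing floating-point models, it can help to have the binary report how it was built next to the result. The sketch below uses the standard compiler_version and compiler_options functions from iso_fortran_env for that. The build lines in the comments are examples I am assuming (for instance -fp-model precise as one model other than fast); they are not commands quoted from this thread, and the program name is made up.

```fortran
program report_build_and_result
  use, intrinsic :: iso_fortran_env, only: compiler_version, compiler_options
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 3, nrepeat = 1000000
  real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: i

  ! Example build lines (assumptions, adjust to your setup):
  !   ifort -O3 main.f90                     default (fast) floating-point model
  !   ifort -O3 -fp-model precise main.f90   a model other than fast
  write(*, '(2a)') "compiler: ", compiler_version()
  write(*, '(2a)') "options:  ", compiler_options()

  allocate(a(n, n), b(n, n))
  allocate(c(n, n), source = 0.0_dp)
  call random_number(a)
  call random_number(b)

  do i = 1, nrepeat
    call multiply_add_jki_loop(a, b, c)
  end do

  ! With random inputs, an all-zero c means the accumulation never happened.
  if (all(c == 0.0_dp)) then
    write(*, '(a)') "c is all zeros: the accumulation loop appears to have been dropped"
  else
    write(*, '(a)') "c is non-zero: the accumulation loop ran"
  end if

contains

  ! Same kernel as in the original post.
  subroutine multiply_add_jki_loop(a, b, c)
    real(dp), dimension(:,:), intent(in) :: a, b
    real(dp), dimension(:,:), intent(inout) :: c
    integer :: i, j, k
    do j = 1, size(c, 2)
      do k = 1, size(a, 2)
        do i = 1, size(c, 1)
          c(i, j) = c(i, j) + a(i, k) * b(k, j)
        end do
      end do
    end do
  end subroutine multiply_add_jki_loop

end program report_build_and_result
```

Keeping the options string in the output makes it easier to tell later which run used which floating-point model.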
