Solved: Problem with optimisation flag O3 and ifort

amcc1996 · ‎09-10-2025

Dear community,

I am testing a small code with ifort 2021.6.0, and it shows strange behaviour when compiled with the -O3 flag.

The code essentially performs matrix multiplication with explicit loops and accumulates the result in another matrix, starting from two random matrices.

When I compile with ifort -O3 main.f90 and run the program, the final matrix c is all zeros, which is wrong. Compiling with -O2 gives the expected result.

The terminal output I get on my machine is:

a
3.920868194323862E-007 2.548044275764261E-002 0.352516161261067
0.666914481524251 0.963055531894656 0.838288203465982
0.335355043646496 0.915327203368213 0.795863676652503
b
0.832693143644796 0.345042693116063 0.871183932316783
8.991835668825542E-002 0.888283839684037 0.700978902440147
0.734552583860683 0.300175817923128 4.971772349719251E-002
c
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000

Somehow, if I change the loop order from jki to any other permutation, it runs correctly.

Lastly, I have also tried with ifx 2025.2.0 and it runs normally.

Does anyone have an idea what is going on?

program main

implicit none

integer :: nrows, ncols, nintermediate, nrepeat, i
real(8), allocatable, dimension(:,:) :: a, b, c

nrows = 3
ncols = 3
nintermediate = 3
nrepeat = 1000000

allocate(a(nrows, nintermediate))
allocate(b(nintermediate, ncols))
allocate(c(nrows, ncols), source = 0.0d0)

call random_number(a)
call random_number(b)

c = 0.0d0
do i = 1, nrepeat
  call multiply_add_jki_loop(a, b, c)
end do
write(*,*) "a"
write(*,*) a
write(*,*) "b"
write(*,*) b
write(*,*) "c"
write(*,*) c

contains

  subroutine multiply_add_jki_loop(a, b, c)
    real(8), dimension(:,:), intent(in) :: a, b
    real(8), dimension(:,:), intent(inout) :: c
    integer :: i, j, k
    do j = 1, size(c, 2)
      do k = 1, size(a, 2)
        do i = 1, size(c, 1)
          c(i, j) = c(i, j) + a(i, k) * b(k, j)
        end do
      end do
    end do
  end subroutine multiply_add_jki_loop

end program main

Mentzer__Stuart · ‎09-10-2025

Good catch! This looks very much like -O3 is "optimizing" out the whole loop:

If you print the c matrix at the end of every repeat iteration, you get the correct result.
The -O3 run time does not change with the value of nrepeat.

Unfortunately, we aren't going to get any more ifort fixes/releases.

And some of us need to keep using ifort due to ifx bugs or for 32-bit builds.
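
The run-time observation above can be checked with a small timing harness. This is a minimal sketch, assuming the same subroutine as in the original post. The program name time_repeat and the three trial sizes are illustrative choices, not from the thread, and nothing here confirms that this variant still triggers the bug. If the loop is being dropped, the elapsed time should stay roughly flat as nrepeat grows.

```fortran
program time_repeat
  implicit none
  integer, parameter :: n = 3
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: trial, nrepeat, r
  integer(8) :: t0, t1, rate

  allocate(a(n, n), b(n, n), c(n, n))
  call random_number(a)
  call random_number(b)
  call system_clock(count_rate = rate)

  nrepeat = 1000000
  do trial = 1, 3
    c = 0.0d0
    call system_clock(t0)
    do r = 1, nrepeat
      call multiply_add_jki_loop(a, b, c)
    end do
    call system_clock(t1)
    ! Print the repeat count, elapsed seconds and one element of c.
    write(*,*) nrepeat, real(t1 - t0, 8) / real(rate, 8), c(1, 1)
    nrepeat = nrepeat * 10
  end do

contains

  ! Same jki loop nest as in the original post.
  subroutine multiply_add_jki_loop(a, b, c)
    real(8), dimension(:,:), intent(in) :: a, b
    real(8), dimension(:,:), intent(inout) :: c
    integer :: i, j, k
    do j = 1, size(c, 2)
      do k = 1, size(a, 2)
        do i = 1, size(c, 1)
          c(i, j) = c(i, j) + a(i, k) * b(k, j)
        end do
      end do
    end do
  end subroutine multiply_add_jki_loop

end program time_repeat
```

In a correct build, the elapsed time should grow roughly tenfold from one trial to the next.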

amcc1996 · ‎09-11-2025

I understand. In any case, I wanted to document this for future reference, in case anyone needs it.

Thank you for the answer.

amcc1996 · ‎09-11-2025

I suspected it would be something around that, but I was far from sure.

Thank you for the answer and recommendation.

Mentzer__Stuart · ‎09-11-2025

Disabling vectorization (-no-vec or /Qvec-) isn't a work-around, but I was surprised to find that using any floating point model other than fast is. I naively assumed that the logic to decide whether or not a loop can be omitted would be independent of the floating point model. I guess it is more complex than that under the hood, and the branch used for the fast floating point model is separate and has this bug. Fun stuff!
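
To compare builds without reading the matrices by eye, the reproducer can check itself against a matmul reference. This is a minimal sketch. The program name check_jki, the 1.0d-8 tolerance, and the choice of -fp-model precise as the non-fast model are assumptions for illustration; the thread only says that any model other than fast works. Whether this variant still triggers the bug at plain -O3 has not been confirmed here.

```fortran
! Suggested builds (the second uses a non-fast floating point model):
!   ifort -O3 check_jki.f90
!   ifort -O3 -fp-model precise check_jki.f90
program check_jki
  implicit none
  integer, parameter :: n = 3, nrepeat = 1000000
  real(8), allocatable :: a(:,:), b(:,:), c(:,:), expected(:,:)
  real(8) :: rel_err
  integer :: r

  allocate(a(n, n), b(n, n), c(n, n), expected(n, n))
  call random_number(a)
  call random_number(b)

  c = 0.0d0
  do r = 1, nrepeat
    call multiply_add_jki_loop(a, b, c)
  end do

  ! Each call adds matmul(a, b) to c, so c should be close to
  ! nrepeat * matmul(a, b), up to accumulated rounding error.
  expected = real(nrepeat, 8) * matmul(a, b)
  rel_err = maxval(abs(c - expected)) / maxval(abs(expected))

  write(*,*) "max relative error:", rel_err
  if (rel_err > 1.0d-8) then
    write(*,*) "FAIL: accumulated result does not match the reference"
    error stop 1
  end if
  write(*,*) "OK"

contains

  ! Same jki loop nest as in the original post.
  subroutine multiply_add_jki_loop(a, b, c)
    real(8), dimension(:,:), intent(in) :: a, b
    real(8), dimension(:,:), intent(inout) :: c
    integer :: i, j, k
    do j = 1, size(c, 2)
      do k = 1, size(a, 2)
        do i = 1, size(c, 1)
          c(i, j) = c(i, j) + a(i, k) * b(k, j)
        end do
      end do
    end do
  end subroutine multiply_add_jki_loop

end program check_jki
```

An all-zero c gives a relative error of 1, so the failing case stops with a non-zero exit status instead of silently printing zeros.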