Performance of `ifx -fast` with loop-invariant `if` conditions inside loops

pedro · 02-10-2025

Hello,

I am trying to understand the consequences of replacing CPP macros with loop-invariant `if` conditions. Typically, with optimization on, I see no performance difference in most compilers I have checked, including `ifx` with less aggressive optimization. The exception is `ifx` with the `-fast` option.

Consider the following two programs:

program test_cpp
  implicit none
  integer, parameter :: nx = 500, ny = 500, nz = 500
  integer :: i, j, k
  real :: field(nx, ny, nz)
  real :: diff1, diff2, res
  integer :: iunit
  real(8) :: t2,t1
  !
  ! initialize data
  !
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        field(i, j, k) = i + j*10. + k*100.
      end do
    end do
  end do
  !
  ! run kernel
  !
  call cpu_time(t1)
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        diff1 = field(i, j, k) + 1.
        diff2 = field(i, j, k) - 1.
#if defined(_OPTION_1)
#if defined(_OPTION_1_PLUS)
        res = diff1 + diff2
#else
        res = diff1 * diff2
#endif
#else
        res = diff1 - diff2
#endif
        field(i, j, k) = res
      end do
    end do
  end do
  call cpu_time(t2)
  print*,'Elapsed time: ', t2-t1
  !
  ! save data
  !
  open(newunit=iunit, file="output.bin", form="unformatted", access="stream", status="replace")
  write(iunit) field
  close(iunit)
  print*, 'field(3,3,3) = ', field(3,3,3)
end program test_cpp

and:

program test_nocpp
  implicit none
  integer, parameter :: nx = 500, ny = 500, nz = 500
  integer :: i, j, k
  real :: field(nx, ny, nz)
  real :: diff1, diff2, res
  integer :: iunit
  integer :: option1, option1_plus
  character(len=32) :: arg
  real(8) :: t2,t1
  !
  ! get options from the command line
  !
  if (command_argument_count() >= 1) then
    call get_command_argument(1, arg)
    read(arg, *) option1
  else
    option1 = 0
  endif
  if (command_argument_count() >= 2) then
    call get_command_argument(2, arg)
    read(arg, *) option1_plus
  else
    option1_plus = 0
  endif
  !
  ! initialize data
  !
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        field(i, j, k) = i + j*10. + k*100.
      end do
    end do
  end do
  !
  ! run kernel
  !
  call cpu_time(t1)
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        diff1 = field(i, j, k) + 1.
        diff2 = field(i, j, k) - 1.
        if (option1 == 1) then
          if (option1_plus == 1) then
            res = diff1 + diff2
          else
            res = diff1 * diff2
          endif
        else
          res = diff1 - diff2
        endif
        field(i, j, k) = res
      end do
    end do
  end do
  call cpu_time(t2)
  print*,'Elapsed time: ', t2-t1
  !
  ! save data
  !
  open(newunit=iunit, file="output.bin", form="unformatted", access="stream", status="replace")
  write(iunit) field
  close(iunit)
  print*, 'field(3,3,3) = ', field(3,3,3)
end program test_nocpp

I do not get a performance difference when compiling with `-O2` or `-O3`, but I do with `-fast`:

$ ifx -fast test_nocpp.f90 -o test_nocpp && ./test_nocpp 1 1 && ifx -fast -cpp -D_OPTION_1 -D_OPTION_1_PLUS test_cpp.f90 -o test_cpp && ./test_cpp
ld: /home/x/software/intel/oneapi/compiler/2025.0/lib/libifcoremt.a(for_close_proc.o): in function `for__close_proc':
for_close_proc.c:(.text+0x1c1): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
Elapsed time: 5.109700000000000E-002
field(3,3,3) = 666.0000
ld: /home/x/software/intel/oneapi/compiler/2025.0/lib/libifcoremt.a(for_close_proc.o): in function `for__close_proc':
for_close_proc.c:(.text+0x1c1): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
Elapsed time: 4.904699999999998E-002
field(3,3,3) = 666.0000

$ ifx -O3 test_nocpp.f90 -o test_nocpp && ./test_nocpp 1 1 && ifx -O3 -cpp -D_OPTION_1 -D_OPTION_1_PLUS test_cpp.f90 -o test_cpp && ./test_cpp
Elapsed time: 4.818600000000001E-002
field(3,3,3) = 666.0000
Elapsed time: 4.819499999999999E-002
field(3,3,3) = 666.0000

Why isn't `-fast` able to optimize the code with loop-invariant `if` conditions as efficiently as the code with CPP macros? Is there a flag I can pass to recover the performance of `test_cpp.f90`?

Thanks!
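
For reference, the transformation one would hope the optimizer applies to `test_nocpp` is loop unswitching: the invariant condition is tested once, and a branch-free copy of the loop nest runs for each case. Below is a hand-written sketch of that idea, assuming the same kernel as above. The program name `test_unswitched`, the `kernel` subroutine, the reduced 64x64x64 grid, and the hard-coded option values are illustrative choices and not part of the original programs. Whether this changes the timings under `-fast` has not been measured here.

```fortran
program test_unswitched
  implicit none
  ! illustrative grid size (assumption): smaller than the 500^3 used above
  integer, parameter :: nx = 64, ny = 64, nz = 64
  integer :: i, j, k
  integer :: option1, option1_plus
  real :: field(nx, ny, nz)
  !
  ! options are hard-coded here (assumption); test_nocpp reads them
  ! from the command line
  !
  option1 = 1
  option1_plus = 1
  !
  ! initialize data
  !
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        field(i, j, k) = i + j*10. + k*100.
      end do
    end do
  end do
  !
  ! run kernel
  !
  call kernel(field, option1, option1_plus)
  print*, 'field(3,3,3) = ', field(3,3,3)
contains
  subroutine kernel(f, opt1, opt1_plus)
    real, intent(inout) :: f(:, :, :)
    integer, intent(in) :: opt1, opt1_plus
    integer :: i, j, k
    real :: diff1, diff2
    !
    ! the invariant tests are evaluated once, outside the loop nest;
    ! each branch holds its own branch-free copy of the loops
    !
    if (opt1 == 1) then
      if (opt1_plus == 1) then
        do k = 1, nz
          do j = 1, ny
            do i = 1, nx
              diff1 = f(i, j, k) + 1.
              diff2 = f(i, j, k) - 1.
              f(i, j, k) = diff1 + diff2
            end do
          end do
        end do
      else
        do k = 1, nz
          do j = 1, ny
            do i = 1, nx
              diff1 = f(i, j, k) + 1.
              diff2 = f(i, j, k) - 1.
              f(i, j, k) = diff1 * diff2
            end do
          end do
        end do
      endif
    else
      do k = 1, nz
        do j = 1, ny
          do i = 1, nx
            diff1 = f(i, j, k) + 1.
            diff2 = f(i, j, k) - 1.
            f(i, j, k) = diff1 - diff2
          end do
        end do
      end do
    endif
  end subroutine kernel
end program test_unswitched
```

With both options set to 1, `field(3,3,3)` starts at 333 and should come out as 666, matching the runs shown in this post. The cost of doing this by hand is three copies of the loop nest, which is exactly the duplication the `if`-based version was meant to avoid.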

pedro · 02-11-2025

To add to this, I found in the compiler reference that `-fast` is equivalent to `-ipo -O3 -static -fp-model fast`.

However, compiling with `-ipo -O3 -static -fp-model fast`, I get the same performance for both implementations:

$ ifx -ipo -O3 -static -fp-model fast test_nocpp.f90 -o test_nocpp && ./test_nocpp 1 1 && ifx -ipo -O3 -static -fp-model fast -cpp -D_OPTION_1 -D_OPTION_1_PLUS test_cpp.f90 -o test_cpp && ./test_cpp
ld: /home/x/software/intel/oneapi/compiler/2025.0/lib/libifcoremt.a(for_close_proc.o): in function `for__close_proc':
for_close_proc.c:(.text+0x1c1): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
 Elapsed time:   4.875399999999999E-002
 field(3,3,3) =    666.0000
ld: /home/x/software/intel/oneapi/compiler/2025.0/lib/libifcoremt.a(for_close_proc.o): in function `for__close_proc':
for_close_proc.c:(.text+0x1c1): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
 Elapsed time:   4.877200000000001E-002
 field(3,3,3) =    666.0000

I thought `-fast` was short for `-ipo -O3 -static -fp-model fast`, but I see quite a difference in performance between the two.

I am using ifx 2025.0.4 20241205.

It would be nice to get some clarity on this. Thanks in advance!

Ron_Green · 02-11-2025

Several comments here that may help.

First, you don't need to use cpp. You can use fpp. If you name the file with the extension .F90, fpp is invoked by default. fpp parses the #if defines just like cpp. You can confirm this by using the -keep option and checking the output .i90 file for the preprocessed source.

Example:

ifx -O2 -D_OPTION_1 -D_OPTION_1_PLUS test_fpp.F90 -keep

more test_fpp.i90

! run kernel
!
  call cpu_time(t1)
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        diff1 = field(i, j, k) + 1.
        diff2 = field(i, j, k) - 1.


        res = diff1 + diff2
# 33

# 36

        field(i, j, k) = res
      end do
    end do
  end do
  call cpu_time(t2)

Now, on to the performance questions.

1) You can use -qopt-report=3 to generate optimization reports and compare versions, options, etc. Compile with the

-qopt-report=3

option, then look for the <file>.optrpt. Compare the opt reports for different options and defines.

2) You can dump the assembly language and remove any doubt about the generated code:

ifx -O2 -D_OPTION_1 -D_OPTION_1_PLUS test_fpp.F90 -S

Look for test_fpp.s for the assembly version of your code.

3) You can look at all defines and options the compiler uses with option

-#

or

-dryrun

4) Options and whatnot: ifx uses default LLVM optimizations for -O1 and -O2. To trigger additional optimizations you can add

-xhost

if you have a genuine Intel processor. This kicks in additional optimizations. -ipo will help with that also, as will -flto.

I think this will give you the tools you need for your analysis.

pedro · 02-12-2025

Thanks!

I wonder if you have any idea why `-fast` performs worse than `-ipo -O3 -static -fp-model fast`. From the docs, I read that they're equivalent...

Pedro

Steve_Lionel · 02-13-2025

It's not documented, but -fast also implies -xHost. Perhaps the advanced instructions are not optimal for your application. (This really should be documented!)

pedro · 02-14-2025

Ha... Indeed, I think this was also `ifort`'s behavior, so it makes sense. Thank you, Steve!