Hello,
I am trying to understand the consequences of replacing CPP macros with loop-invariant `if` conditions. Typically, with optimization on, I see no performance difference in most compilers I have checked, including `ifx` with less aggressive optimization. The exception is `ifx` with the `-fast` option.
Consider the following two programs:
program test_cpp
  implicit none
  integer, parameter :: nx = 500, ny = 500, nz = 500
  integer :: i, j, k
  real :: field(nx, ny, nz)
  real :: diff1, diff2, res
  integer :: iunit
  real(8) :: t2,t1
  !
  ! initialize data
  !
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        field(i, j, k) = i + j*10. + k*100.
      end do
    end do
  end do
  !
  ! run kernel
  !
  call cpu_time(t1)
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        diff1 = field(i, j, k) + 1.
        diff2 = field(i, j, k) - 1.
#if defined(_OPTION_1)
#if defined(_OPTION_1_PLUS)
        res = diff1 + diff2
#else
        res = diff1 * diff2
#endif
#else
        res = diff1 - diff2
#endif
        field(i, j, k) = res
      end do
    end do
  end do
  call cpu_time(t2)
  print*,'Elapsed time: ', t2-t1
  !
  ! save data
  !
  open(newunit=iunit, file="output.bin", form="unformatted", access="stream", status="replace")
  write(iunit) field
  close(iunit)
  print*, 'field(3,3,3) = ', field(3,3,3)
end program test_cpp
and:
program test_nocpp
  implicit none
  integer, parameter :: nx = 500, ny = 500, nz = 500
  integer :: i, j, k
  real :: field(nx, ny, nz)
  real :: diff1, diff2, res
  integer :: iunit
  integer :: option1, option1_plus
  character(len=32) :: arg
  real(8) :: t2,t1
  !
  ! get options from the command line
  !
  if (command_argument_count() >= 1) then
    call get_command_argument(1, arg)
    read(arg, *) option1
  else
    option1 = 0
  endif
  if (command_argument_count() >= 2) then
    call get_command_argument(2, arg)
    read(arg, *) option1_plus
  else
    option1_plus = 0
  endif
  !
  ! initialize data
  !
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        field(i, j, k) = i + j*10. + k*100.
      end do
    end do
  end do
  !
  ! run kernel
  !
  call cpu_time(t1)
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        diff1 = field(i, j, k) + 1.
        diff2 = field(i, j, k) - 1.
        if (option1 == 1) then
          if (option1_plus == 1) then
            res = diff1 + diff2
          else
            res = diff1 * diff2
          endif
        else
          res = diff1 - diff2
        endif
        field(i, j, k) = res
      end do
    end do
  end do
  call cpu_time(t2)
  print*,'Elapsed time: ', t2-t1
  !
  ! save data
  !
  open(newunit=iunit, file="output.bin", form="unformatted", access="stream", status="replace")
  write(iunit) field
  close(iunit)
  print*, 'field(3,3,3) = ', field(3,3,3)
end program test_nocpp
I do not get a performance difference when compiling with `-O2` or `-O3`, but I do with `-fast`:
$ ifx -fast test_nocpp.f90 -o test_nocpp && ./test_nocpp 1 1 && ifx -fast -cpp -D_OPTION_1 -D_OPTION_1_PLUS test_cpp.f90 -o test_cpp && ./test_cpp
ld: /home/x/software/intel/oneapi/compiler/2025.0/lib/libifcoremt.a(for_close_proc.o): in function `for__close_proc':
for_close_proc.c:(.text+0x1c1): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
Elapsed time: 5.109700000000000E-002
field(3,3,3) = 666.0000
ld: /home/x/software/intel/oneapi/compiler/2025.0/lib/libifcoremt.a(for_close_proc.o): in function `for__close_proc':
for_close_proc.c:(.text+0x1c1): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
Elapsed time: 4.904699999999998E-002
field(3,3,3) = 666.0000
$ ifx -O3 test_nocpp.f90 -o test_nocpp && ./test_nocpp 1 1 && ifx -O3 -cpp -D_OPTION_1 -D_OPTION_1_PLUS test_cpp.f90 -o test_cpp && ./test_cpp
Elapsed time: 4.818600000000001E-002
field(3,3,3) = 666.0000
Elapsed time: 4.819499999999999E-002
field(3,3,3) = 666.0000
Why isn't `-fast` able to optimize the code with loop-invariant `if` conditions as efficiently as the code with CPP macros? Is there a flag I can pass to recover the performance of `test_cpp.f90`?
Thanks!
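For readers comparing the two variants, the transformation the compiler is being asked to perform on `test_nocpp` is usually called loop unswitching: the loop-invariant test is hoisted out of the loop nest so that each branch gets its own branch-free loop. The following is only a minimal sketch of doing that by hand, assuming the same three formulas as above; the names `kernel_branchy` and `kernel_unswitched` and the small array size are illustrative and not part of the original programs.

```fortran
program unswitch_sketch
  implicit none
  integer, parameter :: n = 64   ! small illustrative size, not the 500^3 case
  real, allocatable :: a(:,:,:), b(:,:,:)
  integer :: i, j, k

  allocate(a(n,n,n), b(n,n,n))
  do k = 1, n
    do j = 1, n
      do i = 1, n
        a(i,j,k) = i + j*10. + k*100.
      end do
    end do
  end do
  b = a

  call kernel_branchy(a, 1, 1)      ! tests the options on every iteration
  call kernel_unswitched(b, 1, 1)   ! tests the options once, outside the loops

  ! For this input and option pair all values are small integers,
  ! so an exact comparison is safe.
  if (any(a /= b)) error stop 'kernels disagree'
  print *, 'kernels agree, field(3,3,3) = ', b(3,3,3)

contains

  ! Same structure as the kernel in test_nocpp: invariant ifs inside the loops.
  subroutine kernel_branchy(field, option1, option1_plus)
    real, intent(inout) :: field(:,:,:)
    integer, intent(in) :: option1, option1_plus
    real :: diff1, diff2, res
    integer :: i, j, k
    do k = 1, size(field, 3)
      do j = 1, size(field, 2)
        do i = 1, size(field, 1)
          diff1 = field(i,j,k) + 1.
          diff2 = field(i,j,k) - 1.
          if (option1 == 1) then
            if (option1_plus == 1) then
              res = diff1 + diff2
            else
              res = diff1 * diff2
            end if
          else
            res = diff1 - diff2
          end if
          field(i,j,k) = res
        end do
      end do
    end do
  end subroutine kernel_branchy

  ! Hand-unswitched version: one branch-free array statement per option case.
  subroutine kernel_unswitched(field, option1, option1_plus)
    real, intent(inout) :: field(:,:,:)
    integer, intent(in) :: option1, option1_plus
    if (option1 == 1 .and. option1_plus == 1) then
      field = (field + 1.) + (field - 1.)
    else if (option1 == 1) then
      field = (field + 1.) * (field - 1.)
    else
      field = (field + 1.) - (field - 1.)
    end if
  end subroutine kernel_unswitched

end program unswitch_sketch
```

Whether writing it this way changes the timings for a given compiler and flag set is something to measure rather than assume; the sketch only makes explicit what the optimizer would otherwise have to prove and do on its own.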
To add to this, I found in the compiler reference that `-fast` is equivalent to `-ipo, -O3, -static, -fp-model fast`.
However, compiling with `-ipo, -O3, -static, -fp-model fast`, I get the same performance for both implementations:
ifx -ipo -O3 -static -fp-model fast test_nocpp.f90 -o test_nocpp && ./test_nocpp 1 1 && ifx -ipo -O3 -static -fp-model fast -cpp -D_OPTION_1 -D_OPTION_1_PLUS test_cpp.f90 -o test_cpp && ./test_cpp
ld: /home/x/software/intel/oneapi/compiler/2025.0/lib/libifcoremt.a(for_close_proc.o): in function `for__close_proc':
for_close_proc.c:(.text+0x1c1): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
Elapsed time: 4.875399999999999E-002
field(3,3,3) = 666.0000
ld: /home/x/software/intel/oneapi/compiler/2025.0/lib/libifcoremt.a(for_close_proc.o): in function `for__close_proc':
for_close_proc.c:(.text+0x1c1): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
Elapsed time: 4.877200000000001E-002
field(3,3,3) = 666.0000
I thought `-fast` was short for `-ipo, -O3, -static, -fp-model fast`, but I see quite a difference in performance between the two.
I am using ifx 2025.0.4 20241205.
It would be nice to get some clarity on this. Thanks in advance!
Several comments here that may help.
First, you don't need to use cpp; you can use fpp. If you name the file with the extension .F90, fpp is invoked by default, and it parses the #if defined directives just like cpp. You can confirm this by using the -keep option and checking the preprocessed source in the output .i90 file.
Example:
ifx -O2 -D_OPTION_1 -D_OPTION_1_PLUS test_fpp.F90 -keep
more test_fpp.i90
  ! run kernel
  !
  call cpu_time(t1)
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        diff1 = field(i, j, k) + 1.
        diff2 = field(i, j, k) - 1.
        res = diff1 + diff2
# 33
# 36
        field(i, j, k) = res
      end do
    end do
  end do
  call cpu_time(t2)
Now, on to the performance questions:
1) You can generate optimization reports to compare versions, options, etc. Compile with the option
-qopt-report=3
and then look for the <file>.optrpt. Compare the opt reports for the different options and defines.
2) You can dump the assembly and remove any doubt about the generated code:
ifx -O2 -D_OPTION_1 -D_OPTION_1_PLUS test_fpp.F90 -S
Look in test_fpp.s for the assembly version of your code.
3) You can look at all the defines and options the compiler uses with the option
-#
or
-dryrun
4) Options and whatnot: ifx uses the default LLVM optimizations for O1 - O2. To trigger additional optimizations you can add
-xhost
if you have a genuine Intel processor. -ipo will help with that as well, as will -flto.
I think this will give you the tools you need for your analysis.
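One more tool that may help with the analysis: the timings quoted earlier in the thread come from a single run of roughly 0.05 s each, so a difference of a few percent can be hard to separate from run-to-run noise. The following is a minimal sketch, not taken from the original programs, of repeating the kernel and reporting the minimum and mean wall-clock time with `system_clock`; the repetition count `nrep`, the reduced array size, and the stand-in `kernel` routine are all assumptions made for illustration.

```fortran
program timing_sketch
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: n = 128, nrep = 10   ! hypothetical size and repeat count
  real, allocatable :: field0(:,:,:), field(:,:,:)
  integer(int64) :: c0, c1, rate
  real(real64) :: t, tmin, tsum
  integer :: i, j, k, irep

  allocate(field0(n,n,n), field(n,n,n))
  do k = 1, n
    do j = 1, n
      do i = 1, n
        field0(i,j,k) = i + j*10. + k*100.
      end do
    end do
  end do

  call system_clock(count_rate=rate)
  if (rate <= 0) error stop 'no usable system clock'

  tmin = huge(tmin)
  tsum = 0
  do irep = 1, nrep
    field = field0                 ! same input on every repetition
    call system_clock(c0)
    call kernel(field)
    call system_clock(c1)
    t = real(c1 - c0, real64) / real(rate, real64)
    tmin = min(tmin, t)
    tsum = tsum + t
  end do

  print '(a,es12.4,a,es12.4)', 'min [s]: ', tmin, '   mean [s]: ', tsum / nrep
  ! Print a result element so the timed work is actually used.
  print *, 'field(3,3,3) = ', field(3,3,3)

contains

  ! Stand-in for the kernel under test (the "+" variant from the thread).
  subroutine kernel(f)
    real, intent(inout) :: f(:,:,:)
    f = (f + 1.) + (f - 1.)
  end subroutine kernel

end program timing_sketch
```

Comparing the minimum over several repetitions for each build is one common way to reduce the influence of noise; it does not replace looking at the optimization report or the assembly.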
Thanks!
I wonder if you have any idea why `-fast` performs worse than `-ipo, -O3, -static, -fp-model fast`. From the docs, I read that they're equivalent...
Pedro
It's not documented, but -fast also implies -xHost. Perhaps the advanced instructions are not optimal for your application. (This really should be documented!)
Ha... Indeed, I think this was also `ifort`'s behavior, so it makes sense. Thank you, Steve!
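Since an option can imply others, it can be useful to record what a binary was built with next to its timings. The standard `iso_fortran_env` module provides `compiler_version()` and `compiler_options()` for this. The sketch below is only an illustration; how a given compiler reports a shorthand such as `-fast` (as written, or expanded into the options it implies) is not something this example assumes.

```fortran
program build_info
  use, intrinsic :: iso_fortran_env, only: compiler_version, compiler_options
  implicit none
  ! Both functions return processor-dependent character strings.
  print '(2a)', 'compiler: ', compiler_version()
  print '(2a)', 'options:  ', compiler_options()
end program build_info
```

The two print statements could equally be placed at the top of the test programs so that every timing run logs its own build settings.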
