Re: Programs compiled with ifx are slower than compiled with ifort.

eliopoulos · ‎11-21-2024

Programs compiled with ifx are slower than compiled with ifort. Is this going to change with the future updates? Speed is very important to me and now ifort has been discontinued.

eliopoulos · ‎12-05-2024

Here they are.

Ron_Green · ‎12-05-2024

Great. Thanks, I will keep working on it today.

Ron_Green · ‎12-05-2024

Try this: add the option /Qipo. In the project properties it is under Optimization, Interprocedural Optimization.

I had to edit nlay.txt. The nl set in it was 56, but laminate only has 55 lines, so I changed nl to 55. This got the code running.

I am seeing strange results that I have yet to understand.

I modified the code to stop after the ( passes > 21.0 ) so it runs in about 36 to 29 seconds.

On linux I see this

ifort -i8 -O2 -xhost -qmkl FADAS_C1_01.for ; time ./a.out

time: 26.85 seconds

ifx -i8 -O2 -xhost -qmkl FADAS_C1_01.for ; time ./a.out

time: 29.53 seconds

That is about what you are seeing: ifx is slower.

Now I add in -ipo to ifx

ifx -i8 -O2 -xhost -ipo -qmkl FADAS_C1_01.for ; time ./a.out

time: 21.86 seconds

Now, I try adding ipo to ifort, but it HURTS ifort.

time: 28.68 seconds

IPO helps with inlining, which can also affect vectorization. The Linux server I have is an i5-7600 Kaby Lake, 7th Gen Core.

IFX does NEED -ipo, since interprocedural optimization is not enabled by default. Ifort DOES interprocedural optimization BY DEFAULT at O2 within a source file, hence it was getting good inlining and optimal code with just O2. IFX needs it enabled explicitly with the -ipo option or the -flto option (the two are equivalent).
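The four Linux timings above could be scripted along these lines. This is only a sketch: the flags and source file name are the ones quoted in this thread, while the helper names (`build_cmd`, `time_variant`, `run_all`), the `bench_*` executable names, the script name `bench.sh`, and the `RUN_BENCH` switch are made up for illustration. It reports whole seconds via `date +%s`, so use `time` when finer resolution is needed.

```shell
#!/bin/sh
# Sketch: compare ifort and ifx, each with and without -ipo.
# Flags and file name come from the thread; everything else is illustrative.
SRC=FADAS_C1_01.for
BASE_FLAGS="-i8 -O2 -xhost -qmkl"

# Print the compile line for: $1 = compiler, $2 = extra flag (may be empty), $3 = output file.
build_cmd() {
    if [ -n "$2" ]; then
        echo "$1 $BASE_FLAGS $2 $SRC -o $3"
    else
        echo "$1 $BASE_FLAGS $SRC -o $3"
    fi
}

# Build one variant, run it, and print the elapsed wall-clock time in whole seconds.
time_variant() {
    tv_fc=$1
    tv_extra=$2
    tv_label="$tv_fc${tv_extra:+ $tv_extra}"
    tv_exe="./bench_${tv_fc}${tv_extra:+_ipo}"
    if ! command -v "$tv_fc" >/dev/null 2>&1; then
        echo "$tv_label: compiler not on PATH, skipped"
        return 0
    fi
    # Unquoted on purpose so the command line splits into words.
    $(build_cmd "$tv_fc" "$tv_extra" "$tv_exe") || return 1
    tv_start=$(date +%s)
    "$tv_exe" >/dev/null
    tv_end=$(date +%s)
    echo "$tv_label: $((tv_end - tv_start)) s"
}

# Time all four combinations: ifort, ifort -ipo, ifx, ifx -ipo.
run_all() {
    for fc in ifort ifx; do
        time_variant "$fc" ""
        time_variant "$fc" "-ipo"
    done
}

# Nothing is compiled unless explicitly requested: RUN_BENCH=1 sh bench.sh
if [ "${RUN_BENCH:-0}" = "1" ]; then
    run_all
fi
```

Run it from the directory that holds the source and its input files (nlay.txt and so on), since the program is started from the current directory. A compiler that is not installed is reported and skipped rather than treated as an error.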

So much for Linux. I got onto a Windows 11 server with VS 2022. It's older, 6th Gen Xeon Skylake.

I go to the command line first and build with similar Windows options:

ifx /integer-size:64 /Qxhost /O2 /Qmkl FADAS_C1_01.for

As expected, it's slower

36.2 seconds

ifort with the same options and no IPO: 33 seconds, faster than ifx.

Add /Qipo and ifx runs in 29.8 seconds. Again fastest.

Now the ODD part. I go back to VS and build your project with ifort and ifx. Oddly, the default ifort and ifx builds both take 36 seconds.

I add /Qipo to ifx and ifort BUT still get 36 seconds! I can't get the code to budge! Note that the Windows build uses /threads and the multithreaded debug libs. I tried threaded static and got the same result, no improvement.

So why don't you give it a try in VS? Use Properties -> Fortran -> Optimization -> Interprocedural Optimization to set /Qipo.

What do you see?

Also, what is your host CPU? Hopefully Genuine Intel.

eliopoulos · ‎12-05-2024

I don't understand why your laminate.txt file has 55 lines; the file I uploaded has 56. I ran the program on a laptop with an Intel Core i7-1260P CPU, Windows 11 Pro and VS 2022. My respective times are:

ifx built with VS and /Qipo

time: 24.922 s

ifx built with command line and /Qipo

time: 18.719 s

ifort built with VS and without /Qipo

time: 14.016 s

My laptop appears to be faster than your servers. ifort remains faster.

Umar__Sait · ‎12-11-2024

A similar slowdown is seen on our nuclear TDDFT code, most likely due to double-precision complex arithmetic. The stats are:

IFX version 2025.0 with -O3 -Xhost -ipo: 145m4.824s

Ifort version 2024.2 with -O3 -Xhost: 125m17.952s

The code used OpenMP.
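For scale, the two wall-clock times above can be turned into a relative slowdown. A minimal sketch, assuming the `NNmSS.SSSs` format printed by `time`; the helper names `to_seconds` and `slowdown_pct` are hypothetical.

```shell
#!/bin/sh
# Convert a "145m4.824s"-style time to seconds.
to_seconds() {
    echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# Percentage by which time $1 exceeds time $2 (both in seconds).
slowdown_pct() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", (slow / fast - 1) * 100 }'
}

ifx_s=$(to_seconds 145m4.824s)      # ifx 2025.0, -O3 -Xhost -ipo
ifort_s=$(to_seconds 125m17.952s)   # ifort 2024.2, -O3 -Xhost
echo "ifx is $(slowdown_pct "$ifx_s" "$ifort_s")% slower than ifort"
```

With the figures quoted here this prints a slowdown of 15.8%, even though the ifx build had -ipo enabled.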

Andy59 · ‎12-12-2024

In my case, it is the compilation itself that is much slower with `ifx` than with the legacy `ifort`. I am using VS 2022 with Intel oneAPI. Does anyone happen to know the reason for that?

Ron_Green · ‎12-19-2024

@Andy59 I have a suspicion about the slow compilation time. First, is the ifort version 2021.13, from the oneAPI 2024.2 package? And what version is ifx?

Also, could you share the compilation times for both compilers for the source file where you see the difference?

My suspicion is initializers. Does your code have:
1) DATA statements? or

2) Initialization in type declarations? Like

real, dimension(3) :: point_in_space = [ 0.0, 0.0, 0.0 ]

with a lot of array constructors or type constructors?

or

3) just a LOT of initializations:

real :: var1 = 42.0

real :: var2 = 42.0

...

real :: var1000000 = 42.0

We have been working on initializers lately.

Another possible place, though less likely, is in deeply nested USE trees.