Solved: Re: ifx performance issues

Umar__Sait · ‎11-20-2025

There is a significant drop in performance between ifort and ifx. For Intel CPUs:

Intel(R) Xeon(R) Gold 6246R CPU @ 3.40GHz (16+16=32 core)

Intel ifort 2021.13.1
CFLAGS= -free -warn all -diag-disable=10448 -nogen-interfaces -no-prec-div -O3 -fp-model=fast=2 -xHost
real 124m20.650s
user 3049m40.783s
sys 10m40.430s

Intel ifx 2025.3.0
CFLAGS= -free -warn all -nogen-interfaces -O3 -xHost -qopenmp
real 162m36.896s
user 3966m5.817s
sys 12m34.195s

albeit faster than gfortran at 216m. The problem is double precision complex algebra. The code is also using OpenMP.

In addition, ifort runs much faster on AMD CPU:

AMD Ryzen Threadripper PRO 7975WX @ 4.0GHz 32-Cores

Intel ifort 2021.13.1
CFLAGS= -free -warn all -nogen-interfaces -diag-disable=10448 -Ofast -march=SSE4.2,CORE-AVX2,znver4 -qopenmp

real 67m37.749s
user 1744m37.811s
sys 13m21.562s

Intel ifx 2025.3.0
CFLAGS= -free -warn all -nogen-interfaces -O3 -Ofast -march=znver4 -qopenmp

real 113m5.164s
user 2487m13.638s
sys 13m46.812s

Intel seems to have removed some of the optimization features for AMD processors from ifx... can't use SSE4.2,CORE-AVX2 anymore.
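To put the wall-clock figures above on a common scale, here is a minimal sketch that converts the `real` stamps to seconds and takes the ifx/ifort ratio. The helper names are made up for illustration. Note that the flag sets differ between the runs as posted (for example, `-fp-model=fast=2` and `-no-prec-div` appear only on the Xeon ifort line), so the ratios are indicative rather than a like-for-like comparison.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a shell `time` stamp such as '124m20.650s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def slowdown(ifort_real: str, ifx_real: str) -> float:
    """Ratio of ifx wall time to ifort wall time (values above 1 mean ifx is slower)."""
    return to_seconds(ifx_real) / to_seconds(ifort_real)


# "real" times copied from the post above.
xeon = slowdown("124m20.650s", "162m36.896s")
threadripper = slowdown("67m37.749s", "113m5.164s")

print(f"Xeon Gold 6246R:         ifx / ifort = {xeon:.2f}")
print(f"Threadripper PRO 7975WX: ifx / ifort = {threadripper:.2f}")
```

On these numbers ifx takes roughly 1.31 times the ifort wall time on the Xeon and roughly 1.67 times on the Threadripper.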

Igor_V_Intel · ‎11-20-2025

Could you please share a code showing this performance drop?

Note that LLVM IR doesn't have complex data types and thus code with complex type algebra is a known issue vs ifort (there Intel had native complex type support on the proprietary IR level). It should be improved in the next major release of ifx.

Umar__Sait · ‎11-20-2025

This is a very large nuclear reactions code written in Fortran 95, so it is hard to share. Normally, in some cases the code has to be run for many days, even a week to get the answers. These were shorter runs for timing purposes. But one can see that all llvm based compilers are running slower than ifort. I am looking forward to the improvements you mention and hopefully we can switch to ifx at some point, but as you can see running for 5 days vs 10 days makes a big difference so we will stick with ifort until then.

mecej4O · ‎11-29-2025

If your code is built out of several Fortran source files, you can try the following. Use the Profiler to identify the subprogram in which most of the execution time is spent. Compile the corresponding source file with (a) Ifort /O3 and (b) Ifx /O3. After each compilation, link the EXE, run it and find the time consumed in the slowest subprogram.

Attempt to create a standalone test driver that just calls the slow subprogram with suitable argument values. Build two EXEs, once with Ifx /O2 and another using Ifort /O2. Submit the test source code and any data files needed to Intel as a reproducer.

mecej4O · ‎11-21-2025

The attached program conj11.f90 runs for about 1 second and produces a counterexample for the Euler Conjecture on the sum of fifth powers.

The EXEs generated using IFort consistently run faster than those produced by Ifx. I hope that this example code will help you to make Ifx produce EXEs that are not slower than those produced by Ifort.

Thanks.
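For readers without the attachment: the best-known counterexample to Euler's conjecture for fifth powers is 27^5 + 84^5 + 110^5 + 133^5 = 144^5 (Lander and Parkin, 1966). Whether conj11.f90 finds exactly this one is an assumption, since the attachment is not reproduced here. The check itself is a one-liner:

```python
def is_fifth_power_sum(terms, total):
    """True if the fifth powers of `terms` add up to `total` ** 5."""
    return sum(t ** 5 for t in terms) == total ** 5


# Lander and Parkin's counterexample: four fifth powers summing to a fifth power.
terms, total = (27, 84, 110, 133), 144
print(is_fifth_power_sum(terms, total))
```

Python integers are arbitrary precision, so the comparison is exact; in Fortran the same check needs 64-bit integers, because 144^5 does not fit in 32 bits.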

JohnNichols · ‎11-21-2025

conj11 performance

Core I7, VS 2022, latest Oneapi, Windows Preview

debug 32 bit == 2.781 seconds 4.5 times slower

debug 64 bit == 1.781 seconds 2.98 times slower

release 32 bit == 1.875 seconds 3.125 times slower

release 64 bit == 0.6 seconds 1 times slower (humour ok)

There is no evidence using a stock standard anything that IFX is slower, for this program it is not.

andrew_4619 · ‎11-21-2025

??? how could you test 32bit on latest Oneapi????

mecej4O · ‎11-29-2025

@JohnNichols : Your response compares Debug vs. Release and 32-bit vs. 64-bit. As always expected, 32-bit exes run slower than 64-bit exes and Debug exes run slower than Release exes. The question was about exes produced with an optimization option such as /O2 or /O3, targeting Release-x64, one exe produced by IFort, another exe produced by IFX.

For example, in a 64-bit OneAPI session, do Ifort /O2 /FeOne.exe and then Ifx /O2 /FeTwo.exe. Now, run and time One.exe and Two.exe and compare. If Two.exe took more time, that is unexpected and Intel will investigate why that happened and remedy IFx.
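The run-and-compare step can be scripted so that both executables are timed the same way. This is a minimal sketch, not an Intel tool: `best_wall_time` and `compare` are hypothetical helper names, and the stand-in command exists only so the script runs on any machine. Replace it with `["One.exe"]` and `["Two.exe"]` to time the real builds.

```python
import subprocess
import sys
import time


def best_wall_time(command, repeats=3):
    """Run `command` `repeats` times; return the shortest wall time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(baseline_cmd, candidate_cmd, repeats=3):
    """Return (baseline seconds, candidate seconds, candidate / baseline)."""
    base = best_wall_time(baseline_cmd, repeats)
    cand = best_wall_time(candidate_cmd, repeats)
    return base, cand, cand / base


if __name__ == "__main__":
    # Stand-in command (an assumption for illustration); substitute the
    # ifort-built and ifx-built executables here.
    stand_in = [sys.executable, "-c", "sum(range(100000))"]
    base, cand, ratio = compare(stand_in, stand_in, repeats=2)
    print(f"baseline {base:.3f}s, candidate {cand:.3f}s, ratio {ratio:.2f}")
```

Taking the minimum over a few repeats reduces noise from other processes; for a program that runs about a second, process start-up time is a visible part of the total.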

Igor_V_Intel · ‎11-28-2025

Thank you for sharing this code. I can see the performance difference between ifort and ifx on my Linux machine, which has a relatively old CPU architecture (Skylake). I checked on the latest Granite Rapids machine, and the performance there is identical. I will take a closer look at what can be improved in ifx (for now it looks like the Loop Optimizer behaves differently; there is no vectorization in either case).

mecej4O · ‎11-28-2025

Here is an even shorter version of the program, in which the innermost loop is replaced by an invocation of FINDLOC. For this version, the EXE produced by IFORT runs slightly faster than the EXE produced by IFX.
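For anyone unfamiliar with the intrinsic: for a rank-one array, FINDLOC returns the 1-based position of the first element equal to the target, or 0 if there is none. The sketch below mimics that behaviour to show how a search loop collapses into one call. How the attached program actually uses FINDLOC is an assumption; the table of fifth powers is just a plausible example.

```python
def findloc(values, target):
    """1-based position of the first element equal to `target`, or 0 if absent
    (mirrors Fortran FINDLOC on a rank-one array)."""
    for position, value in enumerate(values, start=1):
        if value == target:
            return position
    return 0


# Hypothetical use: look up a candidate sum in a table of fifth powers.
fifth_powers = [n ** 5 for n in range(1, 151)]
candidate = 27 ** 5 + 84 ** 5 + 110 ** 5 + 133 ** 5

print(findloc(fifth_powers, candidate))      # position of the matching fifth power
print(findloc(fifth_powers, candidate + 1))  # 0 means "not a fifth power in the table"
```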

JohnNichols · ‎11-22-2025

If you update using the Control Panel, it keeps the latest versions of the programs you require available; whoever invented it is a neat person. IFORT is no longer updated, but with this method it stays available as the last version issued. Intel does not update every program with each oneAPI release; it is merely a name. I can test IFORT and IFX in 32 and 64 bit. IFORT is the dodo bird, but I reported what I had.

I have not seen the latest ONEAPI noted as released, maybe I missed it.

But the timings are interesting, most of the effort is the excellent code.

JohnNichols · ‎11-22-2025

If we can go back to the Euler Conjecture, the result shown when you consider the infinite number of potential solutions to these problems, I suggest means, that the fact that one of them had a zero component for the fifth factor is just a simple bizarre finding of no significance, merely proving that the Math Gods have a sense of humour.

The fact that we can write code to find the numbers in 0.6 seconds is a tribute to coding skill.

Jim:

The transverse and longitudinal accelerations on one bridge are nicely negatively correlated; I was surprised how nice it was.

------------------------------------------------------------------------------------

Ron:

I want to complain, the blasted IFX is so fast with my structural analysis program, program is courtesy of the mecej4 fixes, that I cannot even get the coke bottle to my mouth before it has run. I used to be able to make a cup of tea. Could you pass my complaint to the makers, ask them to slow it up!

Thanks

John

andrew_4619 · ‎11-22-2025

I at least now understand what it said. It would have made more sense to add the words IFORT and IFX in the description, otherwise the comparisons made don't say much. Why even make benchmark comparisons in debug? That is for debugging; the speed is always slow and not very relevant if we are debugging and have break points.

mecej4O · ‎11-23-2025

It is actually reasonable that IFx, with debug options specified, produces a slower EXE, because the compiler inserts more machine code to do all the checking (array bounds, arrays allocated before use, etc.), than the older compiler did.

JohnNichols · ‎11-23-2025

1. If I had left out debug someone would have asked why, as I almost always code in debug and rarely use release, it seemed normal

2. One API is a generic name; if you want to know what version is associated with 2025.3 you have to look at the different elements, and some have not been updated for a long time. If you update using the Control Panel you lose nothing.

3. IFX release with the Harrison 1973 Structural Program with Eigen solver is blindingly fast, it is actually annoying as I cannot read the blasted write statements as it flashes by. Every time I run this program I thank the math gods for mecej4.

4. I have been looking at the FHA Post Tensioned Box Girder Design Manual - if you want to see the extension of the 1849 Royal Commissioner's UK report on bridge safety to the maximum extent before you start thinking about frequency this is it.

JohnNichols · ‎11-23-2025

@mecej4O , I have been modelling a Post Tensioned Box Girder bridge. It has 50 elements along the deck and it ran in about 10 seconds. I added two elements as a single column and it now takes 20 seconds. Would the matrix complexity slow up the eigen solver?

mecej4O · ‎11-25-2025

John, I have to confess that I get confused as to what your question is.

When you mentioned dropping or adding columns, I had to ask whether that was a real steel column added to your bridge, or a matrix column added to a stiffness matrix. Similarly, when you mentioned "elements", was that elements in the matrix or elements in the finite element model? Pity, please!

JohnNichols · ‎11-25-2025

This is the original structural model. It has 48 beams or elements and 49 nodes. It is clearly as narrow a band as one can get. The Harrison program executed in 10 seconds where all the EI in the stiffness matrix are constant. Now the second model adds a column at node 16 and replaces the support. We add a node 6 metres below 16 called node 50.

The bandwidth is no longer slim, and the code takes 20 seconds.

This is from a 2015 structural publication. I cannot see their justification for the last and first beams, but it changes the results for the Bending Moment Diagram from the manual analysis they did earlier, which affects the cable drape pattern. My team are talking about issues with these bridges from the 50's and 60's, and I am trying to work out what the real issues are and what is causing them. The first step is understanding how the original designers did the design.

Plus bridge loading patterns are nothing like the manuals from our decade of measurements. The reason for the divergence in the loading patterns is these beams have a mixed first mode that includes torsion, the beginning theory from the 1850s is the load and the vibration is just vertical and this is way wrong.

I hope this explains the problem.
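The bandwidth remark above can be made concrete with the node numbers given in the post. The sketch assumes the solver works on a banded matrix, which the thread does not confirm for the Harrison program, and it counts bandwidth in nodes rather than degrees of freedom. The `renumber` mapping is a hypothetical alternative numbering, added only to show how much the numbering matters.

```python
def half_bandwidth(elements):
    """Largest node-number difference across any element (in nodes, not DOFs)."""
    return max(abs(i - j) for i, j in elements)


# Original model: 48 beams joining nodes 1..49 in a chain.
deck = [(n, n + 1) for n in range(1, 49)]

# Second model: a column from node 16 down to the new node 50.
with_column = deck + [(16, 50)]


def renumber(node):
    """Hypothetical renumbering: call the column base 17 and shift later nodes up."""
    if node == 50:
        return 17
    return node + 1 if node >= 17 else node


renumbered = [(renumber(i), renumber(j)) for i, j in with_column]

print(half_bandwidth(deck))         # chain only
print(half_bandwidth(with_column))  # column base numbered 50
print(half_bandwidth(renumbered))   # column base numbered next to node 16
```

Numbering the column base as node 50 makes the half-bandwidth jump from 1 to 34 nodes, while numbering it next to node 16 keeps it at 2. For a banded factorization the work grows roughly with the number of equations times the square of the bandwidth, so the numbering alone could plausibly account for a large change in run time.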

jimdempseyatthecove · ‎11-24-2025

What is the performance when you remove two elements?

If the change is to 5 seconds, then this indicates that the runtime is proportional to a power of the number of elements.

If the change is to 10 seconds minus a small number of seconds, then this indicates that you may have exceeded a cache level.

L2 is about 3x-4x slower than L1, L3 is about 4x-5x slower than L2, RAM is about 5x slower than L3.

Your system may vary from this.

Jim
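To see what the two cases above imply numerically, here is a small sketch using the figures John quoted (50 elements in about 10 seconds, 52 elements in about 20 seconds) and the rule-of-thumb cache ratios from this post. The function name is hypothetical, and the ratios are the rough ones stated above, not measurements.

```python
import math


def implied_exponent(n_before, t_before, n_after, t_after):
    """Exponent p in t ~ n**p that would explain the observed change in run time."""
    return math.log(t_after / t_before) / math.log(n_after / n_before)


# 50 elements -> 10 s, 52 elements -> 20 s (figures from the earlier post).
p = implied_exponent(50, 10.0, 52, 20.0)

# If the same power law held, removing two elements (48) would give:
t_48 = 10.0 * (48 / 50) ** p

# Compounding the rule-of-thumb ratios: L2 3x-4x L1, L3 4x-5x L2, RAM 5x L3.
ram_vs_l1 = (3 * 4 * 5, 4 * 5 * 5)

print(f"implied exponent: {p:.1f}")
print(f"predicted time at 48 elements: {t_48:.1f} s")
print(f"RAM vs L1 slowdown range: {ram_vs_l1[0]}x to {ram_vs_l1[1]}x")
```

The power-law reading gives a time near 5 seconds at 48 elements, matching the first case above, but it needs an exponent of about 17 to 18, which is far steeper than typical solver scaling. That makes a threshold effect, such as a cache limit or a change in matrix structure, the more plausible explanation.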

jimdempseyatthecove · ‎11-25-2025

>>We add a node 6 metres below 16 called node 50.

Is this what you mean?

Jim

TS19 · ‎12-10-2025

Hi