Intel® oneAPI Math Kernel Library

PARDISO performance fluctuation on INTEL processor

NicolaGiuliani

Hi all,

my name is Nicola Giuliani and I am a software engineer working in the field of numerical analysis.

Over the past few days I have been testing the PARDISO solver on a sparse matrix with 486k rows/columns and 3.3M nonzeros. I store the matrix in CSR format and call PARDISO to solve it. I have compiled the program (which creates the CSR matrix and then factorizes it) in three different ways, using both icpx and g++ (9.2 and 10.2), and I run each build with 1, 2, 4, 8, 16, 32, and 64 threads to check the scalability of the solution. I am using Intel oneAPI version 2025.1.1. The three builds are:

(1) icpx kernel_only_pardiso.cpp -qmkl=parallel -o intel_intel.out

(2) g++ kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -I"${MKLROOT}/include" -o gnu_gnu.out

(3) icpx kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -I"${MKLROOT}/include" -o intel_gnu.out
 
The two builds compiled with icpx, (1) and (3), show equivalent timings, but the g++ build (2) takes between 2 and 4 times longer than the icpx versions. It seems to me that cases (2) and (3) should use the same version of Intel oneAPI, since they link the same libraries.
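
For context, the solver part of the program is just the standard PARDISO phase sequence. Below is a trimmed sketch of the kind of call sequence used in kernel_only_pardiso.cpp, not the actual attached file (which also reads the matrix and right-hand side from disk); the matrix type mtype = 11 (real, nonsymmetric) is an assumption here:

#include <mkl.h>
#include <vector>

// Sketch: solve A*x = b once, with A stored in 1-based CSR format.
void solve_with_pardiso(MKL_INT n, const std::vector<double> &a,
                        const std::vector<MKL_INT> &ia,
                        const std::vector<MKL_INT> &ja,
                        std::vector<double> &b, std::vector<double> &x)
{
    void *pt[64] = {};      // PARDISO internal handle, must start zeroed
    MKL_INT iparm[64] = {};
    MKL_INT mtype = 11;     // real, nonsymmetric matrix (assumption)
    MKL_INT maxfct = 1, mnum = 1, nrhs = 1, msglvl = 0, error = 0;
    MKL_INT idum = 0;       // dummy permutation argument
    double ddum = 0.0;      // dummy rhs/solution for the non-solve phases

    pardisoinit(pt, &mtype, iparm);  // default iparm for this matrix type

    MKL_INT phase = 11;     // analysis / fill-reducing reordering
    pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a.data(), ia.data(),
            ja.data(), &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &error);

    phase = 22;             // numerical factorization (dominates the runtime)
    pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a.data(), ia.data(),
            ja.data(), &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &error);

    phase = 33;             // forward/backward solve
    pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a.data(), ia.data(),
            ja.data(), &idum, &nrhs, iparm, &msglvl, b.data(), x.data(), &error);

    phase = -1;             // release all internal memory
    pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, &ddum, ia.data(),
            ja.data(), &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &error);
}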
 
Do you have any suggestions as to what might explain such a difference?

 

Gajanan_Choudhary

Hi @NicolaGiuliani,

Please attach your code so that we can help you out.

Without the code we can only make guesses, so here is mine: 

icpx kernel_only_pardiso.cpp -qmkl=parallel -o intel_intel.out

is similar to (but maybe not exactly the same as):

icpx kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl -I"${MKLROOT}/include" -o intel_intel.out

The difference between your (1) and (3) options (both icpx compilations) is mainly the Intel versus GNU threading layer, i.e., `-lmkl_intel_thread -liomp5` versus `-lmkl_gnu_thread -lgomp`. In both cases, `kernel_only_pardiso.cpp` itself is compiled with icpx. You mentioned you got similar timings for those cases, so in this particular case it appears that choosing `-lmkl_intel_thread -liomp5` versus `-lmkl_gnu_thread -lgomp` does not matter for performance.
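
(As a quick sanity check of what each executable actually loads, on Linux something like

ldd intel_intel.out | grep -iE "mkl|omp"

run on each of the three binaries will show which MKL layers and which OpenMP runtime, libiomp5 or libgomp, are pulled in at run time.)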

Although it may initially appear that there should then also be no difference between the g++ compilation (2) and the icpx compilation (3), the difference there is that your `kernel_only_pardiso.cpp` is compiled with g++ versus icpx. The difference in timings suggests that you are doing a lot of work in that file outside the oneMKL PARDISO function calls. Is that the case? (The oneMKL library is pre-compiled with Intel compiler(s); you only link against it, so it is not really being compiled by g++ in (2) or by icpx in (3), if you get what I mean. If that's not clear, I can elaborate in another reply.)
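
One quick way to separate the two is to time the oneMKL calls independently of the rest of your program. A minimal sketch of what I mean, where the `timed` helper is just an illustration I made up, not part of oneMKL:

#include <chrono>
#include <cstdio>

// Illustration: run a callable, print the elapsed wall time, return seconds.
template <typename Fn>
double timed(const char *label, Fn &&fn)
{
    auto t0 = std::chrono::steady_clock::now();
    fn();
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%-20s %.3f s\n", label, seconds);
    return seconds;
}

If you wrap the CSR assembly and each PARDISO phase in calls like `timed("factorization", [&]{ /* phase 22 call */ });`, the printed breakdown will show immediately whether the extra time in your g++ build is spent inside or outside the oneMKL calls.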

 

However, without looking at the code, this is still just a guess. It might be something entirely different; we can only be sure once you share the code.

Hope that helps,

Gajanan Choudhary

Intel oneMKL team

NicolaGiuliani

Hi @Gajanan_Choudhary,

thank you for the swift reply. I am attaching the simple example I was talking about; it can be run as:

a.out AScaledReal5_1.txt bScaledReal5_1.txt

Thank you again,

Nicola

NicolaGiuliani

Hi @Gajanan_Choudhary,

I have done some additional digging and I have some news.

g++ -Ofast kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -I"${MKLROOT}/include" -o gnu_gnu.out

has the same performance as the Intel-compiled version. A further refinement shows that

g++ -funsafe-math-optimizations kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -I"${MKLROOT}/include" -o gnu_gnu.out

shows the same performance as the Intel-compiled one. According to the gcc man page, -funsafe-math-optimizations turns on -fno-signed-zeros, -fno-trapping-math, -fassociative-math, and -freciprocal-math.
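
(For reference, which sub-flags a given GCC option enables can be checked directly with the compiler's option dump, e.g.

g++ -funsafe-math-optimizations -Q --help=optimizers | grep -E "signed-zeros|trapping-math|associative-math|reciprocal-math"

which prints the state of each of those flags.)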

The man page also states that, when used at link time, it "may include libraries or startup files that change the default FPU control word or other similar optimizations". So I tried to identify whether any of those individual flags is responsible for the performance increase:

g++ -fno-signed-zeros -fno-trapping-math -fassociative-math -freciprocal-math kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -I"${MKLROOT}/include" -o gnu_gnu.out

shows the same slowdown w.r.t. the Intel-compiled one, so my conclusion is that it is not the compilation of kernel_only_pardiso.cpp that causes the performance difference, but rather the different linking of the MKL libraries, which is influenced by -funsafe-math-optimizations. Do you have any idea about what is changing at link level, or any idea on how to find out?
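
One idea for finding out, given that the man page mentions the FPU control word: read the SSE control register (MXCSR) in both builds and compare. A minimal probe, assuming an x86-64 Linux machine, could be:

#include <cstdio>
#include <xmmintrin.h>

// Print MXCSR and its flush-to-zero (FTZ, bit 15) and
// denormals-are-zero (DAZ, bit 6) flags.
int main()
{
    unsigned int csr = _mm_getcsr();
    std::printf("MXCSR = 0x%08x  FTZ = %u  DAZ = %u\n",
                csr, (csr >> 15) & 1u, (csr >> 6) & 1u);
    return 0;
}

If the binary linked with -funsafe-math-optimizations reports FTZ/DAZ set while the plain build does not, the change would come from GCC linking the crtfastmath.o startup file rather than from how kernel_only_pardiso.cpp itself is compiled, and flushing denormals to zero can make a large difference in a factorization.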

 

Thank you again,

Nicola