Hi all,
my name is Nicola Giuliani and I am a software engineer working in the field of numerical analysis.
Over the past few days I have been testing the PARDISO solver on a sparse matrix with 486k rows/columns and 3.3M non-zeros. I store the matrix in CSR format and call PARDISO to solve the system. I compile the program (which creates the CSR matrix and then factorizes it) in three different scenarios, using both Intel and GCC (9.2 and 10.2), and run with 1, 2, 4, 8, 16, 32, and 64 threads to see how the solution scales. I am using Intel oneAPI version 2025.1.1.
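For reference, here is a minimal sketch of the kind of driver described above (the tiny 3x3 matrix is a placeholder, not the attached data, and the setup is simplified with respect to the actual kernel_only_pardiso.cpp):

```cpp
// Minimal sketch: build a tiny CSR matrix and solve it with oneMKL PARDISO.
// The matrix values and sizes are placeholders; the real run uses a
// 486k x 486k matrix with 3.3M non-zeros read from file.
#include <mkl_pardiso.h>
#include <mkl_types.h>
#include <cstdio>
#include <vector>

int main() {
    // 3x3 unsymmetric matrix in one-based CSR format.
    MKL_INT n = 3;
    std::vector<MKL_INT> ia = {1, 3, 5, 7};            // row pointers
    std::vector<MKL_INT> ja = {1, 2, 1, 2, 2, 3};      // column indices
    std::vector<double>  a  = {4.0, -1.0, -1.0, 4.0, -1.0, 4.0};
    std::vector<double>  b  = {1.0, 1.0, 1.0}, x(n);

    void*   pt[64] = {};                // PARDISO internal handle
    MKL_INT iparm[64] = {};
    MKL_INT mtype = 11;                 // real, non-symmetric
    pardisoinit(pt, &mtype, iparm);     // fill iparm with default values
    iparm[34] = 0;                      // one-based indexing

    MKL_INT maxfct = 1, mnum = 1, nrhs = 1, msglvl = 0, error = 0, idum = 0;
    double  ddum = 0.0;

    MKL_INT phase = 13;                 // analysis + factorization + solve
    pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a.data(), ia.data(),
            ja.data(), &idum, &nrhs, iparm, &msglvl, b.data(), x.data(), &error);
    if (error != 0) {
        std::printf("PARDISO returned error %lld\n", (long long)error);
        return 1;
    }
    std::printf("x = %g %g %g\n", x[0], x[1], x[2]);

    phase = -1;                         // release internal memory
    pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, &ddum, ia.data(),
            ja.data(), &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &error);
    return 0;
}
```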
Hi @NicolaGiuliani,
Please attach your code so that we can help you out.
Without the code we can only make guesses, so here is mine:
icpx kernel_only_pardiso.cpp -qmkl=parallel -o intel_intel.out
is similar to (but maybe not exactly the same as):
icpx kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl -I"${MKLROOT}/include" -o intel_intel.out
The difference between your (1) and (3) options (icpx compilation) is mainly the "Intel versus GNU threading layer", i.e., `-lmkl_intel_thread -liomp5` versus `-lmkl_gnu_thread -lgomp`. In both cases, `kernel_only_pardiso.cpp` is also compiled with icpx. You mentioned you got similar timings for those cases, so in this particular case it appears that choosing `-lmkl_intel_thread -liomp5` versus `-lmkl_gnu_thread -lgomp` does not matter for performance.
Although it may initially appear that there should then also be no difference between g++ compilation (2) and icpx compilation (3), the difference there is that your `kernel_only_pardiso.cpp` itself is compiled with g++ versus icpx. The difference in timings indicates that you are doing a lot more work in that file outside the oneMKL PARDISO function calls. Is that the case? (The oneMKL library is pre-compiled with Intel compilers; you are only linking against it, not actually compiling it with g++ in (2) or icpx in (3), if you get what I mean; if that's not clear, maybe I can elaborate in another reply.)
However, without looking at the code, this is still just a guess. It might be something entirely different, we can only be sure if you can share the code.
Hope that helps,
Gajanan Choudhary
Intel oneMKL team
Hi @Gajanan_Choudhary ,
thank you for the swift reply. I attach the simple example I was talking about, which I run as:
a.out AScaledReal5_1.txt bScaledReal5_1.txt
Thank you again,
Nicola
Hi @Gajanan_Choudhary ,
I have done some additional digging and I have some news.
g++ -Ofast kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -I"${MKLROOT}/include" -o gnu_gnu.out
has the same performance as the Intel-compiled version. A further refinement shows that
g++ -funsafe-math-optimizations kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -I"${MKLROOT}/include" -o gnu_gnu.out
shows the same performance as the Intel-compiled one. Looking at the GCC man page, -funsafe-math-optimizations turns on -fno-signed-zeros, -fno-trapping-math, -fassociative-math, and -freciprocal-math.
The man page also states that "When used at link time, it may include libraries or startup files that change the default FPU control word or other similar optimizations". So I tried to identify whether any of those flags is responsible for the performance increase.
g++ -fno-signed-zeros -fno-trapping-math -fassociative-math -freciprocal-math kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -I"${MKLROOT}/include" -o gnu_gnu.out
shows the same slowdown w.r.t. the Intel-compiled one, so my conclusion is that it is not the compilation of kernel_only_pardiso.cpp that causes the performance fluctuation, but rather the different linking of the MKL libraries, which is influenced by -funsafe-math-optimizations. Do you have any idea about what is changing at the linking level, or how to find out?
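One way to check whether the link-time difference is in the FPU control word would be to read the MXCSR register at program startup in the two builds; a minimal standalone sketch (not part of the attached example) could look like this:

```cpp
// Print the MXCSR control/status register so the two builds can be compared.
// Bit 15 is flush-to-zero (FTZ), bit 6 is denormals-are-zero (DAZ).
#include <xmmintrin.h>
#include <cstdio>

int main() {
    unsigned int csr = _mm_getcsr();
    std::printf("MXCSR = 0x%08x  FTZ=%u  DAZ=%u\n",
                csr, (csr >> 15) & 1u, (csr >> 6) & 1u);
    return 0;
}
```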
Thank you again,
Nicola
Thank you for this interesting discussion. We investigated the issue and found that the performance difference you observe is caused by subnormal (denormal) numbers that appear during the factorization phase. By default, when using the Intel compiler, the flush-to-zero (FTZ) option is activated, which flushes all these subnormals to zero at all optimization levels except '-O0'. This is not the case for GCC, where FTZ is disabled by default and enabled only when you set flags like '-funsafe-math-optimizations'. Note that starting from GCC version 12.4, there is a separate flag '-mdaz-ftz' to enable this option.
There is also an interesting paper that describes the effect of these options on solvers, where, among others, an old version of oneMKL PARDISO is compared: Zounon et al. 2022, Performance impact of precision reduction in sparse linear systems solvers, https://doi.org/10.7717/peerj-cs.778
If you want to avoid other optimizations that arise with '-funsafe-math-optimizations' in GCC, you could also enable the options manually by toggling the MXCSR register. An example is given in the paper mentioned above, or you could use Intel IPP routines (SetFlushToZero) to enable this. However, note that with GNU threading, this has to be done before the first parallel region appears in the code, so for newer versions of PARDISO, it must be done before phase 1.
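For illustration, here is a minimal sketch of enabling FTZ and DAZ by hand through the standard SSE intrinsics (one possible way of toggling the MXCSR bits; the IPP routine mentioned above is another):

```cpp
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

// Call this before PARDISO phase 1 (and, with the GNU threading layer,
// before any parallel region) so that worker threads inherit the setting.
void enable_ftz_daz() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // flush subnormal results to zero
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // treat subnormal inputs as zero
}
```

Since MXCSR is a per-thread register, threads created after this call inherit the FTZ/DAZ state from the thread that creates them, which is why it has to happen before the first parallel region with GNU threading.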
I hope this helps you gain a clearer understanding. Thank you once again for posting.
Kind Regards,
Chris
Hi @c_sim !
Thank you for your reply and the useful insights on the problem. We tested on our side and can confirm that the performance issue is linked to subnormal numbers.
Bests,
Nicola
