Intel® oneAPI Math Kernel Library

PARDISO performance fluctuation on INTEL processor

NicolaGiuliani
Beginner

Hi all,

my name is Nicola Giuliani and I am a software engineer working in the field of numerical analysis.

In the past few days I have been testing the PARDISO solver on a sparse matrix with 486k rows/columns and 3.3M non-zeros. I store it in CSR format and call PARDISO to solve it. The program creates the CSR matrix and then factorizes it. I have tested three different compilation scenarios, using both the Intel compiler and gcc (9.2 and 10.2), and I run with 1, 2, 4, 8, 16, 32, and 64 threads to see the scalability of the solution. I am using Intel oneAPI version 2025.1.1.

icpx kernel_only_pardiso.cpp -qmkl=parallel -o intel_intel.out
 
g++ kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -I"${MKLROOT}/include" -o gnu_gnu.out
 
icpx kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -I"${MKLROOT}/include" -o intel_gnu.out
 
The two compilations with icpx show equivalent results in terms of timing, but the version compiled with gcc takes between 2 and 4 times longer than the versions compiled with icpx. It seems to me that cases 2 and 3 should use the same version of Intel oneAPI.
 
Do you have any suggestions as to what might explain such a difference?

 

Gajanan_Choudhary

Hi @NicolaGiuliani,

Please attach your code for us to be able to help you out.

Without the code we can only make guesses, so here is mine: 

icpx kernel_only_pardiso.cpp -qmkl=parallel -o intel_intel.out

is similar to (but maybe not exactly the same as):

icpx kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl -I"${MKLROOT}/include" -o intel_intel.out

The difference between your (1) and (3) options (icpx compilation) is mainly the Intel versus GNU threading layer, i.e., `-lmkl_intel_thread -liomp5` versus `-lmkl_gnu_thread -lgomp`. In both cases, `kernel_only_pardiso.cpp` is also compiled with icpx. You mentioned you got similar timings for those cases, so in this particular case it appears that using `-lmkl_intel_thread -liomp5` versus `-lmkl_gnu_thread -lgomp` does not matter for performance.

Although it may initially appear that there should then also be no difference between the g++ compilation (2) and the icpx compilation (3), the difference there is that your `kernel_only_pardiso.cpp` is being compiled with g++ versus icpx. The difference in timings indicates that you are doing a lot more work in that file outside the oneMKL PARDISO function calls. Is that the case? (The oneMKL library is pre-compiled with Intel compilers; you are only linking to it, not really compiling it with g++ in (2) or icpx in (3), if you get what I mean; if that's not clear, I can elaborate in another reply.)

 

However, without looking at the code, this is still just a guess. It might be something entirely different; we can only be sure if you share the code.

 

Hope that helps,

Gajanan Choudhary

Intel oneMKL team

 

 

NicolaGiuliani
Beginner

Hi @Gajanan_Choudhary ,

 

thank you for the swift reply. I attach the simple example I was talking about; it is run as:

 

a.out AScaledReal5_1.txt bScaledReal5_1.txt

 

Thank you again,

 

Nicola

 

 

NicolaGiuliani
Beginner

Hi @Gajanan_Choudhary ,

 

I have done some additional digging and I have some news.

 

g++ -Ofast kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -I"${MKLROOT}/include" -o gnu_gnu.out

 

has the same performance as the Intel-compiled version. A further refinement shows that

 

g++ -funsafe-math-optimizations kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -I"${MKLROOT}/include" -o gnu_gnu.out

 

shows the same performance as the Intel-compiled one. Looking at the gcc man page, -funsafe-math-optimizations turns on -fno-signed-zeros, -fno-trapping-math, -fassociative-math, and -freciprocal-math.

The man page also states that "When used at link time, it may include libraries or startup files that change the default FPU control word or other similar optimizations". So I tried to identify whether any of the previous flags is responsible for the performance increase.

 

g++ -fno-signed-zeros -fno-trapping-math -fassociative-math -freciprocal-math kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -I"${MKLROOT}/include" -o gnu_gnu.out

 

shows the same slowdown w.r.t. the Intel-compiled one. My conclusion is that it is not the compilation of kernel_only_pardiso.cpp that causes the performance fluctuation, but something in the linking of the MKL libraries that is influenced by -funsafe-math-optimizations. Do you have any idea about what is changing at link time, or any idea on how to find out?

 

Thank you again,

Nicola
