Intel® oneAPI Math Kernel Library

PARDISO performance fluctuation on INTEL processor

NicolaGiuliani

Hi all,

my name is Nicola Giuliani and I am a software engineer working in the field of numerical analysis.

Over the past few days I have been testing the PARDISO solver on a sparse matrix with 486k rows/columns and 3.3M nonzeros. I store the matrix in CSR format and call PARDISO to solve it. I have compiled the program (which creates the CSR matrix and then factorizes it) in three different ways, using both icpx and g++ (9.2 and 10.2), and I run each build with 1, 2, 4, 8, 16, 32, and 64 threads to check the scalability of the solution. I am using Intel oneAPI version 2025.1.1. The three builds are:

(1) icpx kernel_only_pardiso.cpp -qmkl=parallel -o intel_intel.out

(2) g++ kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -I"${MKLROOT}/include" -o gnu_gnu.out

(3) icpx kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -I"${MKLROOT}/include" -o intel_gnu.out
 
The two builds compiled with icpx, (1) and (3), show equivalent timings, but the g++ build (2) takes between 2 and 4 times longer than the icpx versions. It seems to me that cases (2) and (3) should use the same version of Intel oneAPI, since they link the same libraries.
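
For context, the solver part of the program is just the standard PARDISO phase sequence. Below is a trimmed sketch of the kind of call sequence used in kernel_only_pardiso.cpp, not the actual attached file (which also reads the matrix and right-hand side from disk); the matrix type mtype = 11 (real, nonsymmetric) is an assumption here:

#include <mkl.h>
#include <vector>

// Sketch: solve A*x = b once, with A stored in 1-based CSR format.
void solve_with_pardiso(MKL_INT n, const std::vector<double> &a,
                        const std::vector<MKL_INT> &ia,
                        const std::vector<MKL_INT> &ja,
                        std::vector<double> &b, std::vector<double> &x)
{
    void *pt[64] = {};      // PARDISO internal handle, must start zeroed
    MKL_INT iparm[64] = {};
    MKL_INT mtype = 11;     // real, nonsymmetric matrix (assumption)
    MKL_INT maxfct = 1, mnum = 1, nrhs = 1, msglvl = 0, error = 0;
    MKL_INT idum = 0;       // dummy permutation argument
    double ddum = 0.0;      // dummy rhs/solution for the non-solve phases

    pardisoinit(pt, &mtype, iparm);  // default iparm for this matrix type

    MKL_INT phase = 11;     // analysis / fill-reducing reordering
    pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a.data(), ia.data(),
            ja.data(), &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &error);

    phase = 22;             // numerical factorization (dominates the runtime)
    pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a.data(), ia.data(),
            ja.data(), &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &error);

    phase = 33;             // forward/backward solve
    pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a.data(), ia.data(),
            ja.data(), &idum, &nrhs, iparm, &msglvl, b.data(), x.data(), &error);

    phase = -1;             // release all internal memory
    pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, &ddum, ia.data(),
            ja.data(), &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &error);
}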
 
Do you have any suggestions as to what might explain such a difference?

 

Gajanan_Choudhary

Hi @NicolaGiuliani,

Please attach your code so that we can help you out.

Without the code we can only make guesses, so here is mine: 

icpx kernel_only_pardiso.cpp -qmkl=parallel -o intel_intel.out

is similar to (but maybe not exactly the same as):

icpx kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl -I"${MKLROOT}/include" -o intel_intel.out

The difference between your (1) and (3) options (both icpx compilations) is mainly the Intel versus GNU threading layer, i.e., `-lmkl_intel_thread -liomp5` versus `-lmkl_gnu_thread -lgomp`. In both cases, `kernel_only_pardiso.cpp` itself is compiled with icpx. You mentioned you got similar timings for those cases, so in this particular case it appears that choosing `-lmkl_intel_thread -liomp5` versus `-lmkl_gnu_thread -lgomp` does not matter for performance.
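
(As a quick sanity check of what each executable actually loads, on Linux something like

ldd intel_intel.out | grep -iE "mkl|omp"

run on each of the three binaries will show which MKL layers and which OpenMP runtime, libiomp5 or libgomp, are pulled in at run time.)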

Although it may initially appear that there should then also be no difference between the g++ compilation (2) and the icpx compilation (3), the difference there is that your `kernel_only_pardiso.cpp` is compiled with g++ versus icpx. The difference in timings suggests that you are doing a lot of work in that file outside the oneMKL PARDISO function calls. Is that the case? (The oneMKL library is pre-compiled with Intel compiler(s); you only link against it, so it is not really being compiled by g++ in (2) or by icpx in (3), if you get what I mean. If that's not clear, I can elaborate in another reply.)
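
One quick way to separate the two is to time the oneMKL calls independently of the rest of your program. A minimal sketch of what I mean, where the `timed` helper is just an illustration I made up, not part of oneMKL:

#include <chrono>
#include <cstdio>

// Illustration: run a callable, print the elapsed wall time, return seconds.
template <typename Fn>
double timed(const char *label, Fn &&fn)
{
    auto t0 = std::chrono::steady_clock::now();
    fn();
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%-20s %.3f s\n", label, seconds);
    return seconds;
}

If you wrap the CSR assembly and each PARDISO phase in calls like `timed("factorization", [&]{ /* phase 22 call */ });`, the printed breakdown will show immediately whether the extra time in your g++ build is spent inside or outside the oneMKL calls.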

 

However, without looking at the code, this is still just a guess. It might be something entirely different; we can only be sure once you share the code.

Hope that helps,

Gajanan Choudhary

Intel oneMKL team

NicolaGiuliani

Hi @Gajanan_Choudhary,

thank you for the swift reply. I am attaching the simple example I was talking about; it can be run as:

a.out AScaledReal5_1.txt bScaledReal5_1.txt

Thank you again,

Nicola

NicolaGiuliani

Hi @Gajanan_Choudhary,

I have done some additional digging and I have some news.

g++ -Ofast kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -I"${MKLROOT}/include" -o gnu_gnu.out

has the same performance as the Intel-compiled version. A further refinement shows that

g++ -funsafe-math-optimizations kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -I"${MKLROOT}/include" -o gnu_gnu.out

shows the same performance as the Intel-compiled one. According to the gcc man page, -funsafe-math-optimizations turns on -fno-signed-zeros, -fno-trapping-math, -fassociative-math, and -freciprocal-math.
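
(For reference, which sub-flags a given GCC option enables can be checked directly with the compiler's option dump, e.g.

g++ -funsafe-math-optimizations -Q --help=optimizers | grep -E "signed-zeros|trapping-math|associative-math|reciprocal-math"

which prints the state of each of those flags.)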

The man page also states that, when used at link time, it "may include libraries or startup files that change the default FPU control word or other similar optimizations". So I tried to identify whether any of those individual flags is responsible for the performance increase:

g++ -fno-signed-zeros -fno-trapping-math -fassociative-math -freciprocal-math kernel_only_pardiso.cpp -m64 -L${MKLROOT}/lib -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl -I"${MKLROOT}/include" -o gnu_gnu.out

shows the same slowdown w.r.t. the Intel-compiled one, so my conclusion is that it is not the compilation of kernel_only_pardiso.cpp that causes the performance difference, but rather the different linking of the MKL libraries, which is influenced by -funsafe-math-optimizations. Do you have any idea about what is changing at link level, or any idea on how to find out?
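
One idea for finding out, given that the man page mentions the FPU control word: read the SSE control register (MXCSR) in both builds and compare. A minimal probe, assuming an x86-64 Linux machine, could be:

#include <cstdio>
#include <xmmintrin.h>

// Print MXCSR and its flush-to-zero (FTZ, bit 15) and
// denormals-are-zero (DAZ, bit 6) flags.
int main()
{
    unsigned int csr = _mm_getcsr();
    std::printf("MXCSR = 0x%08x  FTZ = %u  DAZ = %u\n",
                csr, (csr >> 15) & 1u, (csr >> 6) & 1u);
    return 0;
}

If the binary linked with -funsafe-math-optimizations reports FTZ/DAZ set while the plain build does not, the change would come from GCC linking the crtfastmath.o startup file rather than from how kernel_only_pardiso.cpp itself is compiled, and flushing denormals to zero can make a large difference in a factorization.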

 

Thank you again,

Nicola