Pardiso 32-bit vs 64-bit performance

di_luca__daniele1 · 07-31-2019

Using Pardiso in an iterative routine, where the solver is called many times, I noticed that the 64-bit version is significantly slower than the 32-bit version. I'm not yet sure whether this is due to Pardiso alone or to something else, so I was wondering whether such a performance comparison exists.

Is there any way to investigate whether the slowdown comes from the 64-bit version of Pardiso, and whether its performance can be improved?

Thanks in advance,

Daniele

Kirill_V_Intel · 07-31-2019

Hello Daniele,

The performance difference can be easily explained. Whenever you link Intel MKL against the ILP64 interface (64-bit integers), all data declared as MKL_INT is interpreted as 64-bit integers. Thus, if the functionality you're using is memory-bound (performance depends on how fast you can go over the data), you are passing twice as many bytes of integer data, hence a slowdown. However, internally PARDISO always stores some data as long integers, so we'd expect the difference to be about 20-30%.

So, if your data fits into 32-bit integers, it is better to use the LP64 interface.
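To make the "twice as many bytes" point concrete, here is a minimal sketch that counts the bytes in the CSR arrays a caller passes in (row pointers, column indices, values) for 32-bit and 64-bit indices. The helper names are hypothetical, and the count covers only the user-facing arrays, not PARDISO's internal structures, so the resulting ratio is an illustration of the memory traffic argument rather than a predicted slowdown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes needed for the CSR index arrays (n + 1 row pointers plus nnz column
// indices) of an n x n matrix, for a given integer width in bytes.
constexpr std::size_t csr_index_bytes(std::size_t n, std::size_t nnz,
                                      std::size_t int_bytes) {
    return (n + 1 + nnz) * int_bytes;
}

// Bytes needed for the values array (double precision).
constexpr std::size_t csr_value_bytes(std::size_t nnz) {
    return nnz * sizeof(double);
}

// Ratio of total CSR input size with 64-bit indices over 32-bit indices.
inline double ilp64_over_lp64(std::size_t n, std::size_t nnz) {
    const double values = static_cast<double>(csr_value_bytes(nnz));
    const double lp64 =
        static_cast<double>(csr_index_bytes(n, nnz, sizeof(std::int32_t))) + values;
    const double ilp64 =
        static_cast<double>(csr_index_bytes(n, nnz, sizeof(std::int64_t))) + values;
    return ilp64 / lp64;
}
```

Because the values array does not change size, the ratio always lies between 1 and 2: the index arrays double, but the total input grows by less than that.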

In your case there are potentially a couple of options, but you need to tell us more about how you're calling PARDISO and what changes between iterations. For example, if the matrix remains the same and you just want to solve with different right-hand sides, you can factor the matrix only once (phase 12, or phase 11 plus phase 22) and then call only phase 33 (solve) inside the loop. Or, if the matrix changes only slightly, you could again reuse the factorization, either in the iterative refinement process or as a preconditioner within CGS/CG.
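The factor-once, solve-many pattern can be sketched without MKL. The example below is not PARDISO: it swaps in a small dense Cholesky factorization (valid only for symmetric positive definite matrices) as a stand-in, with `cholesky_factor` playing the role of phases 11 + 22 and `cholesky_solve` the role of phase 33. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;
using Vector = std::vector<double>;

// "Factorization" step: dense Cholesky A = L * L^T for a symmetric positive
// definite A. Stand-in for PARDISO phases 11 + 22; done once per matrix.
Matrix cholesky_factor(const Matrix& a) {
    const std::size_t n = a.size();
    Matrix l(n, Vector(n, 0.0));
    for (std::size_t j = 0; j < n; ++j) {
        double d = a[j][j];
        for (std::size_t k = 0; k < j; ++k) d -= l[j][k] * l[j][k];
        assert(d > 0.0);  // holds only for positive definite input
        l[j][j] = std::sqrt(d);
        for (std::size_t i = j + 1; i < n; ++i) {
            double s = a[i][j];
            for (std::size_t k = 0; k < j; ++k) s -= l[i][k] * l[j][k];
            l[i][j] = s / l[j][j];
        }
    }
    return l;
}

// "Solve" step: forward then backward substitution with the stored factor.
// Stand-in for PARDISO phase 33; cheap enough to repeat for each right-hand side.
Vector cholesky_solve(const Matrix& l, const Vector& b) {
    const std::size_t n = l.size();
    Vector y(n), x(n);
    for (std::size_t i = 0; i < n; ++i) {        // solve L * y = b
        double s = b[i];
        for (std::size_t k = 0; k < i; ++k) s -= l[i][k] * y[k];
        y[i] = s / l[i][i];
    }
    for (std::size_t i = n; i-- > 0;) {          // solve L^T * x = y
        double s = y[i];
        for (std::size_t k = i + 1; k < n; ++k) s -= l[k][i] * x[k];
        x[i] = s / l[i][i];
    }
    return x;
}
```

In a loop, `cholesky_factor` would be called only when the matrix changes, while `cholesky_solve` is called for every new right-hand side with the same stored factor.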

Also, can you give more information about the slowdown? What is the size of your matrix? How do the LP64 and ILP64 times compare with each other? Which threading are you using? Is the slowdown observed for the entire iterative loop (in which case it may be caused by other expensive routines, not only PARDISO) or specifically for PARDISO?
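One way to answer the last question is to time the PARDISO calls separately from the rest of the loop body. The sketch below is a small accumulator built on `std::chrono::steady_clock`; the class name `SectionTimer` and the labels are hypothetical, and it measures wall-clock time only.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Accumulates wall-clock seconds and call counts per labeled section.
class SectionTimer {
public:
    // Runs `work` and adds its elapsed time to the bucket named `label`.
    template <typename F>
    void time(const std::string& label, F&& work) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        totals_[label] += std::chrono::duration<double>(stop - start).count();
        ++calls_[label];
    }

    // Total seconds recorded under `label` (0 if never timed).
    double seconds(const std::string& label) const {
        const auto it = totals_.find(label);
        return it == totals_.end() ? 0.0 : it->second;
    }

    // Number of timed calls recorded under `label` (0 if never timed).
    long calls(const std::string& label) const {
        const auto it = calls_.find(label);
        return it == calls_.end() ? 0 : it->second;
    }

private:
    std::map<std::string, double> totals_;
    std::map<std::string, long> calls_;
};
```

Wrapping each phase call (for example under labels such as "phase 11", "phase 22", "phase 33") and the remaining loop body under its own label shows directly which part accounts for the difference between the two builds.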

Best,
Kirill

di_luca__daniele1 · 08-01-2019

Dear Kirill,

Thank you very much for your help, much appreciated.

Actually, I’m using 32-bit integers only, so I have already switched to the lp64 interface.

Since it may change dramatically, the sparse matrix needs to be re-factored at every loop iteration. To speed up the procedure, despite the loss of accuracy, I’m trying to avoid heavy use of (preconditioned) conjugate gradient refinement, so I set the maximum number of iterative refinement steps (iparm[7]) to 2, but only during phase 11 and phase 22, and I set it to 0 when the phase is 33. Is that fine?

I’m using the multithreaded version of the MKL library, and I linked it statically using:

mkl_intel_lp64.lib

mkl_core.lib

mkl_intel_thread.lib

libiomp5md.lib

I set the Pardiso parameters as shown below (referring to the input parameters list at https://software.intel.com/en-us/mkl-developer-reference-c-pardiso#BEA1BEA6-A0C0-46C4-A8EC-EC42BA473E1D):

std::vector<void*> pt(64); (I pass pt.data() as the argument to Pardiso)

maxfct=1;

mnum=1;

mtype=-2;

nrhs=0; (I set it to 1 only when Pardiso phase is 33)

And the iparm parameters:

iparm[0]=1;

iparm[1]=2;

iparm[3]= number of threads/processors (in my case 8; I also set this earlier by calling mkl_set_num_threads(…));

iparm[7]=2; (I set it to 0 when Pardiso phase is 33)

iparm[9]=12;

iparm[10]=1;

iparm[12]=1;

iparm[17]=-1;

iparm[20]=2;

Would setting iparm[23]=1 help improve performance if the number of threads is 8 or fewer?

About the test I was looking at:

- the sparse matrix size is about 21000 x 21000 and it contains about 239,000 non-zero values.

I noticed a difference in speed when changing how the Pardiso internal pointer ‘pt’ is declared:

if ‘pt’ is defined as vector<void*> pt(64) (with pt.data() passed as the argument to Pardiso), the total time spent is 440 seconds, while if ‘pt’ is defined as vector<int> pt(128), the total time spent is about 420 seconds.
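As a quick check on the two declarations, the sketch below computes how many bytes of handle storage each one provides. The helper names are hypothetical; the code only compares sizes and does not call Pardiso.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes of handle storage provided by vector<void*> pt(64).
inline std::size_t pt_bytes_as_pointers() {
    std::vector<void*> pt(64, nullptr);  // 64 pointer-sized entries, zeroed
    return pt.size() * sizeof(void*);
}

// Bytes of handle storage provided by vector<int> pt(128).
inline std::size_t pt_bytes_as_ints() {
    std::vector<int> pt(128, 0);         // 128 int entries, zeroed
    return pt.size() * sizeof(int);
}
```

On a typical 64-bit build with 8-byte pointers and 4-byte int, both come to 512 bytes, so the two declarations hand over the same amount of zero-initialized storage. Under that assumption, a difference of about 20 seconds out of 440 may have another cause, such as run-to-run variation, and is worth confirming with repeated runs.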

I’m not sure the slowdown is due (only) to Pardiso, but I need to know whether I have configured it optimally.

Thank you in advance,

Kind regards,

Daniele
