Community
cancel
Showing results for
Did you mean:
Beginner
268 Views

Hello,

We meet a problem in using Pardiso with multiple threads. Our purpose is solving: AX=B, X is a non-symmetry sparse matrix, A & B are vector. What we do is initialize the Pardiso,  LU decomposition by phase 22, then call function "pardiso()" to solve the problem by phase 33.

The phenomenon is: if setting up the MKL by  "mkl_set_dynamic( true )" and do nothing else, the solver always uses 1 thread only, not matter how big the matrix could be, from 1K unknown to 1M unknown.  If  setting up the MKL by  "mkl_set_dynamic( true )" and "mkl_set_num_threads(4)", we can see 4 threads are used (CPU: i7-3770,  i7-5820k, and dual xeon x5460), but the solving speed slow down by spend double time.

We also use MKL blas functions in other parts of our code, "mkl_set_dynamic( true )" works well, and we can observe multiple threads are used and can get a linear sppedup.

So, our question is how to enable Pardiso with multiple threads and get a reasonable speedup?

I'd appreciate any input and ideas,

mq

7 Replies
Employee
268 Views

mq,

A couple of questions.  Did you mean X and B were vectors and A is a matrix ?

What iparm settings are you using?  Can you provide an output log of an example run that prints out the iparm variables as well as the other standard pardiso output?

iparm[1] = 3  turns on METIS with threading for improved phase 11 reordering performance

iparm[24] affects the parallelism of the solve step...  =0 (Default) is sequential,  =1 does parallel solves, =2 may use parallelism based on rhs instead of matrix or for one rhs is the same as =1.

There are a lot of other choices for iparm settings which can affect the performance of your algorithm.

Now, you should use (for a lot of high performance data intensive operations like found in math libraries) as many threads as there are physical cores.  So if you have hyperthreading turned on, you should probably only use half the available threads (=the number of physical cores). You are correctly using the mkl_set_num_threads() and mkl_set_dynamic() which tell it how many threads are available to be used and give it some options to adjust.  The mkl_set_num_threads() is required.

Hope this helps somewhat and I look forward to seeing more details.

Spencer

Beginner
268 Views

And is it true that Pardiso 6.0 performs better than MKL PARDISO as proposed in their website?
https://pardiso-project.org/

Moderator
268 Views

We didn't compare the performance with this version of Pardiso. Yo could take the latest Intel MKL Pardiso v.2019u1 and make the comparision on your side.

Beginner
268 Views

Sorry for making a mistake in A and X, actually, A is the Sparse matrix,  X is the unknown vector that need to be solved.

Following is the Iparm we use
---------------------------------------------------------
for (int i = 0; i < 64; i++)  mIparm = 0;

/////////////
mIparm[0] = 1; /* No solver default */
mIparm[1] = 2; /* Fill-in reordering from METIS */

mIparm[3] = 0; /* No iterative-direct algorithm */
mIparm[4] = 0; /* No user fill-in reducing permutation */
mIparm[5] = 0; /* Write solution into x */
mIparm[6] = 0; /* Not in use */
mIparm[7] = 2; /* Max numbers of iterative refinement steps */
mIparm[8] = 0; /* Not in use */
mIparm[9] = 13; /* Perturb the pivot elements with 1E-13 */
mIparm[10] = 1; /* Use nonsymmetric permutation and scaling MPS */
mIparm[11] = 0; /* Not in use */
mIparm[12] = 1; /* Maximum weighted matching algorithm is switched-on (default for non-symmetric) */
mIparm[13] = 0; /* Output: Number of perturbed pivots */
mIparm[14] = 0; /* Not in use */
mIparm[15] = 0; /* Not in use */
mIparm[16] = 0; /* Not in use */
mIparm[17] = -1; /* Output: Number of nonzeros in the factor LU */
mIparm[18] = -1; /* Output: Mflops for LU factorization */
mIparm[19] = 0; /* Output: Numbers of CG Iterations */
mIparm[27] = 1; /* 0 for double, 1 for float */
mIparm[59] = 0; /* Padiso versions: in-core or out-of-core */

////////////////////////////////////////////////////////////////
mMaxfct = 1; /* Maximum number of numerical factorizations. */
mMnum = 1; /* Which factorization to use. */
mMsglvl = 0; /* Do not print statistical information in file */
mError = 0; /* Initialize error flag */
mMtype = 11; /* Real unsymmetric matrix */
mNrhs = 1; /* Number of right hand sides. */

---------------------------------------------------------------------------------

There is only one rhs in the system.  In current code, we don't set Iparm[24].  But I already test Iparm[24]=1, it does not help, 4 threads make the code slower than the single thread.

A more detail about the code is, our code is a transient EM simulator, it will run thousands steps. In each step, AX=B need to be solved. But the A is a fix matrix in time stepping, only B is changed in each step.

the detail scheme of our code is:
--------------------------------------------
Init Pardiso
Set A to Pardiso
LU decomposition for A and keep all values (phase 22)
time stepping to solve AX=B (in each time step, B is changed)   (phase 33)
Terminate Pardiso (phase -1)
--------------------------------------------

The test method is, for the same case, the same matrix, the same step number, what is the total time to solve the problem. In the test, 4 threads need a much longer time compared to single thread; 2 threads has a similar performance as single thread.  We can't sure whether we make something wrong in the code scheme, or don't set up Pardiso correctly.

Moderator
268 Views

Spencer recommended to try iparm[24]  but I don't you tried. the similar with  regard to reordering part - iparm[2] == 3.

Beginner
268 Views

We test iparm[24] = 1 or 2;  and iparm[2] = 3;  it looks like change nothing. But what we use is v2013, maybe this version is too old, we ask one of our friend who has v2019 to test, he can get a speedup by 2.4 with 8 & 12 threads; 2 with 4 threads. We think that is OK and will find a time to switch to v2019.