Threading Problem in Pardiso

yuan__mengqing · ‎12-19-2018

Hello,

We meet a problem in using Pardiso with multiple threads. Our purpose is solving: AX=B, X is a non-symmetry sparse matrix, A & B are vector. What we do is initialize the Pardiso, LU decomposition by phase 22, then call function "pardiso()" to solve the problem by phase 33.

The phenomenon is: if setting up the MKL by "mkl_set_dynamic( true )" and do nothing else, the solver always uses 1 thread only, not matter how big the matrix could be, from 1K unknown to 1M unknown. If setting up the MKL by "mkl_set_dynamic( true )" and "mkl_set_num_threads(4)", we can see 4 threads are used (CPU: i7-3770, i7-5820k, and dual xeon x5460), but the solving speed slow down by spend double time.

We also use MKL blas functions in other parts of our code, "mkl_set_dynamic( true )" works well, and we can observe multiple threads are used and can get a linear sppedup.

So, our question is how to enable Pardiso with multiple threads and get a reasonable speedup?

I'd appreciate any input and ideas,

mq

Spencer_P_Intel · ‎12-19-2018

mq,

A couple of questions. Did you mean X and B were vectors and A is a matrix ?

What iparm settings are you using? Can you provide an output log of an example run that prints out the iparm variables as well as the other standard pardiso output?

iparm[1] = 3 turns on METIS with threading for improved phase 11 reordering performance

iparm[24] affects the parallelism of the solve step... =0 (Default) is sequential, =1 does parallel solves, =2 may use parallelism based on rhs instead of matrix or for one rhs is the same as =1.

There are a lot of other choices for iparm settings which can affect the performance of your algorithm.

Now, you should use (for a lot of high performance data intensive operations like found in math libraries) as many threads as there are physical cores. So if you have hyperthreading turned on, you should probably only use half the available threads (=the number of physical cores). You are correctly using the mkl_set_num_threads() and mkl_set_dynamic() which tell it how many threads are available to be used and give it some options to adjust. The mkl_set_num_threads() is required.

Hope this helps somewhat and I look forward to seeing more details.

Spencer

wen_qiang_z_ · ‎12-19-2018

So I`m wondering if we can use all the threads we have to get the best performance of our computer?

And is it true that Pardiso 6.0 performs better than MKL PARDISO as proposed in their website?
https://pardiso-project.org/

Gennady_F_Intel · ‎12-19-2018

We didn't compare the performance with this version of Pardiso. Yo could take the latest Intel MKL Pardiso v.2019u1 and make the comparision on your side.

yuan__mengqing · ‎12-19-2018

Sorry for making a mistake in A and X, actually, A is the Sparse matrix, X is the unknown vector that need to be solved.

Following is the Iparm we use
---------------------------------------------------------
for (int i = 0; i < 64; i++) mIparm = 0;

    /////////////
    mIparm[0] = 1; /* No solver default */
    mIparm[1] = 2; /* Fill-in reordering from METIS */

    mIparm[3] = 0; /* No iterative-direct algorithm */
    mIparm[4] = 0; /* No user fill-in reducing permutation */
    mIparm[5] = 0; /* Write solution into x */
    mIparm[6] = 0; /* Not in use */
    mIparm[7] = 2; /* Max numbers of iterative refinement steps */
    mIparm[8] = 0; /* Not in use */
    mIparm[9] = 13; /* Perturb the pivot elements with 1E-13 */
    mIparm[10] = 1; /* Use nonsymmetric permutation and scaling MPS */
    mIparm[11] = 0; /* Not in use */
    mIparm[12] = 1; /* Maximum weighted matching algorithm is switched-on (default for non-symmetric) */
    mIparm[13] = 0; /* Output: Number of perturbed pivots */
    mIparm[14] = 0; /* Not in use */
    mIparm[15] = 0; /* Not in use */
    mIparm[16] = 0; /* Not in use */
    mIparm[17] = -1; /* Output: Number of nonzeros in the factor LU */
    mIparm[18] = -1; /* Output: Mflops for LU factorization */
    mIparm[19] = 0; /* Output: Numbers of CG Iterations */
    mIparm[27] = 1; /* 0 for double, 1 for float */
    mIparm[59] = 0; /* Padiso versions: in-core or out-of-core */

    ////////////////////////////////////////////////////////////////
    mMaxfct = 1; /* Maximum number of numerical factorizations. */
    mMnum = 1; /* Which factorization to use. */
    mMsglvl = 0; /* Do not print statistical information in file */
    mError = 0; /* Initialize error flag */
    mMtype = 11; /* Real unsymmetric matrix */
    mNrhs = 1; /* Number of right hand sides. */

---------------------------------------------------------------------------------

There is only one rhs in the system. In current code, we don't set Iparm[24]. But I already test Iparm[24]=1, it does not help, 4 threads make the code slower than the single thread.

A more detail about the code is, our code is a transient EM simulator, it will run thousands steps. In each step, AX=B need to be solved. But the A is a fix matrix in time stepping, only B is changed in each step.

the detail scheme of our code is:
--------------------------------------------
Init Pardiso
Set A to Pardiso
LU decomposition for A and keep all values (phase 22)
time stepping to solve AX=B (in each time step, B is changed) (phase 33)
Terminate Pardiso (phase -1)
--------------------------------------------

The test method is, for the same case, the same matrix, the same step number, what is the total time to solve the problem. In the test, 4 threads need a much longer time compared to single thread; 2 threads has a similar performance as single thread. We can't sure whether we make something wrong in the code scheme, or don't set up Pardiso correctly.

Gennady_F_Intel · ‎12-20-2018

Spencer recommended to try iparm[24] but I don't you tried. the similar with regard to reordering part - iparm[2] == 3.

yuan__mengqing · ‎12-20-2018

We test iparm[24] = 1 or 2; and iparm[2] = 3; it looks like change nothing. But what we use is v2013, maybe this version is too old, we ask one of our friend who has v2019 to test, he can get a speedup by 2.4 with 8 & 12 threads; 2 with 4 threads. We think that is OK and will find a time to switch to v2019.

Thanks for your help.

Spencer_P_Intel · ‎12-20-2018

I wanted to add one last note on things that might help:

It definitely doesn't hurt to try giving it all the physical threads available. In reality the parallelism for solve step is actually tied very directly to the sparsity pattern of the L and U factors which are tied to the sparsity pattern of the original matrix. You could imagine performing the LU factorization on a tridiagonal matrix system. At the end of the day, your L and U matrices have a main diagonal and an off diagonal. In this case, there is no parallelism at all to be exploited because each node (row or column) depends on another row or column being finished first. the elimination tree is a path. In this case, it doesn't matter how many threads you have, it can only use 1 thread at a time. Now, real matrices from applications have varying amounts of this phenomena so the answer of how many threads could be used is highly dependent on the system you are solving. If you are going to be solving the same system over and over, it might make sense to determine the optimal number of threads and only give it that much. But in general, it is better to give it all of them and let the sparsity structure (at the end of the day the independent branches of the elimination tree govern the parallelism working from leaves up to root) decide how much parallelism can be used. We typically observe the greatest benefit from parallelism in the factorize stage (phase 22) and in parts of the reordering (phase 11) stage. Solving can get some benefit and we try to give it as much as is possible but it is definitely not as much as for the phase 22.

I am glad to hear that MKL2019 is behaving more as expected for you. We did do some big improvements for pardiso between 2013 and now.

Some other things you might play with. The weighted matching and scaling are good for improving stability of solution (especially in indefinite systems) but they often slow things way down. If it is possible to solve without them to desired accuracy, it is worth it. The factorization algorithm is also something you can play with. iparm[23] governs the various factorization algorithms. We generally recommend to try with iparm[23] =1 in general or =10 for nonsymmetric to see if it can help give better performance. The original algorithm (iparm[23]=0) is reliable but may not give as good performance as the newer options. If you have time, you should try those out. Also, the vbsr matrix storage format can help quite a bit for certain matrix structures. Try using something like iparm[36]=-90 to see if it helps with the other factorization routines as well.

Best of luck!