performance of MKL pardiso

negi__ashish · ‎10-12-2011

I have a large sparse system with nonlinear equations. I use in-core PARDISO, direct solver of MKL 10.3 (evaluation version), to solve symmetric (coefficient matrix) and non-symmetric(jacobian matrix) set.

My test model comprises of 370000x370000 size coeff and jacobian matrix. In single iteration, both set take around 36 and 65 sec respectively for phases 11,22 and 33 . System configuration is Xeon 3.46 GHz 4-core with 20GB RAM.

I would like to solve one iteration of at least 1millionx1million size model in 60 sec. including both symmetric and unsymmetric set. When I ran Ansys on same system with this size it hardly took 30 sec to solve 4 iterations (each iteration include one symmetric and one unsymmetric set). Any suggestion to increase the performance will be greatly appreciated.

Thanks,
Ashish

Konstantin_A_Intel · ‎10-12-2011

Hi Ashish,

Please make sure you are running threaded version of MKL. You should link with libmkl_intel_thread library for this, please refer to MKl link line adviser:
http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/

You may also look how many threads was used by PARDISO by setting msglvl=1.

Regards,
Konstantin

Alexander_K_Intel2 · ‎10-12-2011

Hi,

Could you set msglvl to 1 and include PARDISO output to this topic? It could help us to provide you some advices.

With best regards,

Alexander Kalinkin

negi__ashish · ‎10-12-2011

Hello,

I am using threaded MKL and my OS is windows xp. I have following libs included in my project
mkl_solver_lp64.lib
mkl_intel_lp64.lib
mkl_intel_thread.lib
mkl_core.lib
libiomp5md.lib

Here is output for unsymmetric jacobian

================ PARDISO: solving a real struct. sym. system ================
The local (internal) PARDISO version is : 103000116
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait(fulladj): 0.009621 s
Time spent in reordering of the initial matrix(reorder) : 2.682449 s
Time spent in symbolic factorization(symbfct) : 0.701643 s
Time spent in data preparations for factorization(parlist) : 0.033879 s
Time spent in allocation of internal data structures(malloc) : 0.066773 s
Time spent in additional calculations : 0.336471 s
Total time spent : 3.830835 s

Statistics:
===========
< Parallel Direct Factorization with #processors: > 6
< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b>
#equations: 372745
#non-zeros in A: 5495763
non-zeros in A (): 0.003956

#right-hand sides: 1

< Factors L and U >
#columns for each panel: 72
#independent subgraphs: 0
< Preprocessing with state of the art partitioning metis>
#supernodes: 154074
size of largest supernode: 3960
number of nonzeros in L 268393909
number of nonzeros in U 260128570
number of nonzeros in L+U 528522479
=== PARDISO is running in In-Core mode, because iparam(60)=0 ===
Percentage of computed non-zeros for LL^T factorization
0 %
1 %
2 %
3 %
4 %
5 %
6 %
7 %
8 %
9 %
10 %
11 %
12 %
13 %
14 %
15 %
16 %
17 %
18 %
19 %
20 %
21 %
22 %
23 %
24 %
25 %
26 %
27 %
28 %
29 %
30 %
31 %
32 %
33 %
34 %
35 %
36 %
37 %
38 %
39 %
40 %
41 %
42 %
43 %
44 %
45 %
46 %
47 %
48 %
49 %
50 %
51 %
52 %
53 %
54 %
55 %
56 %
57 %
58 %
59 %
60 %
61 %
62 %
63 %
64 %
65 %
66 %
67 %
68 %
69 %
70 %
71 %
73 %
74 %
75 %
76 %
77 %
78 %
79 %
80 %
81 %
82 %
83 %
84 %
85 %
86 %
87 %
88 %
89 %
90 %
91 %
92 %
93 %
94 %
95 %
96 %
97 %
98 %
99 %
100 %

================ PARDISO: solving a real struct. sym. system ================
Single-level factorization algorithm is turned ON

Summary: ( factorization phase )
================

Times:
======
Time spent in copying matrix to internal data structure(A to LU): 0.000000 s
Time spent in factorization step(numfct) : 51.146111 s
Time spent in allocation of internal data structures(malloc) : 0.000339 s
Time spent in additional calculations : 0.000001 s
Total time spent : 51.146451 s

Statistics:
===========
< Parallel Direct Factorization with #processors: > 6
< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b>
#equations: 372745
#non-zeros in A: 5495763
non-zeros in A (): 0.003956

#right-hand sides: 1

< Factors L and U >
#columns for each panel: 72
#independent subgraphs: 0
< Preprocessing with state of the art partitioning metis>
#supernodes: 154074
size of largest supernode: 3960
number of nonzeros in L 268393909
number of nonzeros in U 260128570
number of nonzeros in L+U 528522479
gflop for the numerical factorization: 1708.338114

gflop/s for the numerical factorization: 33.401134

================ PARDISO: solving a real struct. sym. system ================

Summary: ( solution phase )
================

Times:
======
Time spent in direct solver at solve step (solve) : 0.316364 s
Time spent in additional calculations : 0.631197 s
Total time spent : 0.947561 s

Statistics:
===========
< Parallel Direct Factorization with #processors: > 6
< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b>
#equations: 372745
#non-zeros in A: 5495763
non-zeros in A (): 0.003956

#right-hand sides: 1

< Factors L and U >
#columns for each panel: 72
#independent subgraphs: 0
< Preprocessing with state of the art partitioning metis>
#supernodes: 154074
size of largest supernode: 3960
number of nonzeros in L 268393909
number of nonzeros in U 260128570
number of nonzeros in L+U 528522479
gflop for the numerical factorization: 1708.338114

gflop/s for the numerical factorization: 33.401134

I could not print same for symmteric coefficient matrix as output string length exceeded limit of windows command window. Here is partial output. I will try some other way to print it

================

Times:
======
Time spent in copying matrix to internal data structure(A to LU): 0.000000 s
Time spent in factorization step(numfct) : 23.408378 s
Time spent in allocation of internal data structures(malloc) : 0.000514 s
Time spent in additional calculations : 0.000001 s
Total time spent : 23.408893 s

Statistics:
===========
< Parallel Direct Factorization with #processors: > 6
< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b>
#equations: 372745
#non-zeros in A: 2934254
non-zeros in A (): 0.002112

#right-hand sides: 1

< Factors L and U >
#columns for each panel: 96
#independent subgraphs: 0
< Preprocessing with state of the art partitioning metis>
#supernodes: 153764
size of largest supernode: 3960
number of nonzeros in L 269145349
number of nonzeros in U 1
number of nonzeros in L+U 269145350
gflop for the numerical factorization: 869.989754

gflop/s for the numerical factorization: 37.165742

================ PARDISO: solving a symm. posit. def. system ================

Summary: ( solution phase )
================

Times:
======
Time spent in direct solver at solve step (solve) : 0.298444 s
Time spent in additional calculations : 0.619958 s
Total time spent : 0.918402 s

Statistics:
===========
< Parallel Direct Factorization with #processors: > 6
< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b>
#equations: 372745
#non-zeros in A: 2934254
non-zeros in A (): 0.002112

#right-hand sides: 1

< Factors L and U >
#columns for each panel: 96
#independent subgraphs: 0
< Preprocessing with state of the art partitioning metis>
#supernodes: 153764
size of largest supernode: 3960
number of nonzeros in L 269145349
number of nonzeros in U 1
number of nonzeros in L+U 269145350
gflop for the numerical factorization: 869.989754

gflop/s for the numerical factorization: 37.165742

negi__ashish · ‎10-12-2011

Hi,

I am copying part of code which set the input and control flags for both symmetric and unsymmteric solver. This may be helpful.

Thanks,
Ashish

INPUT AND CONTROL FLAG FOR SYMMETRIC

MKL_INT l_nMTYPE = 2; /* Real symmetric matrix and positive definite*/
/* RHS and solution vectors. */
MKL_INT l_nNRHS = 1; /* Number of right hand sides. */
/* Internal solver memory pointer l_pMemPt, */
/* 32-bit: int l_pMemPt[64]; 64-bit: long int l_pMemPt[64] */
/* or void *l_pMemPt[64] should be OK on both architectures */
#ifdef WIN64
long int l_pMemPt[64];
#else
int l_pMemPt[64];
#endif
/* Pardiso control parameters. */
MKL_INT l_nIPARM[64];
MKL_INT l_nMAXFCT, l_nMNUM, l_nPHASE, l_nERROR, l_nMSGLVL;
/* Auxiliary variables. */
double l_dDDUM; /* Double dummy */
MKL_INT l_nIDUM; /* Integer dummy. */
/* -------------------------------------------------------------------- */
/* .. Setup Pardiso control parameters. */
/* -------------------------------------------------------------------- */
for(int i = 0; i < 64; i++)
l_nIPARM = 0;
l_nIPARM[0] = 1; /* No solver default */
l_nIPARM[1] = 2; /* Fill-in reordering from METIS */
/* Numbers of processors, value of OMP_NUM_THREADS */
l_nIPARM[2] = 0; //use all the available processors
l_nIPARM[3] = 0; /* No iterative-direct algorithm */
l_nIPARM[4] = 0; /* No user fill-in reducing permutation */
l_nIPARM[5] = 0; /* Write solution into x */
l_nIPARM[6] = 0; /* Not in use */
l_nIPARM[7] = 2; /* Max numbers of iterative refinement steps */
l_nIPARM[8] = 0; /* Not in use */
l_nIPARM[9] = 13; /* Perturb the pivot elements with 1E-13 */
l_nIPARM[10] = 1; /* Use nonsymmetric permutation and scaling MPS */
l_nIPARM[11] = 0; /* Not in use */
l_nIPARM[12] = 0; /* Maximum weighted matching algorithm is switched-off (default for symmetric). Try l_nIPARM[12] = 1 in case of inappropriate accuracy */
l_nIPARM[13] = 0; /* Output: Number of perturbed pivots */
l_nIPARM[14] = 0; /* Not in use */
l_nIPARM[15] = 0; /* Not in use */
l_nIPARM[16] = 0; /* Not in use */
l_nIPARM[17] = 0; /* Output: Number of nonzeros in the factor LU */
l_nIPARM[18] = 0; /* Output: Mflops for LU factorization */
l_nIPARM[19] = 0; /* Output: Numbers of CG Iterations */

l_nMAXFCT = 1; /* Maximum number of numerical factorizations. */
l_nMNUM = 1; /* Which factorization to use. */
l_nMSGLVL = 0; /* do not Print statistical information in file */
l_nERROR = 0; /* Initialize l_nERROR flag */
/* -------------------------------------------------------------------- */
/* .. Initialize the internal solver memory pointer. This is only */
/* necessary for the FIRST call of the PARDISO solver. */
/* -------------------------------------------------------------------- */
for (int i = 0; i < 64; i++)
l_pMemPt = 0;
/* -------------------------------------------------------------------- */
/* .. Reordering and Symbolic Factorization. This step also allocates */
/* all memory that is necessary for the factorization. */
/* -------------------------------------------------------------------- */
l_nPHASE = 11;
PARDISO (l_pMemPt, &l_nMAXFCT, &l_nMNUM, &l_nMTYPE, &l_nPHASE,
&m_nNodeCnt, l_pAX, l_pRowIdx, l_pCol, &l_nIDUM, &l_nNRHS,
l_nIPARM, &l_nMSGLVL, &l_dDDUM, &l_dDDUM, &l_nERROR);

INPUT AND CONTROL FLAG FOR UNSYMMETRIC
/* -------------------------------------------------------------------- */
/* .. Initialize the internal solver memory pointer. This is only */
/* necessary for the FIRST call of the PARDISO solver. */
/* -------------------------------------------------------------------- */
m_pMemPt = new MEMPTR[64];
for (int i = 0; i < 64; i++)
m_pMemPt = 0;

m_nMTYPE = 1; /* Real structurally symmetric matrix */
/* RHS and solution vectors. */
m_nNRHS = 1; /* Number of right hand sides. */
/* Pardiso control parameters. */
m_nIPARM = new MKL_INT[64];

/* -------------------------------------------------------------------- */
/* .. Pardiso control parameters. */
/* -------------------------------------------------------------------- */
for(int i = 0; i < 64; i++)
m_nIPARM = 0;
m_nIPARM[0] = 1; /* No solver default */
m_nIPARM[1] = 2; /* Fill-in reordering from METIS the solver uses the nested dissection algorithm from the METIS */
/* Numbers of processors, value of MKL_NUM_THREADS. If the variable MKL_NUM_THREADS is not defined, then the solver uses all available processors */
m_nIPARM[2] = 0; //currently is not used
m_nIPARM[3] = 0; /* No iterative-direct algorithm */
m_nIPARM[4] = 0; /* No user fill-in reducing permutation */
m_nIPARM[5] = 0; /* Write solution into x */
m_nIPARM[6] = 0; /* Not in use */
m_nIPARM[7] = 2; /* Max numbers of iterative refinement steps */
m_nIPARM[8] = 0; /* This parameter is reserved for future use. Its value must be set to 0 */
m_nIPARM[9] = 13; /* Perturb the pivot elements with 1E-13 */
m_nIPARM[10] = 1; /* Use nonsymmetric permutation and scaling MPS */
m_nIPARM[11] = 0; /* PARDISO solves a linear system Ax = b (default value). */
m_nIPARM[12] = 1; /* Maximum weighted matching algorithm is switched-off (default for symmetric). Try m_nIPARM[12] = 1 in case of inappropriate accuracy */
m_nIPARM[13] = 0; /* Output: Number of perturbed pivots */
m_nIPARM[14] = 0; /* Not in use */
m_nIPARM[15] = 0; /* Not in use */
m_nIPARM[16] = 0; /* Not in use */
m_nIPARM[17] = 0; /* Output: Number of nonzeros in the factor LU */
m_nIPARM[18] = 0; /* Output: Mflops for LU factorization */
m_nIPARM[19] = 0; /* Output: Numbers of CG Iterations */

m_nMAXFCT = 1; /* Maximum number of numerical factorizations. */
m_nMNUM = 1; /* Which factorization to use. */
m_nMSGLVL = 0; /* do not Print statistical information in file */
m_nERROR = 0; /* Initialize m_nERROR flag */

m_nPHASE = 11;
PARDISO (m_pMemPt, &m_nMAXFCT, &m_nMNUM, &m_nMTYPE, &m_nPHASE,
&m_nXSize, m_dAX, m_nRowIdx, m_nCol, &m_nIDUM, &m_nNRHS,
m_nIPARM, &m_nMSGLVL, &m_dDDUM, &m_dDDUM, &m_nERROR);

Alexander_K_Intel2 · ‎10-13-2011

Hi,

Sorry for delay. It's really strange that PARDISO return number of threads is equal to 6 on your 4 cores Xeon. Did you change default number of threads used by MKL?

With best regards,

Alexander Kalinkin

negi__ashish · ‎10-15-2011

Hi Alexander,

Thanks for reply.

Yes. When I first reported this in forum I had changed MKL_NUM_THREADS=4 on this 6-core machine. Later I removed it and thats why you see 6-cores printed by PARDISO.

Thanks,
Ashish

mecej4 · ‎10-15-2011

I see a significant reason for the slowness of the solver. You have about 3 million nonzeros in A, but 90 times that number of nonzeros in the L and U factors of A. Solving N equations with Z nonzero entries may be estimated to take (N.Z) operations, but the fill-in is causing a drastic slowdown.

It indicates the advisability of putting some effort into examining whether the equations have a band structure and whether that bandwidth can be reduced by reordering.

Some background information, on how the equations were originated, may help.

Alexander_K_Intel2 · ‎10-15-2011

Hi Ashish,

One additional question: What kind of solver from Ansys so you use?

With best regards,

Alexander Kalinkin

negi__ashish · ‎10-17-2011

Hi mecej4,

Thanks for reply.

These equations are derived from finite element formulation of heat conduction equation using 4 noded TET. Heat conduction equations have convection boundary condition defined for all the boundaries.

Actually, I do use reverse cuthill methd to reduce band width of matrix and all the data I had shared earlier is after bandwidth minimization. I have seen reduction in computaional time after minimization of bandwidth. I am not sure if it will always be benficial to reduce the bandwidth because it may not be useful for sprase solver.

Thanks,
Ashish

negi__ashish · ‎10-17-2011

Hi Alexander,

I use Ansys Mechanical to solve nonlinear equations. I chose Sparse Solver and Newton Raphson from the option list of Ansys.

Thanks,
Ashish