negi__ashish

Beginner

10-12-2011
02:36 AM

223 Views

performance of MKL pardiso

My test model consists of a coefficient matrix and a Jacobian matrix, each of size 370000 x 370000. In a single iteration, the two systems take around 36 s and 65 s respectively for phases 11, 22 and 33 combined. The system is a 3.46 GHz 4-core Xeon with 20 GB RAM.

I would like to solve one iteration of a model of at least 1 million x 1 million in 60 s, including both the symmetric and the unsymmetric system. When I ran Ansys on the same system with a model of this size, it took barely 30 s to solve 4 iterations (each iteration includes one symmetric and one unsymmetric system). Any suggestions for improving the performance would be greatly appreciated.

Thanks,

Ashish

10 Replies

Konstantin_A_Intel

Employee

10-12-2011
02:42 AM

Please make sure you are running the threaded version of MKL. For this you need to link with the libmkl_intel_thread library; please refer to the MKL Link Line Advisor:

http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/

You can also check how many threads PARDISO used by setting msglvl=1.
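With msglvl=1, PARDISO prints a statistics line of the form `< Parallel Direct Factorization with #processors: > N`. If that output is captured into a string (for example by redirecting it to a file and reading the file back), the thread count can be extracted from it. The following is only a minimal sketch under that assumption; `reported_thread_count` is a hypothetical helper, not part of MKL.

```cpp
#include <cassert>
#include <string>

// Returns the thread count found on the "#processors:" line of captured
// PARDISO msglvl=1 output, or -1 if no such line or no number is found.
// Hypothetical helper for illustration; it only parses text.
int reported_thread_count(const std::string& log) {
    const std::string key = "#processors:";
    std::string::size_type pos = log.find(key);
    if (pos == std::string::npos) return -1;
    pos += key.size();
    // Skip non-digit characters, but stay on the same line.
    while (pos < log.size() && log[pos] != '\n' &&
           (log[pos] < '0' || log[pos] > '9')) {
        ++pos;
    }
    if (pos >= log.size() || log[pos] == '\n') return -1;
    int value = 0;
    while (pos < log.size() && log[pos] >= '0' && log[pos] <= '9') {
        value = value * 10 + (log[pos] - '0');
        ++pos;
    }
    return value;
}
```

Comparing the returned value with the number of physical cores is a quick way to confirm that the threaded library is really in use.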

Regards,

Konstantin

Alexander_K_Intel2

Employee

10-12-2011
02:43 AM

Could you set msglvl to 1 and post the PARDISO output in this topic? It would help us give you some advice.

With best regards,

Alexander Kalinkin

negi__ashish

Beginner

10-12-2011
04:08 AM

I am using threaded MKL and my OS is Windows XP. The following libraries are included in my project:

mkl_solver_lp64.lib

mkl_intel_lp64.lib

mkl_intel_thread.lib

mkl_core.lib

libiomp5md.lib

Here is the output for the unsymmetric Jacobian:

The local (internal) PARDISO version is : 103000116

1-based array indexing is turned ON

PARDISO double precision computation is turned ON

METIS algorithm at reorder step is turned ON

Scaling is turned ON

Matching is turned ON

Summary: ( reordering phase )

================

Times:

======

Time spent in calculations of symmetric matrix portrait(fulladj): 0.009621 s

Time spent in reordering of the initial matrix(reorder) : 2.682449 s

Time spent in symbolic factorization(symbfct) : 0.701643 s

Time spent in data preparations for factorization(parlist) : 0.033879 s

Time spent in allocation of internal data structures(malloc) : 0.066773 s

Time spent in additional calculations : 0.336471 s

Total time spent : 3.830835 s

Statistics:

===========

< Parallel Direct Factorization with #processors: > 6

< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b>

#equations: 372745

#non-zeros in A: 5495763

non-zeros in A (%): 0.003956

#right-hand sides: 1

< Factors L and U >

#columns for each panel: 72

#independent subgraphs: 0

< Preprocessing with state of the art partitioning metis>

#supernodes: 154074

size of largest supernode: 3960

number of nonzeros in L 268393909

number of nonzeros in U 260128570

number of nonzeros in L+U 528522479

=== PARDISO is running in In-Core mode, because iparam(60)=0 ===

Percentage of computed non-zeros for LL^T factorization

0 % ... 100 %

================ PARDISO: solving a real struct. sym. system ================

Single-level factorization algorithm is turned ON

Summary: ( factorization phase )

================

Times:

======

Time spent in copying matrix to internal data structure(A to LU): 0.000000 s

Time spent in factorization step(numfct) : 51.146111 s

Time spent in allocation of internal data structures(malloc) : 0.000339 s

Time spent in additional calculations : 0.000001 s

Total time spent : 51.146451 s

Statistics:

===========

< Parallel Direct Factorization with #processors: > 6

< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b>

#equations: 372745

#non-zeros in A: 5495763

non-zeros in A (%): 0.003956

#right-hand sides: 1

< Factors L and U >

#columns for each panel: 72

#independent subgraphs: 0

< Preprocessing with state of the art partitioning metis>

#supernodes: 154074

size of largest supernode: 3960

number of nonzeros in L 268393909

number of nonzeros in U 260128570

number of nonzeros in L+U 528522479

gflop for the numerical factorization: 1708.338114

gflop/s for the numerical factorization: 33.401134

================ PARDISO: solving a real struct. sym. system ================

Summary: ( solution phase )

================

Times:

======

Time spent in direct solver at solve step (solve) : 0.316364 s

Time spent in additional calculations : 0.631197 s

Total time spent : 0.947561 s

Statistics:

===========

< Parallel Direct Factorization with #processors: > 6

< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b>

#equations: 372745

#non-zeros in A: 5495763

non-zeros in A (%): 0.003956

#right-hand sides: 1

< Factors L and U >

#columns for each panel: 72

#independent subgraphs: 0

< Preprocessing with state of the art partitioning metis>

#supernodes: 154074

size of largest supernode: 3960

number of nonzeros in L 268393909

number of nonzeros in U 260128570

number of nonzeros in L+U 528522479

gflop for the numerical factorization: 1708.338114

gflop/s for the numerical factorization: 33.401134
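To see where the time goes in the unsymmetric run, the three "Total time spent" values from the log above can be compared directly. The sketch below only does that arithmetic; the struct and function names are made up for illustration.

```cpp
#include <cassert>

// "Total time spent" per PARDISO phase, in seconds.
struct PhaseTimes {
    double reorder;    // phase 11: reordering and symbolic factorization
    double factorize;  // phase 22: numerical factorization
    double solve;      // phase 33: forward/backward solve and refinement
};

// Sum of the three phases.
double total_seconds(const PhaseTimes& t) {
    return t.reorder + t.factorize + t.solve;
}

// Fraction of the total time spent in numerical factorization.
double factorization_share(const PhaseTimes& t) {
    const double total = total_seconds(t);
    assert(total > 0.0);
    return t.factorize / total;
}

// Values copied from the unsymmetric log above.
const PhaseTimes kUnsymmetricRun = {3.830835, 51.146451, 0.947561};
```

For these numbers the factorization phase accounts for a little over 90% of the run, so reordering and solve are not where the time is being lost.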

I could not print the same output for the symmetric coefficient matrix because the output exceeded the length limit of the Windows command window. Here is the partial output; I will try some other way to print the rest.

Times:

======

Time spent in copying matrix to internal data structure(A to LU): 0.000000 s

Time spent in factorization step(numfct) : 23.408378 s

Time spent in allocation of internal data structures(malloc) : 0.000514 s

Time spent in additional calculations : 0.000001 s

Total time spent : 23.408893 s

Statistics:

===========

< Parallel Direct Factorization with #processors: > 6

< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b>

#equations: 372745

#non-zeros in A: 2934254

non-zeros in A (%): 0.002112

#right-hand sides: 1

< Factors L and U >

#columns for each panel: 96

#independent subgraphs: 0

< Preprocessing with state of the art partitioning metis>

#supernodes: 153764

size of largest supernode: 3960

number of nonzeros in L 269145349

number of nonzeros in U 1

number of nonzeros in L+U 269145350

gflop for the numerical factorization: 869.989754

gflop/s for the numerical factorization: 37.165742

================ PARDISO: solving a symm. posit. def. system ================

Summary: ( solution phase )

================

Times:

======

Time spent in direct solver at solve step (solve) : 0.298444 s

Time spent in additional calculations : 0.619958 s

Total time spent : 0.918402 s

Statistics:

===========

< Parallel Direct Factorization with #processors: > 6

< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b>

#equations: 372745

#non-zeros in A: 2934254

non-zeros in A (%): 0.002112

#right-hand sides: 1

< Factors L and U >

#columns for each panel: 96

#independent subgraphs: 0

< Preprocessing with state of the art partitioning metis>

#supernodes: 153764

size of largest supernode: 3960

number of nonzeros in L 269145349

number of nonzeros in U 1

number of nonzeros in L+U 269145350

gflop for the numerical factorization: 869.989754

gflop/s for the numerical factorization: 37.165742
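The statistics in the two logs can be cross-checked with a little arithmetic: the reported GFLOP/s should equal the operation count divided by the numfct time, and the factor sizes give a lower bound on memory use. The sketch below assumes 8-byte double values and ignores index arrays and workspace; all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Achieved factorization rate: reported gflop divided by numfct seconds.
double achieved_gflops(double gflop, double seconds) {
    assert(seconds > 0.0);
    return gflop / seconds;
}

// Storage for the factor values alone, in GiB, assuming 8-byte doubles.
// Index arrays and solver workspace are not counted.
double factor_storage_gib(std::int64_t nonzeros_in_factors) {
    return static_cast<double>(nonzeros_in_factors) * 8.0 /
           (1024.0 * 1024.0 * 1024.0);
}

// Entries in the upper triangle (diagonal included) of a structurally
// symmetric pattern whose diagonal is fully stored.
std::int64_t upper_triangle_nonzeros(std::int64_t full_nonzeros,
                                     std::int64_t n) {
    return (full_nonzeros + n) / 2;
}
```

With the values above, `upper_triangle_nonzeros(5495763, 372745)` gives 2934254, which matches the nonzero count reported for the symmetric matrix. That is consistent with both systems sharing one sparsity pattern, although the logs alone do not prove it.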

negi__ashish

Beginner

10-12-2011
07:14 AM

I am copying the part of the code that sets the input and control flags for both the symmetric and the unsymmetric solver. This may be helpful.

Thanks,

Ashish

INPUT AND CONTROL FLAG FOR SYMMETRIC

/* RHS and solution vectors. */

MKL_INT l_nNRHS = 1; /* Number of right hand sides. */

/* Internal solver memory pointer l_pMemPt. */

/* void *l_pMemPt[64] is pointer-sized on both 32-bit and 64-bit builds. */

/* Note: long int is only 32 bits on 64-bit Windows, so it is too small there. */

void *l_pMemPt[64];

/* Pardiso control parameters. */

MKL_INT l_nIPARM[64];

MKL_INT l_nMAXFCT, l_nMNUM, l_nPHASE, l_nERROR, l_nMSGLVL;

/* Auxiliary variables. */

double l_dDDUM; /* Double dummy */

MKL_INT l_nIDUM; /* Integer dummy. */

/* -------------------------------------------------------------------- */

/* .. Setup Pardiso control parameters. */

/* -------------------------------------------------------------------- */

for (int i = 0; i < 64; i++)

l_nIPARM[i] = 0;

l_nIPARM[0] = 1; /* No solver default */

l_nIPARM[1] = 2; /* Fill-in reordering from METIS */

/* Numbers of processors, value of OMP_NUM_THREADS */

l_nIPARM[2] = 0; //use all the available processors

l_nIPARM[3] = 0; /* No iterative-direct algorithm */

l_nIPARM[4] = 0; /* No user fill-in reducing permutation */

l_nIPARM[5] = 0; /* Write solution into x */

l_nIPARM[6] = 0; /* Not in use */

l_nIPARM[7] = 2; /* Max numbers of iterative refinement steps */

l_nIPARM[8] = 0; /* Not in use */

l_nIPARM[9] = 13; /* Perturb the pivot elements with 1E-13 */

l_nIPARM[10] = 1; /* Use nonsymmetric permutation and scaling MPS */

l_nIPARM[11] = 0; /* Not in use */

l_nIPARM[12] = 0; /* Maximum weighted matching algorithm is switched-off (default for symmetric). Try l_nIPARM[12] = 1 in case of inappropriate accuracy */

l_nIPARM[13] = 0; /* Output: Number of perturbed pivots */

l_nIPARM[14] = 0; /* Not in use */

l_nIPARM[15] = 0; /* Not in use */

l_nIPARM[16] = 0; /* Not in use */

l_nIPARM[17] = 0; /* Output: Number of nonzeros in the factor LU */

l_nIPARM[18] = 0; /* Output: Mflops for LU factorization */

l_nIPARM[19] = 0; /* Output: Numbers of CG Iterations */

l_nMAXFCT = 1; /* Maximum number of numerical factorizations. */

l_nMNUM = 1; /* Which factorization to use. */

l_nMSGLVL = 0; /* do not Print statistical information in file */

l_nERROR = 0; /* Initialize l_nERROR flag */

/* -------------------------------------------------------------------- */

/* .. Initialize the internal solver memory pointer. This is only */

/* necessary for the FIRST call of the PARDISO solver. */

/* -------------------------------------------------------------------- */

for (int i = 0; i < 64; i++)

l_pMemPt[i] = 0;

/* -------------------------------------------------------------------- */

/* .. Reordering and Symbolic Factorization. This step also allocates */

/* all memory that is necessary for the factorization. */

/* -------------------------------------------------------------------- */

l_nPHASE = 11;

PARDISO (l_pMemPt, &l_nMAXFCT, &l_nMNUM, &l_nMTYPE, &l_nPHASE,

&m_nNodeCnt, l_pAX, l_pRowIdx, l_pCol, &l_nIDUM, &l_nNRHS,

l_nIPARM, &l_nMSGLVL, &l_dDDUM, &l_dDDUM, &l_nERROR);

INPUT AND CONTROL FLAG FOR UNSYMMETRIC

/* .. Initialize the internal solver memory pointer. This is only */

/* necessary for the FIRST call of the PARDISO solver. */

/* -------------------------------------------------------------------- */

m_pMemPt = new MEMPTR[64];

for (int i = 0; i < 64; i++)

m_pMemPt[i] = 0;

m_nMTYPE = 1; /* Real structurally symmetric matrix */

/* RHS and solution vectors. */

m_nNRHS = 1; /* Number of right hand sides. */

/* Pardiso control parameters. */

m_nIPARM = new MKL_INT[64];

/* -------------------------------------------------------------------- */

/* .. Pardiso control parameters. */

/* -------------------------------------------------------------------- */

for (int i = 0; i < 64; i++)

m_nIPARM[i] = 0;

m_nIPARM[0] = 1; /* No solver default */

m_nIPARM[1] = 2; /* Fill-in reordering: the solver uses the nested dissection algorithm from METIS */

/* Numbers of processors, value of MKL_NUM_THREADS. If the variable MKL_NUM_THREADS is not defined, then the solver uses all available processors */

m_nIPARM[2] = 0; //currently is not used

m_nIPARM[3] = 0; /* No iterative-direct algorithm */

m_nIPARM[4] = 0; /* No user fill-in reducing permutation */

m_nIPARM[5] = 0; /* Write solution into x */

m_nIPARM[6] = 0; /* Not in use */

m_nIPARM[7] = 2; /* Max numbers of iterative refinement steps */

m_nIPARM[8] = 0; /* This parameter is reserved for future use. Its value must be set to 0 */

m_nIPARM[9] = 13; /* Perturb the pivot elements with 1E-13 */

m_nIPARM[10] = 1; /* Use nonsymmetric permutation and scaling MPS */

m_nIPARM[11] = 0; /* PARDISO solves a linear system Ax = b (default value). */

m_nIPARM[12] = 1; /* Maximum weighted matching algorithm is switched on (the log reports "Matching is turned ON") */

m_nIPARM[13] = 0; /* Output: Number of perturbed pivots */

m_nIPARM[14] = 0; /* Not in use */

m_nIPARM[15] = 0; /* Not in use */

m_nIPARM[16] = 0; /* Not in use */

m_nIPARM[17] = 0; /* Output: Number of nonzeros in the factor LU */

m_nIPARM[18] = 0; /* Output: Mflops for LU factorization */

m_nIPARM[19] = 0; /* Output: Numbers of CG Iterations */

m_nMAXFCT = 1; /* Maximum number of numerical factorizations. */

m_nMNUM = 1; /* Which factorization to use. */

m_nMSGLVL = 0; /* do not Print statistical information in file */

m_nERROR = 0; /* Initialize m_nERROR flag */

m_nPHASE = 11;

PARDISO (m_pMemPt, &m_nMAXFCT, &m_nMNUM, &m_nMTYPE, &m_nPHASE,

&m_nXSize, m_dAX, m_nRowIdx, m_nCol, &m_nIDUM, &m_nNRHS,

m_nIPARM, &m_nMSGLVL, &m_dDDUM, &m_dDDUM, &m_nERROR);
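The two setup blocks above set the same control values except for the weighted-matching switch in entry 12. One way to avoid maintaining two copies is a small builder function, sketched below. This is an illustration only: `make_iparm` and `IparmArray` are hypothetical names, and `int` is used in place of `MKL_INT`, which assumes the LP64 interface (as suggested by mkl_intel_lp64.lib in the link list).

```cpp
#include <array>
#include <cassert>

// int stands in for MKL_INT here (assumes the LP64 interface).
using IparmArray = std::array<int, 64>;

// Builds the control array shared by the symmetric and unsymmetric setups.
// Only the weighted-matching switch differs between the two.
IparmArray make_iparm(bool use_weighted_matching) {
    IparmArray iparm{};  // value-initialised: every entry starts at 0
    iparm[0]  = 1;       // do not use the solver defaults
    iparm[1]  = 2;       // fill-in reordering from METIS
    iparm[7]  = 2;       // max number of iterative refinement steps
    iparm[9]  = 13;      // perturb pivot elements with 1E-13
    iparm[10] = 1;       // use nonsymmetric permutation and scaling
    iparm[12] = use_weighted_matching ? 1 : 0;
    return iparm;
}
```

The symmetric setup would then use `make_iparm(false)` and the unsymmetric one `make_iparm(true)`, passing `iparm.data()` where the solver call expects the control array.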

Alexander_K_Intel2

Employee

10-13-2011
10:39 PM

Sorry for the delay. It is really strange that PARDISO reports 6 threads on your 4-core Xeon. Did you change the default number of threads used by MKL?

With best regards,

Alexander Kalinkin

negi__ashish

Beginner

10-15-2011
04:44 AM

Thanks for the reply.

Yes. When I first reported this on the forum, I had set MKL_NUM_THREADS=4 on this 6-core machine. I later removed that setting, which is why PARDISO now reports 6 processors.

Thanks,

Ashish

mecej4

Black Belt

10-15-2011
07:33 AM

This suggests it is worth putting some effort into examining whether the equations have a band structure and whether the bandwidth can be reduced by reordering.

Some background information on how the equations originated may help.
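Whether a matrix has a band structure can be checked directly on the CSR arrays that are passed to the solver. The sketch below computes the half-bandwidth, the largest |i - j| over all stored entries, assuming 1-based row pointers and column indices as reported in the log. `half_bandwidth` is a hypothetical helper written for this illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Half-bandwidth max|i - j| over the stored entries of a CSR matrix with
// 1-based row pointers (size n + 1) and 1-based column indices.
int half_bandwidth(const std::vector<int>& row_ptr,
                   const std::vector<int>& col_idx) {
    assert(!row_ptr.empty());
    const int n = static_cast<int>(row_ptr.size()) - 1;
    int bandwidth = 0;
    for (int row = 1; row <= n; ++row) {
        // Row entries occupy 1-based positions row_ptr[row-1] .. row_ptr[row]-1.
        for (int k = row_ptr[row - 1]; k < row_ptr[row]; ++k) {
            const int distance = std::abs(col_idx[k - 1] - row);
            if (distance > bandwidth) bandwidth = distance;
        }
    }
    return bandwidth;
}
```

Running it before and after a reordering shows how much that reordering actually narrowed the band.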

Alexander_K_Intel2

Employee

10-15-2011
11:08 PM

One additional question: which Ansys solver do you use?

With best regards,

Alexander Kalinkin

negi__ashish

Beginner

10-17-2011
12:09 AM

Thanks for the reply.

These equations are derived from a finite element formulation of the heat conduction equation using 4-node tetrahedral (TET) elements. Convection boundary conditions are defined on all boundaries.

Actually, I do use the reverse Cuthill-McKee method to reduce the bandwidth of the matrix, and all the data I shared earlier was obtained after bandwidth minimization. I have seen a reduction in computational time after minimizing the bandwidth. I am not sure it will always be beneficial, though, because a smaller bandwidth may not help a sparse solver.
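Because the model is a 3D tetrahedral mesh, a rough projection to 1 million unknowns is possible from the logged numbers. The sketch below assumes the classical asymptotic behaviour of nested dissection on well-shaped 3D meshes (factorization work growing like n^2, factor nonzeros like n^(4/3)) and assumes the measured GFLOP/s rate is sustained. Both are assumptions rather than properties confirmed for this model, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Scales a measured quantity from n_measured to n_target unknowns with a
// power law: value * (n_target / n_measured)^exponent.
double scale_power_law(double value, double n_measured, double n_target,
                       double exponent) {
    assert(n_measured > 0.0 && n_target > 0.0);
    return value * std::pow(n_target / n_measured, exponent);
}

// Assumed asymptotic exponents for nested dissection on 3D meshes.
const double kFlopExponent = 2.0;        // factorization work ~ n^2
const double kFillExponent = 4.0 / 3.0;  // factor nonzeros   ~ n^(4/3)

// Projected factorization time if the measured rate were sustained.
double projected_seconds(double gflop_measured, double gflops_rate,
                         double n_measured, double n_target) {
    assert(gflops_rate > 0.0);
    return scale_power_law(gflop_measured, n_measured, n_target,
                           kFlopExponent) / gflops_rate;
}
```

Under these assumptions, `projected_seconds(1708.338114, 33.401134, 372745, 1.0e6)` comes out at several minutes for the unsymmetric factorization alone, well above the 60 s target.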

Thanks,

Ashish

negi__ashish

Beginner

10-17-2011
12:19 AM

I use Ansys Mechanical to solve the nonlinear equations. I chose Sparse Solver and Newton-Raphson from the Ansys option list.

Thanks,

Ashish
