Showing results for

- Intel Community
- Software Development SDKs and Libraries
- Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library
- Memory cost and efficiency of Pardiso

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

Highlighted
##

danielsue

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

04-19-2013
09:38 AM

11 Views

Memory cost and efficiency of Pardiso

Hi All,

I have implemented Pardiso to solve my problem and it works fine. But the memory cost and efficiency of Pardiso is not outstanding. We have another sparse solver (**Solver A, not parallelized**) that use level-based preconditioner and CGS-based acceleration.The memory cost for Pardiso is about three to four times of Solver A. For most of my matrix, if I use only one processor for Pardiso, Solver A is much faster, if I use six processors, they are almost the same speed. For some large matrix, Solver A is even faster than Pardiso even if Pardiso uses 6 processors. I compared the runtime of numerical factorization and substitution, and found that solver A is much more efficient. We do not expect Pardiso to beat Solver A if Pardiso runs in serial mode. But we do hope Pardiso to beat Solver A when it runs in parallel mode using 6 processors.

The speedup for Pardiso seems good, I can get a speedup of 4.5 if I use 6 processors.

Here are the parameter I use.

maxfct = 1

mnum = 1

nrhs = 1

error = 0 ! initialize error flag

msglvl = 0 ! print statistical information

mtype = 11 ! real unsymmetric

iparm= 0

iparm(1) = 1 ! no solver default

iparm(2) = 3 ! fill-in reordering from METIS ,0-MIN DEGREE, 2-METIS, 3-OPENMP VERSION

iparm(3) = 0 ! numbers of processors. Input the next call mkl_set_dynamic(0), mkl_set_num_threads(n);

iparm(4) = 61 ! 0-no iterative-direct algorithm; 10*L+K, K=1 CGS, K=2 CGS for symmetric, 1.0E-L: stopping criterion

iparm(5) = 0 ! no user fill-in reducing permutation

iparm(6) = 0 ! if == 0, the array of b is replaced with the solution x.

iparm(7) = 0 ! Output, Number of iterative refinement steps performed

iparm(8) = 9 ! numbers of iterative refinement steps, must be 0 if a solution is calculated with separate substitutions

iparm(9) = 0 ! not in use

iparm(10) = 13 ! Default value 13, perturbe the pivot elements with 1E-13

iparm(11) = 1 ! use nonsymmetric permutation and scaling MPS

iparm(12) = 0 ! not in use

iparm(13) = 1 ! maximum weighted matching algorithm is switched-on (default for non-symmetric)

iparm(14) = 0 ! Output: number of perturbed pivots

iparm(15) = 0 ! Output, Peak memory on symbolic factorization.

iparm(16) = 0 ! Output, Permanent memory on symbolic factorization. This value is only computed in phase 1.

iparm(17) = 0 ! Output, Size of factors/Peak memory on numerical factorization and solution.

iparm(18) = 0 ! Input/output. Report the number of non-zero elements in the factors. >= 0 Disable reporting.

iparm(19) = 0 ! Input/output. Report number of floating point operations to factor matrix A. >= 0 Disable reporting.

iparm(20) = 0 ! Output: Numbers of CG Iterations. >0 CGS succeeded, reports the number of completed iterations.

iparm(24) = 1 ! Parallel factorization control, 0: classic algorithm, 1: two-level factorization algorithm, improve scalability on many threads.

iparm(25) = 0 ! Parallel forward/backward solve control. 0: Use parallel algorithm for the solve step; 1: Use the sequential forward/backward solve.

I tried to change iparm(4) and iparm(10), that does not make much difference. How can I improve the efficiency of Pardiso?

Thanks and best regards,

Daniel

5 Replies

Highlighted
##

danielsue

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

04-19-2013
09:51 AM

11 Views

The codes are as follows:

phase = 11 ! only reordering and symbolic factorization

call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, &

idum, nrhs, iparm_in, msglvl, ddum, ddum, error)

!Newton loop

phase = 22 ! only factorization

call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, &

idum, nrhs, iparm_in, msglvl, ddum, ddum, error)

phase = 33 ! only substitution

call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, &

idum, nrhs, iparm_in, msglvl, b_in, x_out, error)

IF THE RESULT IS NOT GOOD, CALL SYBMOLIC FACTORIZATION AGAIN.

!Newton loop

phase = -1 ! release internal memory

call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, ddum, idum, &

idum, idum, nrhs, iparm, msglvl, ddum, ddum, error)

I did not release memory every step in Newton loop, is this necessary?

Highlighted
##

danielsue

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

04-19-2013
09:54 AM

11 Views

danielsue wrote:

Hi All,

I have implemented Pardiso to solve my problem and it works fine. But the memory cost and efficiency of Pardiso is not outstanding. We have another sparse solver (

Solver A, not parallelized) that use level-based preconditioner and CGS-based acceleration.The memory cost for Pardiso is about three to four times of Solver A. For most of my matrix, if I use only one processor for Pardiso, Solver A is much faster, if I use six processors, they are almost the same speed. For some large matrix, Solver A is even faster than Pardiso even if Pardiso uses 6 processors. I compared the runtime of numerical factorization and substitution, and found that solver A is much more efficient. We do not expect Pardiso to beat Solver A if Pardiso runs in serial mode. But we do hope Pardiso to beat Solver A when it runs in parallel mode using 6 processors.The speedup for Pardiso seems good, I can get a speedup of 4.5 if I use 6 processors.

Here are the parameter I use.

maxfct = 1

mnum = 1

nrhs = 1

error = 0 ! initialize error flag

msglvl = 0 ! print statistical information

mtype = 11 ! real unsymmetriciparm= 0

iparm(1) = 1 ! no solver default

iparm(2) = 3 ! fill-in reordering from METIS ,0-MIN DEGREE, 2-METIS, 3-OPENMP VERSION

iparm(3) = 0 ! numbers of processors. Input the next call mkl_set_dynamic(0), mkl_set_num_threads(n);

iparm(4) = 61 ! 0-no iterative-direct algorithm; 10*L+K, K=1 CGS, K=2 CGS for symmetric, 1.0E-L: stopping criterion

iparm(5) = 0 ! no user fill-in reducing permutation

iparm(6) = 0 ! if == 0, the array of b is replaced with the solution x.

iparm(7) = 0 ! Output, Number of iterative refinement steps performed

iparm(8) = 9 ! numbers of iterative refinement steps, must be 0 if a solution is calculated with separate substitutions

iparm(9) = 0 ! not in use

iparm(10) = 13 ! Default value 13, perturbe the pivot elements with 1E-13

iparm(11) = 1 ! use nonsymmetric permutation and scaling MPS

iparm(12) = 0 ! not in use

iparm(13) = 1 ! maximum weighted matching algorithm is switched-on (default for non-symmetric)

iparm(14) = 0 ! Output: number of perturbed pivots

iparm(15) = 0 ! Output, Peak memory on symbolic factorization.

iparm(16) = 0 ! Output, Permanent memory on symbolic factorization. This value is only computed in phase 1.

iparm(17) = 0 ! Output, Size of factors/Peak memory on numerical factorization and solution.

iparm(18) = 0 ! Input/output. Report the number of non-zero elements in the factors. >= 0 Disable reporting.

iparm(19) = 0 ! Input/output. Report number of floating point operations to factor matrix A. >= 0 Disable reporting.

iparm(20) = 0 ! Output: Numbers of CG Iterations. >0 CGS succeeded, reports the number of completed iterations.

iparm(24) = 1 ! Parallel factorization control, 0: classic algorithm, 1: two-level factorization algorithm, improve scalability on many threads.

iparm(25) = 0 ! Parallel forward/backward solve control. 0: Use parallel algorithm for the solve step; 1: Use the sequential forward/backward solve.I tried to change iparm(4) and iparm(10), that does not make much difference. How can I improve the efficiency of Pardiso?

Thanks and best regards,

Daniel

The codes are as follows:

!codes start

phase = 11 ! only reordering and symbolic factorization

call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, &

idum, nrhs, iparm_in, msglvl, ddum, ddum, error)

!Newton loop

phase = 22 ! only factorization

call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, &

idum, nrhs, iparm_in, msglvl, ddum, ddum, error)

phase = 33 ! only substitution

call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, &

idum, nrhs, iparm_in, msglvl, b_in, x_out, error)

IF THE RESULT IS NOT GOOD, CALL SYBMOLIC FACTORIZATION AGAIN.

!Newton loop

phase = -1 ! release internal memory

call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, ddum, idum, &

idum, idum, nrhs, iparm, msglvl, ddum, ddum, error)

!codes end

I didn't release memory every newton loop, is this necesary?

Highlighted
##

Alexander_K_Intel2

Employee

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

04-19-2013
10:18 AM

11 Views

Hi Daniel,

Nice to hear you again. :) First of all I will try to switch on parallel solving. Also I see that you turn on iterative algorithm instead of factorization step, did you try to switch it off (iparm(4) = 0)? Also it make sense to obtain solution on each nonlinear iteration by internal iterative routine, for example you can make several iteration of fgmres method using current matrix as stiffness matrix and solving step of pardiso with LU decomposition of previous matrix as preconditioner. Solution from previous non-linear iterative step could be initial solution for internal iterative algorithm. But, as i wrote before, in such case we modified algorithm on nonlinear solver and nobody can guarantee that solutions will be the same :) Does this make sense?

With best regards,

Alexander Kalinkin

P.S. Also could you provide any information about solver A?

Highlighted
##

danielsue

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

04-19-2013
10:41 AM

11 Views

Alexander Kalinkin (Intel) wrote:

Hi Daniel,

Nice to hear you again. :) First of all I will try to switch on parallel solving. Also I see that you turn on iterative algorithm instead of factorization step, did you try to switch it off (iparm(4) = 0)? Also it make sense to obtain solution on each nonlinear iteration by internal iterative routine, for example you can make several iteration of fgmres method using current matrix as stiffness matrix and solving step of pardiso with LU decomposition of previous matrix as preconditioner. Solution from previous non-linear iterative step could be initial solution for internal iterative algorithm. But, as i wrote before, in such case we modified algorithm on nonlinear solver and nobody can guarantee that solutions will be the same :) Does this make sense?

With best regards,

Alexander Kalinkin

P.S. Also could you provide any information about solver A?

Hi Alexander,

Thanks for the quick reply.

I also tried to turn off iterative algorithm (iparm(4) = 0), this does not guarantee higher speed. Sometimes it is faster when iparm(4) = 0. Currently I didn't modify algorithm on nonlinear solver

SolverA is Waterloo Sparese, Iterative Matrix Solver. I attached a user's guide and notes for this solver.

Thanks and regards,

Daniel

Highlighted
##

danielsue

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

04-19-2013
10:44 AM

11 Views

For more complete information about compiler optimizations, see our Optimization Notice.