Memory cost and efficiency of Pardiso

danielsue · ‎04-19-2013

Hi All,

I have implemented Pardiso to solve my problem and it works fine. But the memory cost and efficiency of Pardiso is not outstanding. We have another sparse solver (Solver A, not parallelized) that use level-based preconditioner and CGS-based acceleration.The memory cost for Pardiso is about three to four times of Solver A. For most of my matrix, if I use only one processor for Pardiso, Solver A is much faster, if I use six processors, they are almost the same speed. For some large matrix, Solver A is even faster than Pardiso even if Pardiso uses 6 processors. I compared the runtime of numerical factorization and substitution, and found that solver A is much more efficient. We do not expect Pardiso to beat Solver A if Pardiso runs in serial mode. But we do hope Pardiso to beat Solver A when it runs in parallel mode using 6 processors.

The speedup for Pardiso seems good, I can get a speedup of 4.5 if I use 6 processors.

Here are the parameter I use.

        maxfct = 1
        mnum = 1
        nrhs = 1
        error = 0 ! initialize error flag
        msglvl = 0 ! print statistical information
        mtype = 11 ! real unsymmetric

        iparm= 0
        iparm(1) = 1 ! no solver default
        iparm(2) = 3 ! fill-in reordering from METIS ,0-MIN DEGREE, 2-METIS, 3-OPENMP VERSION
        iparm(3) = 0 ! numbers of processors. Input the next call mkl_set_dynamic(0), mkl_set_num_threads(n);
        iparm(4) = 61 ! 0-no iterative-direct algorithm; 10*L+K, K=1 CGS, K=2 CGS for symmetric, 1.0E-L: stopping criterion
        iparm(5) = 0 ! no user fill-in reducing permutation
        iparm(6) = 0 ! if == 0, the array of b is replaced with the solution x.
        iparm(7) = 0 ! Output, Number of iterative refinement steps performed
        iparm(8) = 9 ! numbers of iterative refinement steps, must be 0 if a solution is calculated with separate substitutions
        iparm(9) = 0 ! not in use
        iparm(10) = 13 ! Default value 13, perturbe the pivot elements with 1E-13
        iparm(11) = 1 ! use nonsymmetric permutation and scaling MPS
        iparm(12) = 0 ! not in use
        iparm(13) = 1 ! maximum weighted matching algorithm is switched-on (default for non-symmetric)
        iparm(14) = 0 ! Output: number of perturbed pivots
        iparm(15) = 0 ! Output, Peak memory on symbolic factorization.
        iparm(16) = 0 ! Output, Permanent memory on symbolic factorization. This value is only computed in phase 1.
        iparm(17) = 0 ! Output, Size of factors/Peak memory on numerical factorization and solution.
        iparm(18) = 0 ! Input/output. Report the number of non-zero elements in the factors. >= 0 Disable reporting.
        iparm(19) = 0 ! Input/output. Report number of floating point operations to factor matrix A. >= 0 Disable reporting.
        iparm(20) = 0 ! Output: Numbers of CG Iterations. >0 CGS succeeded, reports the number of completed iterations.
        iparm(24) = 1 ! Parallel factorization control, 0: classic algorithm, 1: two-level factorization algorithm, improve scalability on many threads.
        iparm(25) = 0 ! Parallel forward/backward solve control. 0: Use parallel algorithm for the solve step; 1: Use the sequential forward/backward solve.

I tried to change iparm(4) and iparm(10), that does not make much difference. How can I improve the efficiency of Pardiso?

Thanks and best regards,

Daniel

danielsue · ‎04-19-2013

The codes are as follows:

        phase = 11 ! only reordering and symbolic factorization

        call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, &
        idum, nrhs, iparm_in, msglvl, ddum, ddum, error)

!Newton loop

        phase = 22 ! only factorization
        call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, &
        idum, nrhs, iparm_in, msglvl, ddum, ddum, error)

        phase = 33 ! only substitution

       call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, &
            idum, nrhs, iparm_in, msglvl, b_in, x_out, error)

       IF THE RESULT IS NOT GOOD, CALL SYBMOLIC FACTORIZATION AGAIN.

!Newton loop

        phase = -1 ! release internal memory
        call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, ddum, idum,    &
        idum, idum, nrhs, iparm, msglvl, ddum, ddum, error)

I did not release memory every step in Newton loop, is this necessary?

danielsue · ‎04-19-2013

danielsue wrote:

Hi All,

I have implemented Pardiso to solve my problem and it works fine. But the memory cost and efficiency of Pardiso is not outstanding. We have another sparse solver (Solver A, not parallelized) that use level-based preconditioner and CGS-based acceleration.The memory cost for Pardiso is about three to four times of Solver A. For most of my matrix, if I use only one processor for Pardiso, Solver A is much faster, if I use six processors, they are almost the same speed. For some large matrix, Solver A is even faster than Pardiso even if Pardiso uses 6 processors. I compared the runtime of numerical factorization and substitution, and found that solver A is much more efficient. We do not expect Pardiso to beat Solver A if Pardiso runs in serial mode. But we do hope Pardiso to beat Solver A when it runs in parallel mode using 6 processors.

The speedup for Pardiso seems good, I can get a speedup of 4.5 if I use 6 processors.

Here are the parameter I use.

        maxfct = 1
        mnum = 1
        nrhs = 1
        error = 0 ! initialize error flag
        msglvl = 0 ! print statistical information
        mtype = 11 ! real unsymmetric

        iparm= 0
        iparm(1) = 1 ! no solver default
        iparm(2) = 3 ! fill-in reordering from METIS ,0-MIN DEGREE, 2-METIS, 3-OPENMP VERSION
        iparm(3) = 0 ! numbers of processors. Input the next call mkl_set_dynamic(0), mkl_set_num_threads(n);
        iparm(4) = 61 ! 0-no iterative-direct algorithm; 10*L+K, K=1 CGS, K=2 CGS for symmetric, 1.0E-L: stopping criterion
        iparm(5) = 0 ! no user fill-in reducing permutation
        iparm(6) = 0 ! if == 0, the array of b is replaced with the solution x.
        iparm(7) = 0 ! Output, Number of iterative refinement steps performed
        iparm(8) = 9 ! numbers of iterative refinement steps, must be 0 if a solution is calculated with separate substitutions
        iparm(9) = 0 ! not in use
        iparm(10) = 13 ! Default value 13, perturbe the pivot elements with 1E-13
        iparm(11) = 1 ! use nonsymmetric permutation and scaling MPS
        iparm(12) = 0 ! not in use
        iparm(13) = 1 ! maximum weighted matching algorithm is switched-on (default for non-symmetric)
        iparm(14) = 0 ! Output: number of perturbed pivots
        iparm(15) = 0 ! Output, Peak memory on symbolic factorization.
        iparm(16) = 0 ! Output, Permanent memory on symbolic factorization. This value is only computed in phase 1.
        iparm(17) = 0 ! Output, Size of factors/Peak memory on numerical factorization and solution.
        iparm(18) = 0 ! Input/output. Report the number of non-zero elements in the factors. >= 0 Disable reporting.
        iparm(19) = 0 ! Input/output. Report number of floating point operations to factor matrix A. >= 0 Disable reporting.
        iparm(20) = 0 ! Output: Numbers of CG Iterations. >0 CGS succeeded, reports the number of completed iterations.
        iparm(24) = 1 ! Parallel factorization control, 0: classic algorithm, 1: two-level factorization algorithm, improve scalability on many threads.
        iparm(25) = 0 ! Parallel forward/backward solve control. 0: Use parallel algorithm for the solve step; 1: Use the sequential forward/backward solve.

I tried to change iparm(4) and iparm(10), that does not make much difference. How can I improve the efficiency of Pardiso?

Thanks and best regards,

Daniel

The codes are as follows:

!codes start
        phase = 11 ! only reordering and symbolic factorization

        call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, &
        idum, nrhs, iparm_in, msglvl, ddum, ddum, error)

!Newton loop

        phase = 22 ! only factorization
        call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, &
        idum, nrhs, iparm_in, msglvl, ddum, ddum, error)

        phase = 33 ! only substitution

       call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, &
            idum, nrhs, iparm_in, msglvl, b_in, x_out, error)

       IF THE RESULT IS NOT GOOD, CALL SYBMOLIC FACTORIZATION AGAIN.

!Newton loop

        phase = -1 ! release internal memory
        call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, ddum, idum,    &
        idum, idum, nrhs, iparm, msglvl, ddum, ddum, error)

!codes end

I didn't release memory every newton loop, is this necesary?

Alexander_K_Intel2 · ‎04-19-2013

Hi Daniel,

Nice to hear you again. :) First of all I will try to switch on parallel solving. Also I see that you turn on iterative algorithm instead of factorization step, did you try to switch it off (iparm(4) = 0)? Also it make sense to obtain solution on each nonlinear iteration by internal iterative routine, for example you can make several iteration of fgmres method using current matrix as stiffness matrix and solving step of pardiso with LU decomposition of previous matrix as preconditioner. Solution from previous non-linear iterative step could be initial solution for internal iterative algorithm. But, as i wrote before, in such case we modified algorithm on nonlinear solver and nobody can guarantee that solutions will be the same :) Does this make sense?

With best regards,

Alexander Kalinkin

P.S. Also could you provide any information about solver A?

danielsue · ‎04-19-2013

Alexander Kalinkin (Intel) wrote:

Hi Daniel,

Nice to hear you again. :) First of all I will try to switch on parallel solving. Also I see that you turn on iterative algorithm instead of factorization step, did you try to switch it off (iparm(4) = 0)? Also it make sense to obtain solution on each nonlinear iteration by internal iterative routine, for example you can make several iteration of fgmres method using current matrix as stiffness matrix and solving step of pardiso with LU decomposition of previous matrix as preconditioner. Solution from previous non-linear iterative step could be initial solution for internal iterative algorithm. But, as i wrote before, in such case we modified algorithm on nonlinear solver and nobody can guarantee that solutions will be the same :) Does this make sense?

With best regards,

Alexander Kalinkin

P.S. Also could you provide any information about solver A?

Hi Alexander,

Thanks for the quick reply.

I also tried to turn off iterative algorithm (iparm(4) = 0), this does not guarantee higher speed. Sometimes it is faster when iparm(4) = 0. Currently I didn't modify algorithm on nonlinear solver

SolverA is Waterloo Sparese, Iterative Matrix Solver. I attached a user's guide and notes for this solver.

Thanks and regards,

Daniel

danielsue · ‎04-19-2013

Please find the attached pdf file for the introduction of Solver A.