Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Memory cost and efficiency of Pardiso

danielsue
Beginner
360 Views

Hi All,

I have implemented Pardiso to solve my problem and it works fine. But the memory cost and efficiency of Pardiso is not outstanding. We have another sparse solver (Solver A, not parallelized) that use level-based preconditioner and CGS-based acceleration.The memory cost for Pardiso is about three to four times of Solver A. For most of my matrix, if I use only one processor for Pardiso, Solver A is much faster, if I use six processors, they are almost the same speed. For some large matrix, Solver A is even faster than Pardiso even if Pardiso uses 6 processors. I compared the runtime of numerical factorization and substitution, and found that solver A is much more efficient. We do not expect Pardiso to beat Solver A if Pardiso runs in serial mode. But we do hope Pardiso to beat Solver A when it runs in parallel mode using 6 processors.

The speedup for Pardiso seems good, I can get a speedup of 4.5 if I use 6 processors.

Here are the parameter I use.      

        maxfct = 1        
        mnum = 1        
        nrhs = 1        
        error = 0 ! initialize error flag        
        msglvl = 0 ! print statistical information        
        mtype = 11 ! real unsymmetric

        iparm= 0
        iparm(1) = 1 ! no solver default
        iparm(2) = 3 ! fill-in reordering from METIS ,0-MIN DEGREE, 2-METIS, 3-OPENMP VERSION
        iparm(3) = 0 ! numbers of processors. Input the next call mkl_set_dynamic(0), mkl_set_num_threads(n);    
        iparm(4) = 61 ! 0-no iterative-direct algorithm; 10*L+K, K=1 CGS, K=2 CGS for symmetric, 1.0E-L: stopping criterion
        iparm(5) = 0 ! no user fill-in reducing permutation
        iparm(6) = 0 ! if == 0, the array of b is replaced with the solution x.
        iparm(7) = 0 ! Output, Number of iterative refinement steps performed
        iparm(8) = 9 ! numbers of iterative refinement steps, must be 0 if a solution is calculated with separate substitutions
        iparm(9) = 0 ! not in use
        iparm(10) = 13 ! Default value 13, perturbe the pivot elements with 1E-13
        iparm(11) = 1 ! use nonsymmetric permutation and scaling MPS
        iparm(12) = 0 ! not in use
        iparm(13) = 1 ! maximum weighted matching algorithm is switched-on (default for non-symmetric)
        iparm(14) = 0 ! Output: number of perturbed pivots
        iparm(15) = 0 ! Output, Peak memory on symbolic factorization.
        iparm(16) = 0 ! Output, Permanent memory on symbolic factorization. This value is only computed in phase 1.
        iparm(17) = 0 ! Output, Size of factors/Peak memory on numerical factorization and solution.
        iparm(18) = 0 ! Input/output. Report the number of non-zero elements in the factors. >= 0 Disable reporting.
        iparm(19) = 0 ! Input/output. Report number of floating point operations to factor matrix A. >= 0 Disable reporting.
        iparm(20) = 0 ! Output: Numbers of CG Iterations. >0 CGS succeeded, reports the number of completed iterations.
        iparm(24) = 1 ! Parallel factorization control, 0: classic algorithm, 1: two-level factorization algorithm, improve scalability on many threads.
        iparm(25) = 0 ! Parallel forward/backward solve control. 0: Use parallel algorithm for the solve step; 1: Use the sequential forward/backward solve.

I tried to change iparm(4) and iparm(10), that does not make much difference. How can I improve the efficiency of Pardiso?

Thanks and best regards,

Daniel

0 Kudos
5 Replies
danielsue
Beginner
360 Views

The codes are as follows:

        phase = 11 ! only reordering and symbolic factorization  
       
        call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, &
        idum, nrhs, iparm_in, msglvl, ddum, ddum, error)
        
!Newton loop

        phase = 22 ! only factorization
        call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, &
        idum, nrhs, iparm_in, msglvl, ddum, ddum, error)

        phase = 33 ! only substitution
            
        call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, &
            idum, nrhs, iparm_in, msglvl, b_in, x_out, error)

        IF THE RESULT IS NOT GOOD, CALL SYBMOLIC FACTORIZATION AGAIN.

!Newton loop

        phase = -1 ! release internal memory
        call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, ddum, idum,    &
        idum, idum, nrhs, iparm, msglvl, ddum, ddum, error)

I did not release memory every step in Newton loop, is this necessary?

0 Kudos
danielsue
Beginner
360 Views

danielsue wrote:

Hi All,

I have implemented Pardiso to solve my problem and it works fine. But the memory cost and efficiency of Pardiso is not outstanding. We have another sparse solver (Solver A, not parallelized) that use level-based preconditioner and CGS-based acceleration.The memory cost for Pardiso is about three to four times of Solver A. For most of my matrix, if I use only one processor for Pardiso, Solver A is much faster, if I use six processors, they are almost the same speed. For some large matrix, Solver A is even faster than Pardiso even if Pardiso uses 6 processors. I compared the runtime of numerical factorization and substitution, and found that solver A is much more efficient. We do not expect Pardiso to beat Solver A if Pardiso runs in serial mode. But we do hope Pardiso to beat Solver A when it runs in parallel mode using 6 processors.

The speedup for Pardiso seems good, I can get a speedup of 4.5 if I use 6 processors.

Here are the parameter I use.      

        maxfct = 1        
        mnum = 1        
        nrhs = 1        
        error = 0 ! initialize error flag        
        msglvl = 0 ! print statistical information        
        mtype = 11 ! real unsymmetric

        iparm= 0
        iparm(1) = 1 ! no solver default
        iparm(2) = 3 ! fill-in reordering from METIS ,0-MIN DEGREE, 2-METIS, 3-OPENMP VERSION
        iparm(3) = 0 ! numbers of processors. Input the next call mkl_set_dynamic(0), mkl_set_num_threads(n);    
        iparm(4) = 61 ! 0-no iterative-direct algorithm; 10*L+K, K=1 CGS, K=2 CGS for symmetric, 1.0E-L: stopping criterion
        iparm(5) = 0 ! no user fill-in reducing permutation
        iparm(6) = 0 ! if == 0, the array of b is replaced with the solution x.
        iparm(7) = 0 ! Output, Number of iterative refinement steps performed
        iparm(8) = 9 ! numbers of iterative refinement steps, must be 0 if a solution is calculated with separate substitutions
        iparm(9) = 0 ! not in use
        iparm(10) = 13 ! Default value 13, perturbe the pivot elements with 1E-13
        iparm(11) = 1 ! use nonsymmetric permutation and scaling MPS
        iparm(12) = 0 ! not in use
        iparm(13) = 1 ! maximum weighted matching algorithm is switched-on (default for non-symmetric)
        iparm(14) = 0 ! Output: number of perturbed pivots
        iparm(15) = 0 ! Output, Peak memory on symbolic factorization.
        iparm(16) = 0 ! Output, Permanent memory on symbolic factorization. This value is only computed in phase 1.
        iparm(17) = 0 ! Output, Size of factors/Peak memory on numerical factorization and solution.
        iparm(18) = 0 ! Input/output. Report the number of non-zero elements in the factors. >= 0 Disable reporting.
        iparm(19) = 0 ! Input/output. Report number of floating point operations to factor matrix A. >= 0 Disable reporting.
        iparm(20) = 0 ! Output: Numbers of CG Iterations. >0 CGS succeeded, reports the number of completed iterations.
        iparm(24) = 1 ! Parallel factorization control, 0: classic algorithm, 1: two-level factorization algorithm, improve scalability on many threads.
        iparm(25) = 0 ! Parallel forward/backward solve control. 0: Use parallel algorithm for the solve step; 1: Use the sequential forward/backward solve.

I tried to change iparm(4) and iparm(10), that does not make much difference. How can I improve the efficiency of Pardiso?

Thanks and best regards,

Daniel

The codes are as follows:

!codes start
        phase = 11 ! only reordering and symbolic factorization  
       
        call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, &
        idum, nrhs, iparm_in, msglvl, ddum, ddum, error)
        
!Newton loop

        phase = 22 ! only factorization
        call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, &
        idum, nrhs, iparm_in, msglvl, ddum, ddum, error)

        phase = 33 ! only substitution
            
        call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, &
            idum, nrhs, iparm_in, msglvl, b_in, x_out, error)

        IF THE RESULT IS NOT GOOD, CALL SYBMOLIC FACTORIZATION AGAIN.

!Newton loop

        phase = -1 ! release internal memory
        call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, ddum, idum,    &
        idum, idum, nrhs, iparm, msglvl, ddum, ddum, error)

!codes end

I didn't release memory every newton loop, is this necesary?

0 Kudos
Alexander_K_Intel2
360 Views

Hi Daniel,

Nice to hear you again. :) First of all I will try to switch on parallel solving. Also I see that you turn on iterative algorithm instead of factorization step, did you try to switch it off (iparm(4) = 0)? Also it make sense to obtain solution on each nonlinear iteration by internal iterative routine, for example you can make several iteration of fgmres method using current matrix as stiffness matrix and solving step of pardiso with LU decomposition of previous matrix as preconditioner. Solution from previous non-linear iterative step could be initial solution for internal iterative algorithm. But, as i wrote before, in such case we modified algorithm on nonlinear solver and nobody can guarantee that solutions will be the same :) Does this make sense?

With best regards,

Alexander Kalinkin

P.S. Also could you provide any information about solver A?

0 Kudos
danielsue
Beginner
360 Views

Alexander Kalinkin (Intel) wrote:

Hi Daniel,

Nice to hear you again. :) First of all I will try to switch on parallel solving. Also I see that you turn on iterative algorithm instead of factorization step, did you try to switch it off (iparm(4) = 0)? Also it make sense to obtain solution on each nonlinear iteration by internal iterative routine, for example you can make several iteration of fgmres method using current matrix as stiffness matrix and solving step of pardiso with LU decomposition of previous matrix as preconditioner. Solution from previous non-linear iterative step could be initial solution for internal iterative algorithm. But, as i wrote before, in such case we modified algorithm on nonlinear solver and nobody can guarantee that solutions will be the same :) Does this make sense?

With best regards,

Alexander Kalinkin

P.S. Also could you provide any information about solver A?

Hi Alexander,

Thanks for the quick reply.

I also tried to turn off iterative algorithm (iparm(4) = 0), this does not guarantee higher speed. Sometimes it is faster when iparm(4) = 0. Currently I didn't modify algorithm on nonlinear solver

SolverA is Waterloo Sparese, Iterative Matrix Solver. I attached a user's guide and notes for this solver.

Thanks and regards,

Daniel

0 Kudos
danielsue
Beginner
360 Views

Please find the attached pdf file for the introduction of Solver A.

0 Kudos
Reply