Question about parameter setting of PARDISO

Zhanghong_T_ · ‎09-01-2014

Dear all,

I use the MKL under WIndows 7 64bit + VS 2012 + MKL 11.0 or 11.2.

I need to solve very large FEM matrix by PARDISO. Currently I set parameters as follows:

iparm(1) = 1 ! no solver default
iparm(2) = 3 ! parallel (OpenMP) version of the nested dissection algorithm
iparm(4) = 0 ! no iterative-direct algorithm
iparm(5) = 0 ! no user fill-in reducing permutation
iparm(6) = 0 ! =0 solution on the first n compoments of x
iparm(8) = 9 ! numbers of iterative refinement steps
iparm(10) = 13 ! perturbe the pivot elements with 1E-13
iparm(11) = 1 ! use nonsymmetric permutation and scaling MPS
iparm(13) = 0 ! maximum weighted matching algorithm is switched-off (default for symmetric). Try iparm(13) = 1 in case of inappropriate accuracy
iparm(18) = -1 ! Output: number of nonzeros in the factor LU
iparm(19) = -1 ! Output: Mflops for LU factorization
iparm(20) = 0 ! Output: Numbers of CG Iterations
iparm(21) = 1 ! Apply 1x1 and 2x2 Bunch and Kaufman pivoting during the factorization process
iparm(24) = 1 ! PARDISO uses new two-level factorization algorithm

The peak memory usage would reach to 20 GB (windows task manager->commit size). Is there any better parameter settings to decrease memory usage but don't increase the solution time significant?

Thanks,

Zhanghong Tang

Chao_Y_Intel · ‎09-01-2014

Hi Zhanghong,

Can you check the out-of-core solver for the pardiso?
iparm(60): switches between in-core (IC) and out-of-core (OOC) Intel MKL PARDISO. OOC can solve very large problems by holding the matrix factors in files on the disk, which requires a reduced amount of main memory compared to IC.

Thanks,
Chao

Zhanghong_T_ · ‎09-01-2014

Dear Chao,

Thank you very much for your kindly reply. I have tested the OOC and it is very very slow. The most important is that I have tens of processes call the PARDISO at the same time and the OOC would lead to very large disk usage.

I wish to decrease memory usage of every process and then I can launch more processes in every machine (48 GB memory / machine and I wish to launch 3 processes in every machine).

Thanks,

Zhanghong Tang

Alexander_K_Intel2 · ‎09-02-2014

Hi Zhanghong,

May I ask you what iparm(15), iparm(16), iparm(17) and iparm(63) returned by PARDISO after reordering step and what value of MKL_PARDISO_OOC_MAX_CORE_SIZE have you set? I ask because OOC PARDISO algorithm is greedy and use all available memory to improve performance. So, if you want to improvement solution time of OOC PARDISO just increase this parameter. Also, i see that you switch on two-level algorithm of factorization - in current version of OOC PARDISO this mode doesn't support.

Thanks,

Alex

Zhanghong_T_ · ‎09-24-2014

Dear all,

Thank you very much for your kindly reply. After my several days' checking, I found the problem is from me. The filled matrix is not valid.

Now I have another question: for some large problem the memory is not enough (but I don't want to use the OOC), I can use the CGS by setting iparm(4) to non-zero, right? So how to setup this parameter? I have the following code, is it correct?

phase = 11

call pardiso(...)

phase = 22

call pardiso(...)

if(error=-2)then ! not enough memory

iparm(4) = 72

call pardiso(...)

endif

phase = 33

call pardiso(...)

phase = -1

call pardiso(...)

Thanks,

Zhanghong Tang

Zhanghong_T_ · ‎09-24-2014

Furthermore, can I decide whether to use direct method or not by iparm(15), just like this?

phase = 11

call pardiso(...)

phase = 22

if(iparm(15)>16777216)then ! 16 GB: 16 kB * 1024 * 1024
iparm(4)=72
endif

call pardiso(...)

if(error=-2)then ! not enough memory

iparm(4) = 72

call pardiso(...)

endif

phase = 33

call pardiso(...)

phase = -1

call pardiso(...)

However, I also noticed iparm(20), when the cgs_error is 5 it means factorization is too fast for this matrix. It is better to use the factorization method with iparm(4) = 0

How to process this problem?

Thanks,

Zhanghong Tang

Zhanghong_T_ · ‎09-26-2014

Dear all,

Is there any suggestion? Currently I did as what I said in my previous post except the following line changed:

if(max(iparm(15),iparm(16)+iparm(17))>16777216)then ! 16 GB: 16 kB * 1024 * 1024

However, the solution is still done by direct method but the memory usage is larger than 16 GB, usually the peak memory usage reaches to 21 GB.

Could anyone help me to take a look at it?

Thanks,

Zhanghong Tang

mecej4 · ‎09-27-2014

Not many of the non-Intel readers here have multiple machines with 48 GB of RAM. If you could scale down the problem to, say, 4 GB, while still seeing the issues that you have mentioned, you would see more answers.

Zhanghong_T_ · ‎09-27-2014

Dear mecej4,

Thank you very much for your kindly reply.

I think it is simple to let it switch to iterative solver by 4 GB limit:

if(max(iparm(15),iparm(16)+iparm(17))>4194304)then ! 4 GB: 4 kB * 1024 * 1024

iparm(4)=72

endif

But now the problem is that the solver doesn't work as I set and I don't know what I have missed in parameter setting. Could you please help me to take a look at it?

Thanks,

Zhanghong Tang

mecej4 · ‎09-27-2014

If you provide complete source for an example, I should be happy to look at it. It is, however, not reasonable to expect us to reconstruct an example ourselves, using as basis a piecemeal description of the problem.

Zhanghong_T_ · ‎09-27-2014

Dear mecej4,

Thank you very much for your kindly reply. The following are the complete source code I used to solve my problem. Could you please help me to take a look at it?

  subroutine solverall_pardiso(nz,irow,jcol,s,n,nrhs,nzr,irz,jrz,rz,x)
  ! sparse right hand side
  ! CSR format input for PARDISO
  USE mkl_pardiso
  implicit none
  TYPE(MKL_PARDISO_HANDLE), ALLOCATABLE  :: pt(:)
  INTEGER, ALLOCATABLE :: iparm( : )
  integer::maxfct, mnum, mtype, phase, n,nz,nnz,nzr,nrhs, error, msglvl 
  integer,pointer::irow(:),jcol(:),irz(:),jrz(:)
  real*8,pointer::s(:), rz(:)
  real*8,pointer::x(:,:),xi(:),bi(:)
  real*8,allocatable::b(:)
  integer::i,j,ii,ir,jc,idum(1)
  real*8::ddum(1)

  call mkl_set_dynamic(0)          !  prevent Intel MKL from dynamically reducing the number of threads in nested parallel regions. 
  
  maxfct=1
  mnum=1
  idum=0  
  ddum=0.0d0
  ALLOCATE( iparm ( 64 ) )
  do i = 1, 64
     iparm(i) = 0
  end do 
  iparm(1) = 1 ! no solver default
  iparm(2) = 3 ! parallel (OpenMP) version of the nested dissection algorithm
  iparm(4) = 0 ! no iterative-direct algorithm
  iparm(5) = 0 ! no user fill-in reducing permutation
  iparm(6) = 0 ! =0 solution on the first n compoments of x
  iparm(8) = 9 ! numbers of iterative refinement steps
  iparm(10) = 13 ! perturbe the pivot elements with 1E-13
  iparm(11) = 1 ! use nonsymmetric permutation and scaling MPS
  iparm(13) = 0 ! maximum weighted matching algorithm is switched-off (default for symmetric). Try iparm(13) = 1 in case of inappropriate accuracy
  iparm(18) = -1 ! Output: number of nonzeros in the factor LU
  iparm(19) = -1 ! Output: Mflops for LU factorization
  iparm(20) = 0 ! Output: Numbers of CG Iterations
  iparm(21) = 1 ! Apply 1x1 and 2x2 Bunch and Kaufman pivoting during the factorization process
  iparm(24) = 1 ! PARDISO uses new two-level factorization algorithm
 
  error = 0  
  msglvl = 0
  mtype = -2  
  !.. Initiliaze the internal solver memory pointer. This is only
  ! necessary for the FIRST call of the PARDISO solver.

  ALLOCATE ( pt ( 64 ) )
  do i = 1, 64
     pt( i )%DUMMY =  0 
  end do
  !.. Reordering and Symbolic Factorization, This step also allocates
  ! all memory that is necessary for the factorization
  phase = 11 ! only reordering and symbolic factorization
  call pardiso (pt, maxfct, mnum, mtype, phase, n, s, irow, jcol, idum, nrhs, iparm, msglvl, ddum, ddum, error)
  if(error/=0)write(*,*) 'error in solver at step 1, error code = ',error
  if(error/=0)stop 
  
  !.. Factorization.
  phase = 22 ! only factorization
  ! add new feature: check the peak memory usage by pardiso, if larger than available, use iterative instead of 
  ! direct method
  ! Zhanghong Tang @ 09/25/2014
  if(max(iparm(15),iparm(16)+iparm(17))>16777216)then   ! 16 GB: 16 kB * 1024 * 1024
    iparm(4)=72
  endif
  call pardiso (pt, maxfct, mnum, mtype, phase, n, s, irow, jcol, idum, nrhs, iparm, msglvl, ddum, ddum, error)
  if(error/=0)write(*,*) 'error in solver at step 2, error code = ',error
  if(error/=0)stop 
  
  !.. Back substitution and iterative refinement
  iparm(8) = 2 ! max numbers of iterative refinement steps
  phase = 33 ! only factorization
  if(allocated(b))deallocate(b)
  allocate(b(n))
  b=0.0d0
  do i=1,nrhs
    xi=>x(:,i)
    do j=1,nzr
      if(jrz(j)==i)b(irz(j))=rz(j)
    enddo
    call pardiso (pt, maxfct, mnum, mtype, phase, n, s, irow, jcol, idum, 1, iparm, msglvl, b, xi, error)
    do j=1,nzr
      if(jrz(j)==i)b(irz(j))=0.0d0
    enddo
    if(error/=0)write(*,*) 'error in solver at step 3, error code = ',error
    if(error/=0)stop 
  enddo 
  
  !.. Termination and release of memory
  phase = -1 ! release internal memory
  call pardiso (pt, maxfct, mnum, mtype, phase, n, ddum, idum, idum, idum, nrhs, iparm, msglvl, ddum, ddum, error)
  IF ( ALLOCATED( iparm ) )   DEALLOCATE( iparm )
  IF ( ALLOCATED( pt ) )   DEALLOCATE( pt )
  if(allocated(b))deallocate(b)
  end subroutine

mecej4 · ‎09-27-2014

You provided only the source for the subroutine. Please provide the source code for a main program that calls this subroutine with the input arguments properly filled in. It takes less effort to let the computer try to run the program and produce some results before one looks at the source code line by line. Secondly, whether a problem occurs depends often on the data on which the program is run.

Zhanghong_T_ · ‎09-27-2014

Dear mecej4,

Thank you very much for your kindly reply. The main program only contains several lines-read sparse matrix and the call the subroutine. However, the sparse matrix is too large (more than 1 GB size) to be uploaded.

The code I pasted can work perfectly if the memory is enough. Now the problem is that it can't switch to iterative solver when estimated memory usage exceed the set value. I don't know where I have missed.

In addition, I don't think it requires my complete project (include the complete sparse matrix data) to reproduce this problem: it is a general issue and the problem will occur when the size of sparse matrix reaches to some value, for example, 1000000.

However, if you will still require the matrix, could you please suggest a space and I can upload the sparse matrix to it?

Thanks,

Zhanghong Tang

mecej4 · ‎09-27-2014

A short program that generates the matrix elements (by calculation) would work nicely. After all, nobody typed in 1 GB worth of data into a file, did they?

Zhanghong_T_ · ‎09-27-2014

Dear mecej4,

Thank you very much for your kindly reply.

But I am sorry that the program that generates the matrix elements is not short or simple. It is from electromagnetic FEM problem.

On the other hand, as I said before, that's a general problem and it is not related to my special sparse matrix. That is to say, I think the expert will not need the data to debug the subroutine to point out my problem--the problem that set parameters to let the PARDISO choose work by direct method or iterative method.

Thanks,

Zhanghong Tang

Zhanghong_T_ · ‎09-30-2014

Dear Intel administrator,

Is there any document/instruction of PARDISO to describe how to use/launch iterative solver when evaluated memory usage larger than the limit set before? For example, how to launch iterative solver (set iparm(4) to non-zero) when max(iparm(15),iparm(16)+iparm(17)) is larger than 16777216 (16 GB)?

Thanks,

Zhanghong Tang

Alexander_K_Intel2 · ‎09-30-2014

Hi,

The main idea of CGS is to use iterative solver for case when we need to solve several systems with precondition as factorized matrix from one system. So it can be used in case of solve several systems only.

I really recommend to set iparm(60) to 1 and MKL_PARDISO_OOC_MAX_CORE_SIZE to memory size that you ready to use for pardiso. In such case PARDISO will choose internally algorithm to use only memory size that set by MKL_PARDISO_OOC_MAX_CORE_SIZE

Thanks,

Alex

Zhanghong_T_ · ‎09-30-2014

Dear Alex,

Thank you very much for your kindly reply. Now the problem is that I will launch 3 or more processes on every machine to solve different systems. The total memory is 48 GB on every machine and I need to solve hundreds of this kind of systems. I am afraid that the write/read large data to/from disk will decrease the speed signification. On the other hand, I am afraid that the disk space is not enough (suppose 40 GB per matrix, the required disk space is at least 40*3 - 48 = 72 GB) since the machines are also used by other people.

I don't know whether the PARDISO do an imcomplete LU factorization according the limit set before and then use CGS to solve the problem.

Thanks,

Zhanghong Tang

Zhanghong_T_ · ‎10-02-2014

Dear Alex,

I tried to use the OOC of PARDISO, but the PARDISO returns error code -10. Could you please help me to take a look at it?

Thanks,

Zhanghong Tang

PS: part of the code:

...
iparm(60)=2
   i = MAKEDIRQQ('c:\temp\')
if(myid<10)then
    write(ss2 , '(i1)') myid
elseif(myid<100)then
    write(ss2 , '(i2)') myid
elseif(myid<1000)then
    write(ss2 , '(i3)') myid
elseif(myid<10000)then
    write(ss2 , '(i4)') myid
elseif(myid<100000)then
    write(ss2 , '(i5)') myid
endif
file_name = 'c:\temp\'//trim(ss2)
bufLen = len_trim(file_name)
call mkl_cvt_to_null_terminated_str(buff, bufLen, trim(file_name))
error = pardiso_setenv(pt, PARDISO_OOC_FILE_NAME, buff)
open(1,file='pardiso_ooc.cfg',status='new',err=1)
write(1,*)'MKL_PARDISO_OOC_MAX_CORE_SIZE = ',16384
write(1,*)'MKL_PARDISO_OOC_KEEP_FILE = 0'
close(1)
call pardiso(...)

The program is launched as following command:

mpiexec -wdir "Z:\test" -mapall -hosts 10 n01 2 n02 2 n03 4 n04 4 n05 4 n06 4 n07 4 n08 4 n09 4 n10 4 Z:\fem

where Z: is a mapped driver shared to all hosts.

Alexander_K_Intel2 · ‎10-02-2014

Hi,

Look's like PARDISO got problem during reading/writing OOC file. Could you set msglvl to 1 and attach output that pardiso return to the screen?

Thanks,

Alex

Zhanghong_T_ · ‎10-02-2014

Dear Alex,

Thank you very much for your kindly reply. The screen output is as follows:

=== PARDISO: solving a symmetric indefinite system ===
The local (internal) PARDISO version is : 103911000
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 1.468480 s
Time spent in reordering of the initial matrix (reorder)         : 20.758079 s
Time spent in symbolic factorization (symbfct)                   : 4.008938 s
Time spent in data preparations for factorization (parlist)      : 0.188263 s
Time spent in allocation of internal data structures (malloc)    : 0.310438 s
Time spent in additional calculations                            : 4.130998 s
Total time spent                                                 : 30.865196 s

Statistics:
===========
< Parallel Direct Factorization with number of processors: > 24
< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b >
             number of equations:           2547445
             number of non-zeros in A:      36680014
             number of non-zeros in A (): 0.000565

number of right-hand sides: 7

< Factors L and U >
             number of columns for each panel: 192
             number of independent subgraphs: 0
             number of supernodes:                    955120
             size of largest supernode:               4429
             number of non-zeros in L:                750724073
             number of non-zeros in U:                1
             number of non-zeros in L+U:              750724074
=== PARDISO is running in Out-Of-Core mode, because iparam(60)=2 ===
Percentage of computed non-zeros for LL^T factorization
*** Error in PARDISO (read/write OOC data file) error_num= 0

=== PARDISO: solving a symmetric indefinite system ===

Summary: ( factorization phase )
================

Times:
======
Time spent in allocation of internal data structures (malloc)    : 30.196333 s
Time spent in additional calculations                            : -14.900597 s
Total time spent                                                 : 15.295736 s

Statistics:
===========
< Parallel Direct Factorization with number of processors: > 24
< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b >
             number of equations:           2547445
             number of non-zeros in A:      36680014
             number of non-zeros in A (): 0.000565

number of right-hand sides: 7

< Factors L and U >
             number of columns for each panel: 192
             number of independent subgraphs: 0
             number of supernodes:                    955120
             size of largest supernode:               4429
             number of non-zeros in L:                750724073
             number of non-zeros in U:                1
             number of non-zeros in L+U:              750724074
             gflop   for the numerical factorization: 639.881780

error in solver at step 2, error code = -11

Thanks,

Zhanghong Tang