- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear all,

I use the MKL under WIndows 7 64bit + VS 2012 + MKL 11.0 or 11.2.

I need to solve very large FEM matrix by PARDISO. Currently I set parameters as follows:

iparm(1) = 1 ! no solver default

iparm(2) = 3 ! parallel (OpenMP) version of the nested dissection algorithm

iparm(4) = 0 ! no iterative-direct algorithm

iparm(5) = 0 ! no user fill-in reducing permutation

iparm(6) = 0 ! =0 solution on the first n compoments of x

iparm(8) = 9 ! numbers of iterative refinement steps

iparm(10) = 13 ! perturbe the pivot elements with 1E-13

iparm(11) = 1 ! use nonsymmetric permutation and scaling MPS

iparm(13) = 0 ! maximum weighted matching algorithm is switched-off (default for symmetric). Try iparm(13) = 1 in case of inappropriate accuracy

iparm(18) = -1 ! Output: number of nonzeros in the factor LU

iparm(19) = -1 ! Output: Mflops for LU factorization

iparm(20) = 0 ! Output: Numbers of CG Iterations

iparm(21) = 1 ! Apply 1x1 and 2x2 Bunch and Kaufman pivoting during the factorization process

iparm(24) = 1 ! PARDISO uses new two-level factorization algorithm

The peak memory usage would reach to 20 GB (windows task manager->commit size). Is there any better parameter settings to decrease memory usage but don't increase the solution time significant?

Thanks,

Zhanghong Tang

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi Zhanghong,

Can you check the out-of-core solver for the pardiso?

iparm(60): switches between in-core (IC) and out-of-core (OOC) Intel MKL PARDISO. OOC can solve very large problems by holding the matrix factors in files on the disk, which requires a reduced amount of main memory compared to IC.

Thanks,

Chao

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear Chao,

Thank you very much for your kindly reply. I have tested the OOC and it is very very slow. The most important is that I have tens of processes call the PARDISO at the same time and the OOC would lead to very large disk usage.

I wish to decrease memory usage of every process and then I can launch more processes in every machine (48 GB memory / machine and I wish to launch 3 processes in every machine).

Thanks,

Zhanghong Tang

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi Zhanghong,

May I ask you what iparm(15), iparm(16), iparm(17) and iparm(63) returned by PARDISO after reordering step and what value of MKL_PARDISO_OOC_MAX_CORE_SIZE have you set? I ask because OOC PARDISO algorithm is greedy and use all available memory to improve performance. So, if you want to improvement solution time of OOC PARDISO just increase this parameter. Also, i see that you switch on two-level algorithm of factorization - in current version of OOC PARDISO this mode doesn't support.

Thanks,

Alex

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear all,

Thank you very much for your kindly reply. After my several days' checking, I found the problem is from me. The filled matrix is not valid.

Now I have another question: for some large problem the memory is not enough (but I don't want to use the OOC), I can use the CGS by setting `iparm(4)` to non-zero, right? So how to setup this parameter? I have the following code, is it correct?

phase = 11

call pardiso(...)

phase = 22

call pardiso(...)

if(error=-2)then ! not enough memory

iparm(4) = 72

call pardiso(...)

endif

phase = 33

call pardiso(...)

phase = -1

call pardiso(...)

Thanks,

Zhanghong Tang

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Furthermore, can I decide whether to use direct method or not by iparm(15), just like this?

phase = 11

call pardiso(...)

phase = 22

if(iparm(15)>16777216)then ! 16 GB: 16 kB * 1024 * 1024

iparm(4)=72

endif

call pardiso(...)

if(error=-2)then ! not enough memory

iparm(4) = 72

call pardiso(...)

endif

phase = 33

call pardiso(...)

phase = -1

call pardiso(...)

However, I also noticed iparm(20), when the cgs_error is 5 it means factorization is too fast for this matrix. It is better to use the factorization method with `iparm(4)` = 0

How to process this problem?

Thanks,

Zhanghong Tang

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear all,

Is there any suggestion? Currently I did as what I said in my previous post except the following line changed:

if(max(iparm(15),iparm(16)+iparm(17))>16777216)then ! 16 GB: 16 kB * 1024 * 1024

However, the solution is still done by direct method but the memory usage is larger than 16 GB, usually the peak memory usage reaches to 21 GB.

Could anyone help me to take a look at it?

Thanks,

Zhanghong Tang

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Not many of the non-Intel readers here have multiple machines with 48 GB of RAM. If you could scale down the problem to, say, 4 GB, while still seeing the issues that you have mentioned, you would see more answers.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear mecej4,

Thank you very much for your kindly reply.

I think it is simple to let it switch to iterative solver by 4 GB limit:

if(max(iparm(15),iparm(16)+iparm(17))>4194304)then ! 4 GB: 4 kB * 1024 * 1024

iparm(4)=72

endif

But now the problem is that the solver doesn't work as I set and I don't know what I have missed in parameter setting. Could you please help me to take a look at it?

Thanks,

Zhanghong Tang

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

If you provide complete source for an example, I should be happy to look at it. It is, however, not reasonable to expect us to reconstruct an example ourselves, using as basis a piecemeal description of the problem.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear mecej4,

Thank you very much for your kindly reply. The following are the complete source code I used to solve my problem. Could you please help me to take a look at it?

subroutine solverall_pardiso(nz,irow,jcol,s,n,nrhs,nzr,irz,jrz,rz,x) ! sparse right hand side ! CSR format input for PARDISO USE mkl_pardiso implicit none TYPE(MKL_PARDISO_HANDLE), ALLOCATABLE :: pt(:) INTEGER, ALLOCATABLE :: iparm( : ) integer::maxfct, mnum, mtype, phase, n,nz,nnz,nzr,nrhs, error, msglvl integer,pointer::irow(:),jcol(:),irz(:),jrz(:) real*8,pointer::s(:), rz(:) real*8,pointer::x(:,:),xi(:),bi(:) real*8,allocatable::b(:) integer::i,j,ii,ir,jc,idum(1) real*8::ddum(1) call mkl_set_dynamic(0) ! prevent Intel MKL from dynamically reducing the number of threads in nested parallel regions. maxfct=1 mnum=1 idum=0 ddum=0.0d0 ALLOCATE( iparm ( 64 ) ) do i = 1, 64 iparm(i) = 0 end do iparm(1) = 1 ! no solver default iparm(2) = 3 ! parallel (OpenMP) version of the nested dissection algorithm iparm(4) = 0 ! no iterative-direct algorithm iparm(5) = 0 ! no user fill-in reducing permutation iparm(6) = 0 ! =0 solution on the first n compoments of x iparm(8) = 9 ! numbers of iterative refinement steps iparm(10) = 13 ! perturbe the pivot elements with 1E-13 iparm(11) = 1 ! use nonsymmetric permutation and scaling MPS iparm(13) = 0 ! maximum weighted matching algorithm is switched-off (default for symmetric). Try iparm(13) = 1 in case of inappropriate accuracy iparm(18) = -1 ! Output: number of nonzeros in the factor LU iparm(19) = -1 ! Output: Mflops for LU factorization iparm(20) = 0 ! Output: Numbers of CG Iterations iparm(21) = 1 ! Apply 1x1 and 2x2 Bunch and Kaufman pivoting during the factorization process iparm(24) = 1 ! PARDISO uses new two-level factorization algorithm error = 0 msglvl = 0 mtype = -2 !.. Initiliaze the internal solver memory pointer. This is only ! necessary for the FIRST call of the PARDISO solver. ALLOCATE ( pt ( 64 ) ) do i = 1, 64 pt( i )%DUMMY = 0 end do !.. Reordering and Symbolic Factorization, This step also allocates ! all memory that is necessary for the factorization phase = 11 ! only reordering and symbolic factorization call pardiso (pt, maxfct, mnum, mtype, phase, n, s, irow, jcol, idum, nrhs, iparm, msglvl, ddum, ddum, error) if(error/=0)write(*,*) 'error in solver at step 1, error code = ',error if(error/=0)stop !.. Factorization. phase = 22 ! only factorization ! add new feature: check the peak memory usage by pardiso, if larger than available, use iterative instead of ! direct method ! Zhanghong Tang @ 09/25/2014 if(max(iparm(15),iparm(16)+iparm(17))>16777216)then ! 16 GB: 16 kB * 1024 * 1024 iparm(4)=72 endif call pardiso (pt, maxfct, mnum, mtype, phase, n, s, irow, jcol, idum, nrhs, iparm, msglvl, ddum, ddum, error) if(error/=0)write(*,*) 'error in solver at step 2, error code = ',error if(error/=0)stop !.. Back substitution and iterative refinement iparm(8) = 2 ! max numbers of iterative refinement steps phase = 33 ! only factorization if(allocated(b))deallocate(b) allocate(b(n)) b=0.0d0 do i=1,nrhs xi=>x(:,i) do j=1,nzr if(jrz(j)==i)b(irz(j))=rz(j) enddo call pardiso (pt, maxfct, mnum, mtype, phase, n, s, irow, jcol, idum, 1, iparm, msglvl, b, xi, error) do j=1,nzr if(jrz(j)==i)b(irz(j))=0.0d0 enddo if(error/=0)write(*,*) 'error in solver at step 3, error code = ',error if(error/=0)stop enddo !.. Termination and release of memory phase = -1 ! release internal memory call pardiso (pt, maxfct, mnum, mtype, phase, n, ddum, idum, idum, idum, nrhs, iparm, msglvl, ddum, ddum, error) IF ( ALLOCATED( iparm ) ) DEALLOCATE( iparm ) IF ( ALLOCATED( pt ) ) DEALLOCATE( pt ) if(allocated(b))deallocate(b) end subroutine

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

You provided only the source for the subroutine. Please provide the source code for a main program that calls this subroutine with the input arguments properly filled in. It takes less effort to let the computer try to run the program and produce some results before one looks at the source code line by line. Secondly, whether a problem occurs depends often on the data on which the program is run.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear mecej4,

Thank you very much for your kindly reply. The main program only contains several lines-read sparse matrix and the call the subroutine. However, the sparse matrix is too large (more than 1 GB size) to be uploaded.

The code I pasted can work perfectly if the memory is enough. Now the problem is that it can't switch to iterative solver when estimated memory usage exceed the set value. I don't know where I have missed.

In addition, I don't think it requires my complete project (include the complete sparse matrix data) to reproduce this problem: it is a general issue and the problem will occur when the size of sparse matrix reaches to some value, for example, 1000000.

However, if you will still require the matrix, could you please suggest a space and I can upload the sparse matrix to it?

Thanks,

Zhanghong Tang

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

A short program that generates the matrix elements (by calculation) would work nicely. After all, nobody typed in 1 GB worth of data into a file, did they?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear mecej4,

Thank you very much for your kindly reply.

But I am sorry that the program that generates the matrix elements is not short or simple. It is from electromagnetic FEM problem.

On the other hand, as I said before, that's a general problem and it is not related to my special sparse matrix. That is to say, I think the expert will not need the data to debug the subroutine to point out my problem--the problem that set parameters to let the PARDISO choose work by direct method or iterative method.

Thanks,

Zhanghong Tang

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear Intel administrator,

Is there any document/instruction of PARDISO to describe how to use/launch iterative solver when evaluated memory usage larger than the limit set before? For example, how to launch iterative solver (set iparm(4) to non-zero) when `max`

`(iparm(15),iparm(16)+iparm(17)) is larger than 16777216 (16 GB)?`

`Thanks,`

`Zhanghong Tang`

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi,

The main idea of CGS is to use iterative solver for case when we need to solve several systems with precondition as factorized matrix from one system. So it can be used in case of solve several systems only.

I really recommend to set iparm(60) to 1 and MKL_PARDISO_OOC_MAX_CORE_SIZE to memory size that you ready to use for pardiso. In such case PARDISO will choose internally algorithm to use only memory size that set by MKL_PARDISO_OOC_MAX_CORE_SIZE

Thanks,

Alex

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear Alex,

Thank you very much for your kindly reply. Now the problem is that I will launch 3 or more processes on every machine to solve different systems. The total memory is 48 GB on every machine and I need to solve hundreds of this kind of systems. I am afraid that the write/read large data to/from disk will decrease the speed signification. On the other hand, I am afraid that the disk space is not enough (suppose 40 GB per matrix, the required disk space is at least 40*3 - 48 = 72 GB) since the machines are also used by other people.

I don't know whether the PARDISO do an imcomplete LU factorization according the limit set before and then use CGS to solve the problem.

Thanks,

Zhanghong Tang

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear Alex,

I tried to use the OOC of PARDISO, but the PARDISO returns error code -10. Could you please help me to take a look at it?

Thanks,

Zhanghong Tang

PS: part of the code:

...

iparm(60)=2

i = MAKEDIRQQ('c:\temp\')

if(myid<10)then

write(ss2 , '(i1)') myid

elseif(myid<100)then

write(ss2 , '(i2)') myid

elseif(myid<1000)then

write(ss2 , '(i3)') myid

elseif(myid<10000)then

write(ss2 , '(i4)') myid

elseif(myid<100000)then

write(ss2 , '(i5)') myid

endif

file_name = 'c:\temp\'//trim(ss2)

bufLen = len_trim(file_name)

call mkl_cvt_to_null_terminated_str(buff, bufLen, trim(file_name))

error = pardiso_setenv(pt, PARDISO_OOC_FILE_NAME, buff)

open(1,file='pardiso_ooc.cfg',status='new',err=1)

write(1,*)'MKL_PARDISO_OOC_MAX_CORE_SIZE = ',16384

write(1,*)'MKL_PARDISO_OOC_KEEP_FILE = 0'

close(1)

call pardiso(...)

The program is launched as following command:

mpiexec -wdir "Z:\test" -mapall -hosts 10 n01 2 n02 2 n03 4 n04 4 n05 4 n06 4 n07 4 n08 4 n09 4 n10 4 Z:\fem

where Z: is a mapped driver shared to all hosts.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi,

Look's like PARDISO got problem during reading/writing OOC file. Could you set msglvl to 1 and attach output that pardiso return to the screen?

Thanks,

Alex

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Dear Alex,

Thank you very much for your kindly reply. The screen output is as follows:

=== PARDISO: solving a symmetric indefinite system ===

The local (internal) PARDISO version is : 103911000

1-based array indexing is turned ON

PARDISO double precision computation is turned ON

Parallel METIS algorithm at reorder step is turned ON

Scaling is turned ON

Summary: ( reordering phase )

================

Times:

======

Time spent in calculations of symmetric matrix portrait (fulladj): 1.468480 s

Time spent in reordering of the initial matrix (reorder) : 20.758079 s

Time spent in symbolic factorization (symbfct) : 4.008938 s

Time spent in data preparations for factorization (parlist) : 0.188263 s

Time spent in allocation of internal data structures (malloc) : 0.310438 s

Time spent in additional calculations : 4.130998 s

Total time spent : 30.865196 s

Statistics:

===========

< Parallel Direct Factorization with number of processors: > 24

< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b >

number of equations: 2547445

number of non-zeros in A: 36680014

number of non-zeros in A (): 0.000565

number of right-hand sides: 7

< Factors L and U >

number of columns for each panel: 192

number of independent subgraphs: 0

number of supernodes: 955120

size of largest supernode: 4429

number of non-zeros in L: 750724073

number of non-zeros in U: 1

number of non-zeros in L+U: 750724074

=== PARDISO is running in Out-Of-Core mode, because iparam(60)=2 ===

Percentage of computed non-zeros for LL^T factorization

*** Error in PARDISO (read/write OOC data file) error_num= 0

=== PARDISO: solving a symmetric indefinite system ===

Summary: ( factorization phase )

================

Times:

======

Time spent in allocation of internal data structures (malloc) : 30.196333 s

Time spent in additional calculations : -14.900597 s

Total time spent : 15.295736 s

Statistics:

===========

< Parallel Direct Factorization with number of processors: > 24

< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b >

number of equations: 2547445

number of non-zeros in A: 36680014

number of non-zeros in A (): 0.000565

number of right-hand sides: 7

< Factors L and U >

number of columns for each panel: 192

number of independent subgraphs: 0

number of supernodes: 955120

size of largest supernode: 4429

number of non-zeros in L: 750724073

number of non-zeros in U: 1

number of non-zeros in L+U: 750724074

gflop for the numerical factorization: 639.881780

error in solver at step 2, error code = -11

Thanks,

Zhanghong Tang

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page