Speed up the matrix inversion computation with PARDISO?

Nan_Deng · ‎03-27-2012

The most time consuming part of my application is toinvert a fully-populated complex matrix. I used the PARDISO solver from MKL, and the process works fine, except when the matrix getting larger.

In a recent runthe matrix to be inverted is of the size 60200 by 60200. The machine running the code is an HP DL980 server with 64 cores at 2.4GHz, 1TB memory.

The solutionprocess appears to be in three stages:first the codeuses only one core for about 90-150 hours (for different frequencies) using about 170GB of memory, then the parallel processing part kicking in, uses up to 32 cores for about 8 - 10 hours with up to 320 GB of memory, finally the run uses one core again for about 10-15 hours, with 120 - 170 GB of memory (All times are calendar time. CPU time is about 400 hours in total).

The most frustrating period during the run is obviously the first stage. 90 - 150 hours for a single processor are about 4 - 6 days while all other processing power of 63 cores are wasted. I wonder if there is any way to speed up the period and utilize the power of other cores? Even just a factor of two (may be factorize odd and even rows concurrently?) would be greatly improve the performance. Is there any pre-processor we can do to the matrix to get it run faster?

I'd appreciate any input and ideas on how to improve the code.

Nan Deng

Chao_Y_Intel · ‎03-27-2012

Hello,

How is the parallel parameter set in the computation? To enable the parallel in teh phase 1, it needs to set the iparm(2) as 3:

iparm:

If iparm(2) = 3, the parallel (OpenMP) version of the nested dissection algorithm is used. It can decrease the time of computations on multi-core computers, especially when PARDISO Phase 1 takes significant time.

The default value of iparm(2) is 2

Thanks,
Chao

Nan_Deng · ‎03-28-2012

Thanks for your suggestion. I tried to useiparm(2)=3 and tried a smaller size model, but there is no improvement for over all solution time. Did I miss anything? Should I try some other parameters?

By the way, is iparm(2)=3 implemented by Intel? I checked the latest PARDISO manual (ver. 4.1.2, updated 2/12/2011) and there is no mention about this option. Is the latest intel version of pardiso compatible with the original author's latest version? Say in the manual 4.1.2, there aredefinitions for iparm(n)n = 34, 51, 52 whichrelated to parallel processing but are not defined in the MKL's help file, are these options still valid under intel's version?

Also, iparm(3) is not defined in intel's help file, can I just leave it0? how can we determine the number ofcores to be used in the process?

Thanks,
Nan Deng

Alexander_K_Intel2 · ‎03-28-2012

Hi,

Do you use Intel MKL PARDISO or PARDISO from Basel? If you use Intel PARDISO could you provide names of MKL library which you use to compile your project? If you link serial version of libraries then only one thread will use for PARDISO solver, if use parallel version and don't set some additional parameters parallel version of PARIDSO with maximum number of threads will be used. That's why I asked about your linkline.

With best regards,

Alexander Kalinkin

Nan_Deng · ‎03-29-2012

I use the latest version of the composer, installed in my Windows 2008 Visual Studio shell from previous installation. And I use PARDISO from MKL.

The following is the content in the pull-down menu "project properties - configuration properties - linker - command line". I hope this is the line you asked.

/OUT:"x64\Release\ANALYS.exe" /INCREMENTAL:NO /NOLOGO /LIBPATH:"c:\program files (x86)\intel\composer XE 2011 SP1\mkl\lib\intel64" /LIBPATH:"c:\program files (x86)\intel\composer XE 2011 SP1\lib\intel64" /MANIFEST /MANIFESTFILE:"D:\SASSI2010\ANALYS\ANALYS\x64\Release\ANALYS.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /SUBSYSTEM:CONSOLE /IMPLIB:"D:\SASSI2010\ANALYS\ANALYS\x64\Release\ANALYS.lib" mkl_intel_ilp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib

Please let me know if I answered your question and please advise on where I should go to improve, especiall the performance in phase 1 and 3.

Thanks,
Nan Deng

Alexander_K_Intel2 · ‎03-29-2012

HI,

Thanks for answering, your linkline seems to be good. Could you also send your example with pardiso call to reproduce problem on our side?

With best regards,

Alexander Kalinkin

Nan_Deng · ‎03-29-2012

Hi,

As I just answered your reply on the other post, the smallest example I can send you is of the size 420MB, the matrix inversionfor this one takes about 250 sec on my machine (HP DL380, 24 Core, 192 GB RAM), but it should be enough to observe the single-core operation for the first phase. As for the calling routines for PARDISO, it is inside a large, multiple module program, but I can cut the piece of the routine to you so you can see whether the parameters I use are correct. Please provide an email address so I can usefile transfer service to send the files.

Thanks,
Nan Deng

Victor_K_Intel1 · ‎03-30-2012

Hi Nan Deng,

I just wonder why dealing with dense matrix (if I understood correctly your term 'full populated') you call PARDISO? LAPACK is more suitable for such case.
And the second question: Do you need to solve a system of equations or invert a matrix as the topic intitled?
Thanks
Victor

Nan_Deng · ‎03-30-2012

Hi Victor,

PARDISO was originally implemented in the code as the solver for global equations (something very big and sparse), and this dense matrix is a component of the global matrix which has to be inverted to be assembled into theglobal matrix. Since pardiso is already in, it was a simple matter to write a conversion routine for the index arrays and make another call. I didn't use LAPACK because (1) I was not sure whether LAPACK can handle matrix of this size (see my post #1), and (2) I don't know whether LAPACK routines are fully parallelized. I certainly would be willing to try LAPACK routines if you can confirm that my concerns are not the problem.

For your second question, I think solving a system of equations or inverting a matrix are the same thing for factorization and forward substitution, except the latter need to backsubstitute the unit vector as rhs n times. Do you have a more efficient routine that can cut down the execution time for any of the three phases in matrix inversion?

Thanks,
Nan Deng

styc · ‎03-30-2012

In most cases, when you need to explicitly invert a matrix, you are doing something seriously wrong. Both accuracy-wise and performance-wise. By matrix inversion you boost the condition number of the problem. This unnecessarily invites extra numerical errors. I don't know about your machine, but I have four quad-core AMD Opteron 8350's bought in 2007 running at 2 GHz. LAPACK's DGETRF factorizes a 60200-by-60200 matrix in 25 minutes using those 16 cores.

mecej4 · ‎03-31-2012

> For your second question, I think solving a system of equations or inverting a matrix are the same thing for factorization and forward substitution, except the latter need to backsubstitute the unit vector as rhs n times.

Correct, but note that that n-1 of the n back-substitutions are wasted. Because of the way that the typical L-U factorization is carried out in Lapack, you are also going to do n-1 wasted forward eliminations.

Apart from that, you waste memory to hold the inverse matrix.

Nan_Deng · ‎03-31-2012

Appreciate your timing data. Since my matrix is complex*16, I would assume that I need to call ZGETRF and the total number of Gflops may be a factor of 4. But even that I can get this part down in less than 2 hours, that would be great. I certainly will test this routine out. As for why the inversion of the matrix is necessary, it is determined by the methodology and the physics of theunderlying model. I certainly would welcome all suggestions on how to improve the code. The code deals with problems in earthquake engineering field. If you are interested, let me know and I can provide more technical details.

styc · ‎03-31-2012

Then probably youwant to rework your methodology to eliminate explicit inversion. Unless you absolutely need individual elements of A^-1, in most cases there is no reason to explicitly form it, even with LAPACK (GETRI is way slower than GETRF), especially when you need only products like A^-1x.