In a recent run, the matrix to be inverted was 60200 by 60200. The machine running the code is an HP DL980 server with 64 cores at 2.4 GHz and 1 TB of memory.
The solution process appears to run in three stages. First, the code uses only one core for about 90-150 hours (depending on the frequency), with about 170 GB of memory. Then the parallel part kicks in and uses up to 32 cores for about 8-10 hours, with up to 320 GB of memory. Finally, the run drops back to one core for about 10-15 hours, with 120-170 GB of memory. (All times are calendar time; CPU time is about 400 hours in total.)
The most frustrating part of the run is obviously the first stage: 90-150 hours on a single core is about 4-6 days, during which the other 63 cores sit idle. Is there any way to speed up this stage and use the other cores? Even a factor of two (maybe factorizing odd and even rows concurrently?) would greatly improve performance. Is there any pre-processing we can apply to the matrix to make it run faster?
I'd appreciate any input and ideas on how to improve the code.
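As a rough check on the figures quoted above, the sketch below adds up the wall-clock ranges for the three stages, the share of time spent in the single-core first stage, and an upper bound on core-hours. Pairing the low ends together and the high ends together is an assumption made for illustration (the post only gives ranges across frequencies), and the names `STAGES` and `summarize` are hypothetical.

```python
# Wall-clock ranges (hours) and maximum core counts as quoted in the post.
STAGES = [
    ("stage 1, one core", 90, 150, 1),
    ("stage 2, up to 32 cores", 8, 10, 32),
    ("stage 3, one core", 10, 15, 1),
]


def summarize(stages):
    """Return wall-clock and core-hour totals for the low and high ends."""
    wall_low = sum(low for _, low, _, _ in stages)
    wall_high = sum(high for _, _, high, _ in stages)
    # Upper bound on core-hours: assumes every core is busy for the whole stage.
    core_low = sum(low * cores for _, low, _, cores in stages)
    core_high = sum(high * cores for _, _, high, cores in stages)
    return wall_low, wall_high, core_low, core_high


wall_low, wall_high, core_low, core_high = summarize(STAGES)
stage1_share_low = STAGES[0][1] / wall_low
stage1_share_high = STAGES[0][2] / wall_high

print(f"wall clock: {wall_low}-{wall_high} h "
      f"({wall_low / 24:.1f}-{wall_high / 24:.1f} days)")
print(f"stage 1 share of wall clock: "
      f"{stage1_share_low:.0%}-{stage1_share_high:.0%}")
print(f"core-hour upper bound: {core_low}-{core_high} h")
```

Under these assumptions the first stage accounts for more than 80% of the elapsed time, and the reported CPU total of about 400 hours falls inside the core-hour bound.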
How is the parallel parameter set in your computation? To enable parallelism in phase 1, iparm(2) needs to be set to 3:
If iparm(2) = 3, the parallel (OpenMP) version of the nested dissection algorithm is used. It can decrease the time of computations on multi-core computers, especially when PARDISO Phase 1 takes significant time.
The default value of iparm(2) is 2.
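The setting described above can be sketched as follows. This is only an illustration of the indexing, not a call into the solver: iparm(2) in Fortran notation is element 1 of a 0-based array, and in MKL's PARDISO the supplied values are only honored when iparm(1) is nonzero (with iparm(1) = 0 the solver fills in defaults). The helper name `make_iparm` is hypothetical.

```python
IPARM_LEN = 64  # PARDISO's iparm array has 64 entries


def make_iparm(settings):
    """Build a 0-based iparm list from {fortran_index: value} pairs."""
    iparm = [0] * IPARM_LEN
    # iparm(1) != 0 tells the solver to use the supplied values, not defaults.
    # Every other entry the application relies on must then be set explicitly;
    # this sketch leaves them at zero for brevity.
    iparm[0] = 1
    for fortran_index, value in settings.items():
        if not 1 <= fortran_index <= IPARM_LEN:
            raise ValueError(f"iparm index out of range: {fortran_index}")
        iparm[fortran_index - 1] = value  # Fortran iparm(k) is element k-1 here
    return iparm


# Request the parallel (OpenMP) nested dissection reordering for phase 1.
iparm = make_iparm({2: 3})
print(iparm[:4])
```

In a Fortran caller the equivalent assignment would simply be iparm(2) = 3 before the phase 1 call.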
By the way, is iparm(2)=3 an Intel-specific implementation? I checked the latest PARDISO manual (ver. 4.1.2, updated 2/12/2011) and it does not mention this option. Is the latest Intel version of PARDISO compatible with the original author's latest version? For example, manual 4.1.2 defines iparm(n) for n = 34, 51, 52, which relate to parallel processing but are not defined in the MKL help file. Are these options still valid in Intel's version?
Also, iparm(3) is not defined in Intel's help file. Can I just leave it at 0? How can we determine the number of cores to be used in the process?
The following is the content of the pull-down menu "project properties - configuration properties - linker - command line". I hope this is the line you asked for.
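On the question of core count, a hedged note: Intel MKL's PARDISO documents iparm(3) as reserved (set to zero) and takes its thread count from the usual MKL/OpenMP controls, such as the MKL_NUM_THREADS and OMP_NUM_THREADS environment variables, whereas the original PARDISO uses iparm(3) for the number of processors. The sketch below shows one way to prepare such an environment before launching a run. The helper name `thread_env` is hypothetical, and the launch line is left commented out because the executable and its arguments are specific to the poster's setup.

```python
import os


def thread_env(num_threads):
    """Return environment variables that request a thread count from MKL/OpenMP."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    value = str(num_threads)
    # MKL_NUM_THREADS applies to MKL only; OMP_NUM_THREADS is the general
    # OpenMP setting. Both are set here so they agree.
    return {"MKL_NUM_THREADS": value, "OMP_NUM_THREADS": value}


# Copy the current environment and add the thread settings for a child process.
env = dict(os.environ)
env.update(thread_env(32))

# Hypothetical launch of the solver executable with that environment:
# import subprocess
# subprocess.run(["ANALYS.exe"], env=env, check=True)
print(env["MKL_NUM_THREADS"], env["OMP_NUM_THREADS"])
```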
/OUT:"x64\Release\ANALYS.exe" /INCREMENTAL:NO /NOLOGO /LIBPATH:"c:\program files (x86)\intel\composer XE 2011 SP1\mkl\lib\intel64" /LIBPATH:"c:\program files (x86)\intel\composer XE 2011 SP1\lib\intel64" /MANIFEST /MANIFESTFILE:"D:\SASSI2010\ANALYS\ANALYS\x64\Release\ANALYS.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /SUBSYSTEM:CONSOLE /IMPLIB:"D:\SASSI2010\ANALYS\ANALYS\x64\Release\ANALYS.lib" mkl_intel_ilp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
Please let me know whether this answers your question, and please advise on what I should change to improve performance, especially in phases 1 and 3.
As I just answered in reply to your other post, the smallest example I can send you is 420 MB. The matrix inversion for this one takes about 250 seconds on my machine (HP DL380, 24 cores, 192 GB RAM), but it should be enough to observe the single-core operation in the first phase. As for the calling routines for PARDISO, they are inside a large, multi-module program, but I can cut out the relevant piece of the routine for you so you can check whether the parameters I use are correct. Please provide an email address so I can use a file transfer service to send the files.
I just wonder why you call PARDISO when dealing with a dense matrix (if I correctly understood your term 'full populated'). LAPACK is more suitable for such a case.
And the second question: do you need to solve a system of equations, or invert a matrix as the topic title says?
PARDISO was originally implemented in the code as the solver for the global equations (something very big and sparse), and this dense matrix is a component that has to be inverted before it is assembled into the global matrix. Since PARDISO was already in, it was a simple matter to write a conversion routine for the index arrays and make another call. I didn't use LAPACK because (1) I was not sure whether LAPACK can handle a matrix of this size (see my post #1), and (2) I don't know whether the LAPACK routines are fully parallelized. I would certainly be willing to try the LAPACK routines if you can confirm that these concerns are not a problem.
As for your second question, I think solving a system of equations and inverting a matrix are the same thing as far as factorization and forward substitution go, except that the latter needs to back-substitute the unit vectors as right-hand sides n times. Do you have a more efficient routine that can cut down the execution time for any of the three phases in matrix inversion?
Correct, but note that n-1 of the n back-substitutions are wasted. Because of the way the typical L-U factorization is carried out in LAPACK, you are also going to do n-1 wasted forward eliminations.
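To make the comparison concrete, here is a minimal pure-Python sketch of a dense LU factorization with partial pivoting, used once to solve a single system and n times (one unit vector per column) to build the inverse. It is a textbook illustration on a 3 by 3 example, not the PARDISO or LAPACK implementation, and the function names are hypothetical.

```python
def lu_factor(a):
    """LU-factor a square matrix with partial pivoting; return (lu, perm)."""
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))  # row k of the permuted matrix is original row perm[k]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]  # multiplier, stored in the L part
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm


def lu_solve(lu, perm, b):
    """Solve A x = b using the factors: one forward and one back substitution."""
    n = len(lu)
    x = [b[perm[i]] for i in range(n)]  # apply the row permutation to b
    for i in range(n):  # forward substitution with unit lower triangle
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):  # back substitution with upper triangle
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


def inverse_via_solves(a):
    """Build the inverse column by column: n solves with unit right-hand sides."""
    n = len(a)
    lu, perm = lu_factor(a)
    cols = []
    for k in range(n):
        unit = [1.0 if i == k else 0.0 for i in range(n)]
        cols.append(lu_solve(lu, perm, unit))
    return [[cols[j][i] for j in range(n)] for i in range(n)]


a = [[4.0, 3.0, 0.0], [3.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [24.0, 30.0, -24.0]

lu, perm = lu_factor(a)
x = lu_solve(lu, perm, b)    # one pair of substitutions
inv = inverse_via_solves(a)  # n pairs of substitutions, n*n extra storage
print(x)
```

Both routes share the same factorization; the inverse route repeats the substitutions n times and stores a full n by n result, which only pays off if the inverse itself is what the application needs.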
Apart from that, you waste memory to hold the inverse matrix.
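For a sense of scale, the arithmetic below estimates what one dense copy of a 60200 by 60200 matrix costs in memory, and whether its entry count fits in a 32-bit integer. The thread does not say whether the entries are real or complex, so both are shown as assumptions; the function name `dense_storage` is hypothetical.

```python
N = 60200  # matrix order quoted in the thread
INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer


def dense_storage(n, bytes_per_entry):
    """Return (number of entries, bytes) for a dense n-by-n matrix."""
    entries = n * n
    return entries, entries * bytes_per_entry


# 8 bytes per entry: double-precision real (assumption).
entries, real_bytes = dense_storage(N, 8)
# 16 bytes per entry: double-precision complex (assumption).
_, complex_bytes = dense_storage(N, 16)

# If the entry count exceeds INT32_MAX, a flat index needs 64-bit integers.
needs_64bit_index = entries > INT32_MAX

print(f"entries: {entries}")
print(f"real double: {real_bytes / 1e9:.1f} GB")
print(f"complex double: {complex_bytes / 1e9:.1f} GB")
print(f"needs 64-bit indexing: {needs_64bit_index}")
```

That is roughly 29 GB per copy in real double precision, or 58 GB if the entries are complex, and the entry count is beyond the 32-bit integer range, which is relevant to the earlier question about whether a dense routine can handle a matrix of this size.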