Pardiso : looking for some information to speed up my code

velvia · ‎09-29-2011

Hi,

Let me explain my problem first. I want to solve a non-linear partial differential equation of the kind :

dP/dt = div(k grad(P))

where k is a function of P. For that, I use a fully implicit scheme that leads to solving an equation of the kind :

f_n(P(n+1), P(n)) = 0

where f_n is a non linear function and P(n) represents the state of my problem at time t(n). In order to solve that, I use a Newton method where I need to solve some linear equations. The code has to handle 1D, 2D and 3D cases, and I call k the number of volume elements.

The matrix that I have to deal with has always the same size (k by k), the shape for the position of nonzeros, that is to say, for 1D a tridiagonal matrix, in 2d a pentadiagonal matrix (the width of the band begin k^(1/2)) and in 3d a matrix with 7 diagonals (the width of the band being k^(2/3)). But the elements of the matrix are changing form iteration to iteration. To solve this matrix, i use a parallel version of Pardiso, running on 8 cores. What I started to do is :

- Do the symbolic factorization only once at the beginning

- For each iteration, do the LU decomposition and the substitution.

I ran into the following problems : the results were not accurate. I then realized that the fact that the numbers in the matrix were changing were at the origin of my problem. So I did some symbolic factorization from time to time (each time the Newton method does not converge). Then It worked. It seems that doing the substitution takes a 1/4 of the time of doing the LU decomposition.

Then What I've tried to play with iparm(4) : I do a LU decomposition for the first step of the Newton iteration and I use an iterative method for the other Newton iteration. It uses the first LU decomposition as a preconditionner. I was very surprised that this method does not speed up, neither slow down the code. It just runs at the same speed.

So, here are my questions :

- Do I really need to do a symbolic factorization from time to time to prevent the loss of accuracy, or is there any other way ?

- Why does the iterative process does not speed up the code ?

- Any idea on how to improve the speed ?

- I still do not understand how pardiso can parallelize the LU-decomposition and the substitution. Can I find some general information about how those things are parallelized ?

Thanks for reading,

Francois

Sergey_K_Intel1 · ‎09-30-2011

Dear Francois,
Please find my answers below

- Do I really need to do a symbolic factorization from time to time to prevent the loss of accuracy, or is there any other way ?

According to your descriptionyou don't need to do a symbolic factorization used at phase=11since values of sparse matrices are only changing. But certainly you have to do a numerical factorization from time to time by callingPARDISO with phase=22 or phase=23.

- Why does the iterative process does not speed up the code ?

Would you pleaselet us knowthe value of iparm(20) which is CG/CGS diagnostics? I'd assumeit happens becauseyou call PARSDISO with phase=23, CG fails to converge and as a consequence PARDISO has to factorize the matrix again.

Let me remind that CG/CGS in PARDISO uses thefollowing strategy

"If phase =23, then the factorization for a given A is automatically recomputed in cases where the Krylow-Subspace iteration failed, and the corresponding direct solution is returned. Otherwise the solution from the preconditioned Krylow-Subspace iteration is returned. Using phase =33 results in an error message (error =-4) if the stopping criteria for the Krylow-Subspace iteration can not be reached. More information on the failure can be obtained from iparm(20)."

- Any idea on how to improve the speed ?

Would you please provide more details about the platforms, the matrix type and PARDISO settings (iparm values from 1 to 30 elements)?

- I still do not understand how pardiso can parallelize the LU-decomposition and the substitution. Can I find some general information about how those things are parallelized ?

Here are a couple ofreferences where you can finda general description of the numerical methods used in PARDISO

O. Schenk, K. Gartner, and W. Fichtner. Efficient Sparse LU Factorization with Left-right Looking Strategy on Shared Memory Multiprocessors. BIT, 40(1):158-176, 2000.

O. Schenk and K. Gartner. Two-level scheduling in PARDISO: Improved Scalability on Shared Memory Multiprocessing Systems. Parallel Computing, 28:187-197, 2002.

More references can be found in the MKL Reference Guide.

All the best
Sergey

velvia · ‎09-30-2011

Dear Sergey,

Thank you very much for taking the time to answer my questions.

1 - Symbolic Factorization

Let me be more precise about the fact that the values of the "non-zero" elements are the only ones that change. The vectors "ia" and "ja" never change from one call to another, but the values of the elements stored in "a" change at each iteration, and some of them can be equal to 0 in some of them.

What I found out, is that when I do only one symbolic factorization at the beginning, the results given by pardiso when solving A.x = y are less accurate than the ones I get when I do a symbolic factorization for every iteration. That's why I have decided to do a symbolic factorization from time to time, which solved my problem (The percentage of non-zeros in the array "a" start at 25% and slowly grows up to 100%).

By the way, it seems that pardiso crashes when you set a = 0 during the phase=11, meaning that the symbolic factorization also use those values (Perhaps to choose the pivoting sequence).

2 - Iterative process

It seems that it converges because with iparm(4)=31, I get iparm(20)=1, 2 or 3.

3 - Improving speed

I develop on the following platform :

Computer 1 : The operating system is Linux and a cat /proc/cpu gives me a processor count of 8, withIntel Xeon CPU E5410 @2.33GHz, the memory being 64Gb

In the end, I would like to target the following platform (Which I don't develop on because there is a queue system that makes it a pain to use for debugging)

Computer 2 : The operating system is Linux, and a cat /proc/cpu gives me a processor count of 512, withIntel Xeon CPU X7560 @ 2.27GHz, the memory being 2To (It is a SGI Altrix)

I use the following iparms values :

allocate(this%iparm(64))

this%iparm = 0

this%iparm(1) = 1 ! Don't use default values

this%iparm(2) = 2 ! Use the METIS reordering scheme

this%iparm(5) = 0 ! Don't use a user permutation

this%iparm(6) = 0 ! Write the solution on another vector

this%iparm(8) = 0 ! Iterative refinement step

this%iparm(9) = 0 ! Nothing here, but should be set to 0

this%iparm(10) = 13 ! Pivoting permutation (1.0e-13)

this%iparm(11) = 1 ! Scaling vector

this%iparm(12) = 0 ! Solve the problem A.x = b

this%iparm(13) = 1 ! improved accuracy using (non-)symmetric weighted matchings

this%iparm(18) = -1 ! Report the number of nonzeros in the factors

this%iparm(19) = 0 ! Don't compute the MFLOPS (it increases computation time)

this%iparm(21) = 0 ! pivoting for symmetric indefinite matrices (Useless for unsymmetric matrices)

this%iparm(24) = 1 ! Parallel factorization control (new version)

this%iparm(25) = 0 ! parallel forward/backward solve control

this%iparm(27) = 0 ! Does not check the matrix

this%iparm(28) = 0 ! Work with double precision

this%iparm(31) = 0 ! Don't expect a sparse second member

this%iparm(32) = 0 ! Reserved for future version

this%iparm(33) = 0 ! Reserved for future version

this%iparm(34) = 0 ! Reserved for future version

this%iparm(35) = 0 ! First value of the matrix is referred as 1 (Fortran style)

this%iparm(60) = 0 ! Use in-core Pardiso (Don't use hard drive for memory)

this%iparm(61) = 0 ! Reserved for future version

this%iparm(62) = 0 ! Reserved for future version

this%iparm(63) = 0 ! Reserved for future version

this%iparm(64) = 0 ! Reserved for future version

For iparm(4), I use 0 for the first call to pardiso of every Newton iteration, and 31 for the following calls.

I forgot to tell you that the structure of the matrix (given by "ia" and "ja") is symmetric but the values in "a" make the matrix unsymmetric. Therefore, I use mtype = 11.

The target for the size of the matrix is 10^6. I can send you a description of the non-zero pattern, or an example of the matrix in a text file. Which Matrix format do you want ? Is the format : i, j, a_{i,j} good for you ?

I'll look into the papers you give me. The title seems to really correspond to what I am looking for.

Francois

Sergey_K_Intel1 · ‎10-03-2011

Deat Francois,

Thanks a lof for the information.

As concerns a crash in symbolic factorization when a=0 it happens because of the settings
this%iparm(11) = 1 ! Scaling vector

this%iparm(13) = 1 ! improved accuracy using (non-)symmetric weighted matchings
These settings requires values for the symbolic factorization and since a=0, all scaling factors are zeros as well.

Normally it is recommended to keep only nonzero elements in thesparse matrix. Soit would be betterto avoid many zeros in the sparse matrix and I'd propose toform new"ia" and "ja" from time to time.Probably it helps to reduce the factorization time andthe usagethe CGS solver fromPARDISOwith PHASE=33 for a few next iterationswill allow you to reduce the time.

Would you please toset MSGLVL (the message level information) to 1 and post the zipped log file? By the way what is the total number of the Newton iterations with and without CGS iterations?

Thanks in advance
All the best
Sergey

velvia · ‎10-03-2011

Dear Sergey,

For the symbolic factorization phase, thank you for the information. I've decided to do a symbolic factorization once in a while (every time the Newton method does not converge) without changing ia and ja. If I want to change them, it is going to be a major rewrite of important part of the code. I'll do that later.

So you know, if the matrix is n by n, and the problem is 2d, the number of "non-zero" elements (the length of the array a) is around 10n and the number of actual non-zeros (the number of non-zeros in the array a) is in between 5n and 10.

For the number of newton-iterations with and without CGS, It seems to be almost the same.

Here is a log of :

First Newton iteration :Symbolic - LU - solve

Second Newton : CGS

Third Newton : CGS

All the best

Francois

Sergey_K_Intel1 · ‎10-04-2011

Dear Francois,

Thanks a lot. Now I see what is the problem and itmight bea bug in CGS since PARDISO should not have recomputed the LU factorsin your case. Would you please provideatest case with data toreproduce the problem?

Please let us know your e-mail addressbyprivate postand we will inform you the results of investigation.

Thanks in advance
All the best
Sergey

velvia · ‎10-05-2011

Dear Sergey,

I will work on giving you a test case. Could you please tell me where in my log you see that the LU decomposition is computed where it should not be ?

Francois

Sergey_K_Intel1 · ‎10-06-2011

Dear Francois,

I'm not completely sure andI need to check it out.But the solver step for your system generallytakes

Times:
======
Time solve : 0.036523 s

So2 CGS iterations should take about 0.072 plus somevery small time for additonal computation.I'd estimatethe upper limit for CGS as 0.1 sec.In your log I see

iterative : 31 1

================ PARDISO: solving a real nonsymmetric system ================

Summary PARDISO: ( factorize to solve )
================

Times:
======
Time cgs : 0.203449 s cgx iterations 2

It might be the statistics is wrong (say the number iterations is not correct) or PARDISO recomputed the LU factors.

All the best
Sergey

velvia · ‎10-12-2011

Dear Sergey,

I think I have made a mistake in the way I was using Pardiso. What I used to do is to set iparm(4) to 0 for the first Newton iteration, and then change it to 31 for the other ones. It seems that it is a mistake, and that one should set iparm(4) to 31 from the very beginning. Pardiso is then taking care of doing the LU-decomposition for the first iteration and doing an iterative method for the next ones.

Here is the log I have when I set iparm(4) to 31 from the very beginning. I don't think there is a bug in Pardiso. What is your take on that ?

Francois

velvia · ‎10-17-2011

Dear Sergey,

Here is an example code if you want to test it out.

Francois

Sergey_K_Intel1 · ‎10-19-2011

Dear Francois,

Sorry for the delay taken in answering and many thanks for providing the test code.

The last usage model is correct.

I investigated you code and I've not found any bug in PARDISO. To improve the performance of your code. I'd reccomend to use the parallel METIS by setting iparm(2)=3 and it essentiallyreduces the reordering time.

Your code also work well if we turn off scaling and matching by setting

    iparm(11) = 0  ! Scaling vector


    iparm(13) = 0  ! improved accuracy using (non-)symmetric weighted matchings

It looks like your system is well-conditioned but  I'd rather check how it works in the whole application.







I'll let you know if I found further performance improvements.


 
All the best
Sergey

velvia · ‎10-27-2011

Hi Sergey,

Thanks for your tips.

So you are not confused, the code I sent you is not the code I am running. I just can't send you the code I run because I am not allowed to. So the matrix is not exactly the one I've sent you. By the way, how do you check the condition number with pardiso ?

I have tried to change iparm(11) and iparm(13) as you've said (Even just one of them) but it does not reduce the computation time. It even increases it (as there are more Newton iteration).

Best regards,

Franois