Pardiso Crash

h88433 · ‎03-27-2010

Hello,

I am trying to factorize an indefinite matrix using Pardiso with an LDL' factorization. I am solving an eigenvalue problem (structural dynamics) in which Pardiso does the matrix algebra in the reverse communication interface of ARPACK. The LDL' factorization is necessary so that I know how many eigenvalues exist within a predefined range, which is an input to ARPACK. The matrix is indefinite because I have constraint equations in the problem using Lagrange multipliers, which place zeros on the diagonal, and the matrix is partitioned so that the zeros are in the last rows of the matrix. The iparm[] and other settings I have used are identical to those in the pardiso_sym_c.c example, except that I set mtype=-2. I am using the ILP64 interface, and pardiso is defined to return an MKL_INT just like in the example.
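
The eigenvalue count mentioned above rests on Sylvester's law of inertia: the number of negative pivots in an LDL' factorization of (A - sigma*I) equals the number of eigenvalues of A below sigma. The following is only a minimal dense sketch of that idea, not Pardiso code. It assumes a standard eigenproblem (mass matrix equal to the identity), uses no pivoting, and the names count_eigs_below, count_eigs_in_interval and INERTIA_MAX_N are hypothetical.

```c
#include <assert.h>

#define INERTIA_MAX_N 16 /* hypothetical size limit for this dense sketch */

/* Count the negative pivots obtained when eliminating (A - sigma*I), where A
 * is a dense symmetric n x n matrix in row-major storage. For a symmetric
 * matrix these pivots are the diagonal entries D of its LDL' factorization.
 * No pivoting is done: returns -1 if an exactly zero pivot is met. */
int count_eigs_below(const double *a, int n, double sigma)
{
    double w[INERTIA_MAX_N * INERTIA_MAX_N];
    int negative = 0;

    assert(n > 0 && n <= INERTIA_MAX_N);

    /* Work on a shifted copy so the caller's matrix is left untouched. */
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            w[i * n + j] = a[i * n + j] - (i == j ? sigma : 0.0);

    for (int k = 0; k < n; ++k) {
        double d = w[k * n + k];
        if (d == 0.0)
            return -1; /* this sketch cannot continue without pivoting */
        if (d < 0.0)
            ++negative;
        for (int i = k + 1; i < n; ++i) {
            double l = w[i * n + k] / d;
            for (int j = k + 1; j < n; ++j)
                w[i * n + j] -= l * w[k * n + j];
        }
    }
    return negative;
}

/* Number of eigenvalues in (lo, hi), or -1 if either factorization failed. */
int count_eigs_in_interval(const double *a, int n, double lo, double hi)
{
    int below_hi = count_eigs_below(a, n, hi);
    int below_lo = count_eigs_below(a, n, lo);

    if (below_hi < 0 || below_lo < 0)
        return -1;
    return below_hi - below_lo;
}
```

For the 3 x 3 matrix with 2 on the diagonal and -1 on the off-diagonals (eigenvalues 2 - sqrt(2), 2 and 2 + sqrt(2)), calling count_eigs_in_interval with lo = 1.5 and hi = 2.5 finds the single eigenvalue at 2. A sparse solver obtains the same count from the signs of its own pivots.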

Every time I try to run the calculation, I receive an exception fault with phase=11. If I set iparm[26]=1 (C indexing), the analysis phase completes without an exception fault; however, the next phase=22 again quits with an exception fault.

I have run this calculation on the same matrix using Cholmod (in this case without Lagrange multipliers, because Cholmod only solves positive definite matrices), Ma57 from the HSL library, SuperLU, and CXSparse (Univ. Florida). All of these have worked, so I don't understand what I am doing incorrectly. It is difficult to troubleshoot a problem like this without some kind of diagnostic information from the MKL.

Any help would be appreciated, Thanks,
john

Gennady_F_Intel · ‎03-29-2010

john,

one of the problems may be that you forgot to use the compiler options for ILP64 support.

For example

for C/C++, compiling for ILP64 - /DMKL_ILP64

for Fortran - /4I8

and similarly for Linux OS

--Gennady

h88433 · ‎03-31-2010

Hi Gennady,

Thanks again for your reply. In the meantime I have managed to get Pardiso running. It turns out that the problems were:

(1) I was using 0-based ia, ja array indexing instead of 1-based.
(2) I assemble my FE matrices in column format, hence I have to convert them to row format for Pardiso. I still maintain that there is a problem with the MKL conversion routine mkl_dcsrcsc(). I have written my own for the time being. If you are interested in the algorithm, let me know.
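
Both fixes above can be combined in one pass. The following is only a sketch of a generic column-to-row conversion, not the routine john wrote and not mkl_dcsrcsc(). It assumes 0-based CSC input, produces 1-based CSR output, uses plain int indices (an ILP64 build would use MKL_INT instead), and the name csc0_to_csr1 is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Convert a 0-based CSC matrix (col_ptr, row_idx, val) with nrows x ncols
 * entries into a 1-based CSR matrix (ia, ja, a).
 * The caller provides ia[nrows + 1], ja[nnz] and a[nnz].
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int csc0_to_csr1(int nrows, int ncols,
                 const int *col_ptr, const int *row_idx, const double *val,
                 int *ia, int *ja, double *a)
{
    int nnz;
    int *next;

    assert(nrows > 0 && ncols > 0);
    nnz = col_ptr[ncols];

    next = malloc((size_t)nrows * sizeof *next);
    if (next == NULL)
        return -1;

    /* Count the entries of each row, then turn counts into 0-based offsets. */
    for (int r = 0; r <= nrows; ++r)
        ia[r] = 0;
    for (int k = 0; k < nnz; ++k)
        ia[row_idx[k] + 1]++;
    for (int r = 0; r < nrows; ++r)
        ia[r + 1] += ia[r];
    for (int r = 0; r < nrows; ++r)
        next[r] = ia[r];

    /* Scatter column by column, so column indices stay sorted in each row. */
    for (int c = 0; c < ncols; ++c) {
        for (int k = col_ptr[c]; k < col_ptr[c + 1]; ++k) {
            int dst = next[row_idx[k]]++;
            ja[dst] = c + 1; /* 1-based column index */
            a[dst] = val[k];
        }
    }

    /* Shift the row pointers to 1-based indexing. */
    for (int r = 0; r <= nrows; ++r)
        ia[r] += 1;

    free(next);
    return 0;
}
```

For the 3 x 3 matrix with rows (1 0 2), (0 3 0), (4 0 5), the CSC input col_ptr = {0,2,3,5}, row_idx = {0,2,1,0,2}, val = {1,4,3,2,5} becomes ia = {1,3,4,6}, ja = {1,3,2,1,3}, a = {1,2,3,4,5}.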

PS: I have been varying some of the parameters to fine-tune the performance of the Pardiso solve. I have compared the performance to ma57 (HSL). The factor times are roughly equivalent between the two; however, the solve time with Pardiso makes the overall solution time about 50% longer. The solves occur in the reverse communication loop of ARPACK, where 15 natural frequencies and mode vectors are extracted, with matrix order n ~400,000 and about 16 million non-zeros.

I also noticed that the number of threads being used was equal to 5 regardless of what I set iparm(3) to (I have a quad core Xeon processor with 16 GB RAM, running Windows XP). Do you think this could be a performance issue?

The matrix has zeros on the diagonal as well (symmetric indefinite, mtype=-2). Could the issue have something to do with iterative refinement?

Thanks, john

Gennady_F_Intel · ‎03-31-2010

Hi John,

there are at least 2 issues:

- one of them concerns the conversion routine mkl_dcsrcsc(). We were not aware of a problem with this routine producing wrong output. Let's discuss this topic in the original thread.

- the second one concerns Pardiso:

iparm(3) is currently not used. Pardiso automatically detects how many CPU cores are available on the system and will use all of them.

How many right-hand sides do you solve?

--Gennady

h88433 · ‎03-31-2010

Hi Gennady,

Thanks very much for your quick response. I am only solving 1 RHS. Additionally, I just linked against mkl_sequential.lib and repeated the calculations I had done using the threading libraries mkl_intel_thread.lib and libiomp5md.lib. I compile using the Intel C compiler, C99, with -MTd and optimizations -O2, -Ot. I was quite surprised to see that with only a single thread the total computation time was 459 secs. With 5 threads, the calc time was 429 secs. I have a quad core CPU. There must be something I am doing wrong?

Thanks, john

PS: I have to locate the thread about the conversion routine; I'll get back to you later.

h88433 · ‎03-31-2010

Hi Gennady,

I just performed a calculation of the same problem using an LU factorization, mtype=11. This is possible even though the matrix has 12 zeros on the diagonal of ~400,000 equations, because it has an LU factorization due to its principal minors (see wiki if necessary). This time I ran the problem on my dual core at home.
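
The principal-minor argument can be illustrated on a tiny matrix: an LU factorization without row exchanges exists as long as every leading principal minor is nonzero, and a zero in the last diagonal entry does not prevent that. The following is only a dense sketch of that argument; it says nothing about the pivoting Pardiso actually applies for mtype=11, and the name lu_no_pivot is hypothetical.

```c
#include <assert.h>

/* In-place Doolittle LU factorization without pivoting of a dense n x n
 * matrix in row-major storage. On success the strict lower triangle holds
 * the multipliers of L (unit diagonal implied), the upper triangle holds U,
 * and 0 is returned. Returns -1 if an exactly zero pivot is met, which
 * happens when a leading principal minor vanishes. */
int lu_no_pivot(double *a, int n)
{
    assert(n > 0);

    for (int k = 0; k < n; ++k) {
        double pivot = a[k * n + k];
        if (pivot == 0.0)
            return -1;
        for (int i = k + 1; i < n; ++i) {
            a[i * n + k] /= pivot; /* multiplier l_ik */
            for (int j = k + 1; j < n; ++j)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
    }
    return 0;
}
```

For the constraint-style matrix with rows (2 1), (1 0), the leading minors are 2 and -1, so the factorization succeeds with pivots 2 and -0.5 even though the last diagonal entry is zero. For the matrix with rows (0 1), (1 0), the first minor is zero and the routine reports failure.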

The result was a total solution time of 259 sec, and btw. the calculation ran with only one thread, not with the number of CPUs, so there is a discrepancy with regard to your statement about iparm(3). The link was done with the parallel libraries.

I am of course very happy with this performance; it is even better than what I achieved with ma57. I would definitely like to understand what the problem is with the lack of multi-threaded performance on that quad core CPU with the indefinite solver LDL'.

regs, john

h88433 · ‎03-31-2010

Hi Gennady,
Sorry to overload you with replies. I just realized that my unsymmetric matrix calc was actually done with the sequential libraries. I re-linked with the parallel versions and ran the calc again. This time 3 threads on my dual core CPU were active. The run took 288 secs, which is actually an increase in spite of using the parallel libraries ??
regards,john

Gennady_F_Intel · ‎04-01-2010

Hi John,

in general, if we are talking about the solution phase, the performance should be the same because this phase is not threaded. More precisely, the solution phase is threaded only for many RHS. And these performance results, of course, will not depend on which compiler options were used.
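
To put numbers on this, one can ask what share of the run time would have to be parallel to explain the timings john reported (459 secs sequential, 429 secs threaded). The following sketch assumes Amdahl's law, T(n) = T(1) * ((1 - p) + p / n), which is a simplification that ignores threading overhead; the name implied_parallel_fraction is hypothetical.

```c
#include <assert.h>

/* Solve Amdahl's law T(n) = T(1) * ((1 - p) + p / n) for the parallel
 * fraction p, given the measured sequential time, the measured threaded
 * time and the number of threads n. */
double implied_parallel_fraction(double t_sequential, double t_threaded,
                                 int nthreads)
{
    double ratio;

    assert(t_sequential > 0.0 && t_threaded > 0.0 && nthreads > 1);

    ratio = t_threaded / t_sequential;
    return (1.0 - ratio) / (1.0 - 1.0 / (double)nthreads);
}
```

With t_sequential = 459, t_threaded = 429 and nthreads = 5, the result is roughly 0.08, i.e. under this model only about 8% of the run time benefited from threading. That is consistent with a run dominated by repeated single-RHS solves inside the ARPACK loop.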

--Gennady

Gennady_F_Intel · ‎04-01-2010

John,

What do you mean by the total solution time? Is this the execution time for all calculation phases, i.e. phase 11 + 22 + 33?

Interesting: why were 3 threads running on a dual core system? Maybe some thread oversubscription happened...

--Gennady

h88433 · ‎04-01-2010

Hi Gennady, yes, phases 11, 22, 33. On my dual core processor, 3 threads ran with the parallel libs. On my quad core processor, 5 threads ran with the parallel libs.
john

h88433 · ‎04-01-2010

Hi Gennady, I was just doing some reading in the Pardiso manual (University of Basel) and read the following:

(o) Reproducibility of exact numerical results on multi-core architectures. The solver is now able to compute the exact bit identical solution independent on the number of cores without effecting the scalability. Here are some results for a nonlinear FE model with 500'000 elements.

Intel MKL PARDISO 10.2

1 core - factor: 17.980 sec., solve: 1.13 sec.

2 cores - factor: 9.790 sec., solve: 1.13 sec.

4 cores - factor: 6.120 sec., solve: 1.05 sec.

8 cores - factor: 3.830 sec., solve: 1.05 sec.

U Basel PARDISO 4.0.0:

1 core - factor: 16.820 sec., solve: 1.09 sec.

2 cores - factor: 9.021 sec., solve: 0.67 sec.

4 cores - factor: 5.186 sec., solve: 0.53 sec.

8 cores - factor: 3.170 sec., solve: 0.43 sec.

This method is currently only working for symmetric indefinite matrices.

This seems to be consistent with what I am experiencing. Do we get updated versions of Pardiso from Basel ?

regards, john

Gennady_F_Intel · ‎04-02-2010

john,

How can we reproduce the problem?

Can you give us your matrix or (much better) a test case which will help us reproduce the problem on our side?

You can use the private thread for that.

--Gennady