topic Re: PARDISO- scalability & CGS questions in Intel® oneAPI Math Kernel Library

PARDISO- scalability & CGS questions

michel_lestrade — Tue, 05 Jan 2010 22:09:09 GMT

Hi,

I have 2 sets of questions on PARDISO.

The first is that there is a claim of:

For sufficiently large problem sizes, numerical experiments demonstrate that the scalability of the parallel algorithm is nearly independent of the shared-memory multiprocessing architecture and a speedup of up to seven using eight processors has been observed.

Is there an example available that we can use to reproduce this ourselves ? Do we need a dual-socket machine to see it (since MKL does not benefit from hyperthreading) ? Is this speedup possible for unsymmetric matrices (mtype=11) or just for symmetric cases ?

The second set of questions regards the use of CGS. I am not sure if we are doing the right thing and a complete example using this method would be appreciated.

At the moment, our application code is set up so phase=11 is called only once. On subsequent calls, we would like to use CGS whenever possible and avoid numeric factorization.

Does that mean that we should call phase=23 with CGS (iparm(4)=61) ? In that case, how should we deal with possible failures of the CGS iteration ? The manual says:

If phase =23, then the factorization
for a given A is automatically recomputed in these cases
where the Krylow-Subspace iteration failed, and the
corresponding direct solution is returned.

Does that mean that upon failure of the CGS, we have to restart with phase=23 with iparm(4)=0 or does PARDISO do that on its own ? I think it is the latter but some of my colleagues read this passage differently.

If the restart is automatic, is the iterative refinement being done at the end or should we call phase=33 just to be sure ?

Thanks.

Michel Lestrade
Crosslight Software

Re: PARDISO- scalability & CGS questions

Konstantin_A_Intel — Mon, 11 Jan 2010 09:46:55 GMT

Hi Michel,
I will try to answer all your questions.

>>>Is there an example available that we can use to reproduce this ourselves ?
There are a lot of publicly available matrices, e.g., here:

http://www.cise.ufl.edu/research/sparse/matrices/list_by_nnz.html

Just select one that is big enough.

>>>Do we need a dual-socket machine to see it (since MKL does not benefit from hyperthreading)?
Currently, yes.

>>>Is this speedup possible for unsymmetric matrices (mtype=11) or just for symmetric cases ?
For both.

>>>Does that mean that we should call phase=23 with CGS (iparm(4)=61) ? In that case, how should we deal with possible failures of the CGS iteration ?
Yes, that is right.

>>>Does that mean that upon failure of the CGS, we have to restart with phase=23 with iparm(4)=0 or does PARDISO do that on its own ?
PARDISO should provide the solution automatically, either via CGS or via the direct method, so there is no need to restart.

>>>If the restart is automatic, is the iterative refinement being done at the end or should we call phase=33 just to be sure ?
No additional calls are needed.

I hope this answers your questions!

Regards,
Konstantin

Re: PARDISO- scalability & CGS questions

michel_lestrade — Mon, 11 Jan 2010 19:23:15 GMT

Thanks for the confirmation of the automatic restart. That simplifies the code considerably although it doesn't look like the CGS iteration is failing too often ...

I still haven't gotten everything working the way I want, but my next questions are about performance. It turns out that about 60% of the time spent on my calls to PARDISO goes to rearranging the matrix into CSR format beforehand: the rest of the code uses coordinate format.

At the moment, we are using our own converter but I would like to see if mkl_dcsrcoo() can help speed things along. Unfortunately, it seems like the ja and acsr outputs are not sorted by column number so it is not suitable for PARDISO use. Is there any way to fix that, short of a post-processing sort ?

Also, the help documentation does not specify if we can re-arrange the matrix in-place. Our existing converter does allow for this and it would be a nice-to-have feature, especially for large matrices. Are we really forced to trade away speed for storage size in this kind of conversion ?

Regards,

Michel Lestrade
Crosslight Software
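
Editor's note: the coordinate-to-CSR question above can be illustrated with a minimal sketch, written here in Python for brevity. It is not MKL code and says nothing about how mkl_dcsrcoo() works internally. The function name coo_to_csr_sorted is hypothetical, and the sketch assumes 0-based indices and no duplicate (row, column) entries. It uses two stable counting sorts (first by column, then by row), so the column indices come out ascending within each row in O(nnz + rows + cols) time without a comparison sort. It is an out-of-place conversion and needs one extra integer array of length nnz.

```python
def coo_to_csr_sorted(n_rows, n_cols, rows, cols, vals):
    """Convert 0-based COO triplets to CSR (ia, ja, a).

    Column indices are ascending within each row.
    Duplicate (row, col) entries are kept as-is, not summed.
    """
    nnz = len(vals)

    # Pass 1: stable counting sort of the entry indices by column.
    col_start = [0] * (n_cols + 1)
    for c in cols:
        col_start[c + 1] += 1
    for j in range(n_cols):
        col_start[j + 1] += col_start[j]
    order = [0] * nnz
    next_slot = col_start[:-1]  # slicing makes a copy
    for k in range(nnz):
        c = cols[k]
        order[next_slot[c]] = k
        next_slot[c] += 1

    # Pass 2: stable counting sort by row. Because the entries are
    # visited in column order, each row receives its columns ascending.
    ia = [0] * (n_rows + 1)
    for r in rows:
        ia[r + 1] += 1
    for i in range(n_rows):
        ia[i + 1] += ia[i]
    ja = [0] * nnz
    a = [0.0] * nnz
    next_slot = ia[:-1]
    for k in order:
        r = rows[k]
        p = next_slot[r]
        ja[p] = cols[k]
        a[p] = vals[k]
        next_slot[r] += 1

    return ia, ja, a


# Small usage example with unsorted input triplets.
ia, ja, a = coo_to_csr_sorted(
    3, 3,
    rows=[2, 0, 1, 0, 2],
    cols=[0, 2, 1, 0, 2],
    vals=[5.0, 2.0, 3.0, 1.0, 6.0],
)
```

For a 1-based (Fortran-style) PARDISO call, every entry of ia and ja would need to be offset by one; that step is left out here.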