topic Two questions about 'iparm[1]=10' & 'cluster_sparse_solver' speed in Intel® oneAPI Math Kernel Library

Two questions about 'iparm[1]=10' & 'cluster_sparse_solver' speed

YONGHEE_L_ — Fri, 06 Jan 2017 06:39:07 GMT

Hi dear Intel,
I am using 'MKL 2017 Update 1' with 'MPICH 3.1.4'.
I am currently trying to solve a large SPD sparse system with on the order of 10^8 rows.
Because of this, I am struggling to reduce the processing time.

The parameter iparm[1]=10, newly introduced in MKL 2017, looks like it could help me.
However, I cannot find any other examples or instructions for this new parameter.

I tried running the example code shipped with MKL with this new parameter applied, but the program stopped without any message.
Could you please show me a good example that uses 'iparm[1]=10'?

Thank you very much in advance!!!

Regards,
Yong-hee

P.S. When solving a large SPD sparse matrix, 'cluster_sparse_solver_64' with MPI is much slower than the OpenMP run with the same number of active cores. (The OpenMP run uses 'pardiso_64'.)
Is this generally the case when solving a matrix with MPI?
And is there a way to make the solver faster with MPI than with OpenMP for very large matrices?

Gennady_F_Intel — Sun, 08 Jan 2017 04:18:59 GMT

Yong-hee, have you looked at the cl_solver_unsym_distr_c.c example (mklroot\examples\cluster_sparse_solverc\source\ folder)? This example shows the case where the initial data (matrix and rhs) are distributed among several MPI processes, and the final solution is distributed among the MPI processes in the same way as the initial data.

YONGHEE_L_ — Tue, 10 Jan 2017 12:06:25 GMT

At first, I saw the post introducing 'iparm[1]=10' (https://software.intel.com/en-us/articles/distributed-nested-dissection-algorithm-for-intel-mkl-parallel-direct-sparse-solver-for).
And I had misunderstood that 'iparm[1]=10' can separate the matrix without intersections by specifying iparm[40] and iparm[41] on each node of the cluster.
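
For readers following along, here is a minimal sketch of how each MPI rank could pick a non-overlapping block of rows and record it in iparm[40] and iparm[41]. It is not taken from the MKL examples. The helper names `local_row_range` and `set_distributed_iparm` are hypothetical, the one-based row numbering is an assumption, and the value used for iparm[39] (distributed matrix, rhs and solution) is an assumption based on the distributed example mentioned above. Please check all of these against the MKL documentation before relying on them. The sketch only fills the iparm array; it does not call the solver.

```c
#include <assert.h>

/* Hypothetical helper (not part of MKL): split n rows into nranks contiguous,
   non-overlapping blocks and return the first and last row owned by `rank`.
   Rows are numbered from 1 here (assumption: one-based indexing is in use). */
void local_row_range(long long n, int nranks, int rank,
                     long long *first, long long *last)
{
    assert(nranks > 0 && rank >= 0 && rank < nranks && n >= nranks);
    long long base  = n / nranks;   /* rows every rank gets at least */
    long long extra = n % nranks;   /* the first `extra` ranks get one more row */
    long long start = rank * base + (rank < extra ? rank : extra);
    long long count = base + (rank < extra ? 1 : 0);
    *first = start + 1;
    *last  = start + count;
}

/* Hypothetical helper: fill only the iparm entries discussed in this thread.
   All other entries are left as the caller initialized them. */
void set_distributed_iparm(long long iparm[64], long long n,
                           int nranks, int rank)
{
    long long first, last;
    local_row_range(n, nranks, rank, &first, &last);
    iparm[0]  = 1;      /* use the values supplied in iparm, not the defaults */
    iparm[1]  = 10;     /* MPI version of nested dissection (from the thread) */
    iparm[39] = 2;      /* assumption: distributed matrix, rhs and solution */
    iparm[40] = first;  /* first row of this rank's domain */
    iparm[41] = last;   /* last row of this rank's domain */
}
```

With 10 rows on 3 ranks, this gives rows 1 to 4, 5 to 7 and 8 to 10, so no row belongs to two ranks. In a real program, `rank` and `nranks` would come from MPI_Comm_rank and MPI_Comm_size.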

Now the code works properly with the new parameters, and it shows very nice results for memory usage, like the graph in the post introducing 'iparm[1]=10'.
The processing time, however, has increased a lot.
In particular, the reordering time is about five times longer than with 'cl_solver_unsym_distr_c.c' for a matrix with 36 million elements. (140 s -> 680 s for reordering)

Do you think I missed something important that would improve my cluster code?
Thank you in advance for your support, again. :)

Regards,
Yong-hee