Two questions about 'iparm[1]=10' & 'cluster_sparse_solver' speed

YONGHEE_L_ · ‎01-05-2017

Hi dear Intel,
I'm a user of 'MKL2017 update 1' and 'MPICH3.1.4'.
Lately I have been trying to solve a large SPD sparse matrix with on the order of 10^8 rows,
so I am struggling to reduce the processing time.

In MKL 2017, the newly introduced parameter, iparm[1]=10, looks like it could help me.
However, I cannot find any example or documentation for this new parameter.

I tried running the example code shipped with MKL with this new parameter applied, but the program stopped with no message.
Could you please show me a good example that uses 'iparm[1]=10'?

Thank you very much in advance!!!

Regards,
Yong-hee

P.S. When solving a large SPD sparse matrix, 'cluster_sparse_solver_64' with MPI is much slower than OpenMP with the same number of active cores. (The OpenMP run uses 'pardiso_64'.)
Is this generally the case when solving matrices with MPI?
And is there a way to make the MPI solver faster than OpenMP for very large matrices?

Gennady_F_Intel · ‎01-07-2017

Yong-hee, have you looked at the cl_solver_unsym_distr_c.c example (mklroot\examples\cluster_sparse_solverc\source\ folder)? This example shows the case where the initial data (matrix and rhs) are distributed among several MPI processes and the final solution is distributed among the MPI processes in the same way as the initial data.

YONGHEE_L_ · ‎01-10-2017

At first, I read the post introducing 'iparm[1]=10' (https://software.intel.com/en-us/articles/distributed-nested-dissection-algorithm-for-intel-mkl-parallel-direct-sparse-solver-for),
and I misunderstood it: I thought 'iparm[1]=10' could separate the matrix without intersections by specifying iparm[40] and iparm[41] on each node of the cluster.
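
For anyone reading later, here is a minimal sketch of how each MPI rank could be given a contiguous row range with no overlap, which is how I understand the iparm[40]/iparm[41] domain bounds are meant to be used with 'iparm[1]=10'. The helpers row_block and set_distributed_iparm are hypothetical names of my own, not MKL functions. The sketch assumes one-based row indexing, and it assumes iparm[39]=2 selects the distributed input format as in the cl_solver_unsym_distr_c.c example; please check both against the MKL documentation. No MKL or MPI call is made here, only the index arithmetic.

```c
#include <assert.h>

/* Hypothetical helper (not part of MKL): split n rows into nranks
   contiguous, non-overlapping blocks and return the one-based first
   and last row owned by `rank`. */
static void row_block(long long n, int nranks, int rank,
                      long long *first, long long *last)
{
    assert(nranks > 0 && n >= nranks && rank >= 0 && rank < nranks);
    long long base  = n / nranks;  /* rows that every rank receives */
    long long extra = n % nranks;  /* the first `extra` ranks get one more */
    long long start = rank * base + (rank < extra ? rank : extra);
    long long count = base + (rank < extra ? 1 : 0);
    *first = start + 1;            /* one-based indexing is assumed */
    *last  = start + count;
}

/* Hypothetical helper: fill the iparm entries discussed in this thread.
   The caller owns the array and passes it to the solver afterwards. */
static void set_distributed_iparm(long long *iparm, long long n,
                                  int nranks, int rank)
{
    long long first, last;
    row_block(n, nranks, rank, &first, &last);
    iparm[1]  = 10;     /* distributed reordering, as named in the thread */
    iparm[39] = 2;      /* assumption: distributed matrix/rhs/solution */
    iparm[40] = first;  /* first row of this rank's domain */
    iparm[41] = last;   /* last row of this rank's domain */
}
```

With 10 rows on 3 ranks this assigns rows 1-4, 5-7 and 8-10, so no row belongs to two ranks. In a real program nranks and rank would come from MPI_Comm_size and MPI_Comm_rank, and each rank would then supply only its own rows of the matrix and rhs.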

Now the code works properly with the new parameters, and it shows very good results for memory usage, like the graph in the post introducing 'iparm[1]=10'.
The processing time, however, has increased considerably.
In particular, the reordering time is roughly five times that of 'cl_solver_unsym_distr_c.c' for a matrix with 36 million elements. (140 s -> 680 s @ reordering)

Do you think I missed something important that would improve my cluster code?
Thank you in advance for your support, again. :)

Regards,
Yong-hee