Intel® oneAPI Math Kernel Library

How to use the MPI version of the reordering algorithm (phase 11 of cluster_sparse_solver)?

segmentation_fault
New Contributor I

I am following the guide here to set iparm(2)=10 to use the MPI version of the nested dissection and symbolic factorization algorithms.

 

https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-fortran/top/sparse-solver-routines/parallel-direct-sp-solver-for-clusters-iface/cluster-sparse-solver-iparm-parameter.html

 

However, I am confused about how to set iparm(41) and iparm(42). Say I want to run on two compute nodes (2 MPI processes). Do I set iparm(41) to one and iparm(42) to half the number of equations (n)? I tried doing this but got an error:

 

iparm[40] = 1;                  /* beginning of the input domain */
int upper_bound = (n + 1) / 2;  /* half the number of equations */
printf("upper bound is %i\n", upper_bound);
iparm[41] = upper_bound;        /* end of the input domain */


*** Error in PARDISO  ( sequence_ido,parameters) error_num= 11
*** Input check: preprocessing 2 (out of bounds)
*** Input parameters: inconsistent error= 11 max_fac_store_in: 1
          matrix_number_in: 1 matrix_type_in: -2
          ido_in          : 11 neqns_in      : 3867819
          ia(neqns_in+1)-1: 89286447 nb_in         : 1

ERROR during symbolic factorization: -1[proxy:0:0@cn406] pmi cmd from fd 6: cmd=finalize
[proxy:0:0@cn406] PMI response: cmd=finalize_ack

VidyalathaB_Intel
Moderator

Hi,

 

Thanks for reaching out to us.

 

Could you please let us know which programming language you are working in?

We see that you are following the oneMKL guide for Fortran, but from the error log it seems you are applying it to your C code.

 

>>I am following the guide here to set iparm(2)=10 to use the MPI version of the nested dissection and symbolic factorization algorithms.

 

In C, you need to set iparm[1]=10 in order to use the MPI version of the nested dissection and symbolic factorization algorithms.
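
For concreteness, here is a minimal sketch of the corresponding settings on the C side. This is an illustration only, not taken from the documentation verbatim; it assumes zero-based C indexing of iparm (so Fortran iparm(k) corresponds to C iparm[k-1]) and zero-based ia/ja arrays:

#include "mkl_cluster_sparse_solver.h"

/* Sketch only: the iparm entries relevant to this thread. */
static void init_iparm(MKL_INT iparm[64])
{
    for (int i = 0; i < 64; i++) iparm[i] = 0;
    iparm[0]  = 1;   /* do not use solver defaults; supply iparm values explicitly */
    iparm[1]  = 10;  /* MPI version of nested dissection / symbolic factorization  */
    iparm[34] = 1;   /* zero-based indexing of ia/ja (assuming C-style arrays)     */
    /* the remaining entries would be set as in the shipped C examples */
}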

 

Kindly refer to the guide below:

https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/sparse-solver-routines/parallel-direct-sp-solver-for-clusters-iface/cluster-sparse-solver-iparm-parameter.html

 

If you still face any issues, please provide us with a sample reproducer (and steps, if any) along with your environment details and the MKL version so that we can work on it from our end.

 

Regards,

Vidya.

 

segmentation_fault
New Contributor I

I am coding in C and incorrectly linked to the Fortran guide. In any case, my question is about understanding iparm[40] and iparm[41]. How do I set those based on the number of hosts?

 

However, I have read more of the documentation and think I have figured it out now. I believe there might be an error or omission in the documentation: when setting iparm[1] = 10, one must also set iparm[39] > 0. This is not mentioned in the documentation for iparm[1].

 

 

VidyalathaB_Intel
Moderator

Hi,


>> I believe there might be an error or omission in the documentation. When setting iparm[1] = 10, one must also set iparm[39] > 0 ....This is not mentioned in the documentation for iparm[1]


We are looking into this issue. We will get back to you soon.


Regards,

Vidya.


Gennady_F_Intel
Moderator

iparm[39] == 0 means that the matrix is located on the master node, but the computation happens across many MPI processes.
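
A short sketch of how I read that (the comments are my paraphrase, not the documentation's wording):

/* Centralized input, distributed computation: the full matrix and right-hand
   side are supplied on the master MPI process only, while every process still
   calls cluster_sparse_solver for each phase. */
iparm[39] = 0;

/* For comparison, iparm[39] = 1 or 2 would mean each process supplies only its
   own block of rows, described by iparm[40] and iparm[41]. */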


segmentation_fault
New Contributor I

I understand that if I set iparm[39] = 1, then the matrix will be distributed across the nodes. However, I am confused about how to set iparm[40] and iparm[41] (the beginning and end of the input domain). Here's what I tried in the code block below, but my phase 11 and factorization times doubled and the solution vector was wrong.

 

My reason for wanting to try iparm[1] = 10 (the MPI version of the nested dissection and symbolic factorization algorithms) is that my phase 11 is taking too long; it takes longer than my factorization (phase 22). So I am looking to try anything that can reduce phase 11 times. I already tried setting iparm[1] = 3 to use the parallel version of the nested dissection algorithm, which gave a phase 11 speedup of around 10-20%.

 

 

 

MKL_INT n = 1219161;
MKL_INT ndim = 46687911;

iparm[40] = 1;                  /* beginning of the input domain */
int upper_bound = (n + 1) / 2;  /* half the number of equations */
printf("upper bound is %i\n", upper_bound);
iparm[41] = upper_bound;        /* end of the input domain */

Gennady_F_Intel
Moderator

Regarding iparm[40] and iparm[41] (when iparm[39]==2):

Looking at cl_solver_unsym_distr_c.c under $MKLROOT/examples/cluster_sparse_solverc/ will help you see how to properly distribute the inputs across the nodes.

Regarding the performance of the reordering stage: if the input fits into the RAM of a single node, then the SMP version of the reordering will always be faster than the MPI mode (iparm[1]==10); otherwise, it looks like a real problem which we could investigate.
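
As an illustration only, here is a minimal sketch of how such a per-process domain could be set up. The function name set_local_domain and the even row split are mine, not from the shipped example, and the indexing base of the domain bounds as well as the exact meaning of the iparm[39] values should be verified against cl_solver_unsym_distr_c.c and the documentation:

#include <mpi.h>
#include <stdio.h>
#include "mkl_cluster_sparse_solver.h"

/* Give each MPI process a contiguous, non-overlapping block of rows and
   record it in iparm[40]/iparm[41]; the local ia/ja/a arrays must then
   describe only those rows (ia would have last - first + 2 entries). */
static void set_local_domain(MKL_INT n, MKL_INT iparm[64])
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MKL_INT rows_per_rank = n / size;
    MKL_INT first = (MKL_INT)rank * rows_per_rank + 1;                /* 1-based here */
    MKL_INT last  = (rank == size - 1) ? n : first + rows_per_rank - 1;

    iparm[39] = 2;      /* distributed input format (check 1 vs 2 against the example) */
    iparm[40] = first;  /* beginning of the local input domain */
    iparm[41] = last;   /* end of the local input domain */

    printf("rank %d owns rows %lld..%lld\n", rank,
           (long long)first, (long long)last);
}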



segmentation_fault
New Contributor I

Thanks. I guess I was under the impression that the distributed reordering (iparm[1]=10) would be faster than the SMP version (iparm[1]=3). Since it is not, I will pass on it. Perhaps it's a throwback to the days when machines had limited RAM.

Kirill_V_Intel
Employee

In fact, I slightly disagree with the statement about the distributed reordering vs. the SMP version. These versions are implemented in quite different ways, so I would not confidently say that the SMP version is always faster. I would suggest checking the performance of both, if reordering performance is important.
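
For example, one rough way to compare the two is to time just the reordering call and run the program once with iparm[1] = 3 and once with iparm[1] = 10. A minimal sketch (the helper name time_phase11 is mine, and all solver arguments are assumed to be set up exactly as in the examples discussed above):

#include <mpi.h>
#include "mkl_cluster_sparse_solver.h"

/* Returns the wall-clock time of a single reordering (phase 11) call. */
static double time_phase11(void *pt, MKL_INT *maxfct, MKL_INT *mnum, MKL_INT *mtype,
                           MKL_INT *n, double *a, MKL_INT *ia, MKL_INT *ja,
                           MKL_INT *nrhs, MKL_INT *iparm, MKL_INT *msglvl,
                           int *comm, MKL_INT *error)
{
    MKL_INT phase = 11, idum = 0;   /* dummy perm (iparm[4] == 0 assumed)  */
    double ddum = 0.0;              /* dummy rhs/solution for this phase   */
    double t0 = MPI_Wtime();
    cluster_sparse_solver(pt, maxfct, mnum, mtype, &phase, n, a, ia, ja,
                          &idum, nrhs, iparm, msglvl, &ddum, &ddum, comm, error);
    return MPI_Wtime() - t0;
}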

 

Also, off the top of my head, I don't think that distributed reordering should require iparm[39] != 0. Are you sure it doesn't work with iparm[39] = 0?

 

Thanks,
Kirill

Gennady_F_Intel
Moderator

This issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only. 


