Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

cluster_sparse_solver fails on some matrices unless I turn on matching ( iparm[12] = 1 )

segmentation_fault
New Contributor I

I am testing cluster_sparse_solver on a suite of 500+ small matrices ( fewer than 1000 equations ) from the civil, mechanical, and electrical engineering domains. The good news is that 95% of the matrices solve fine. Unfortunately, the remaining 5% error out with segmentation faults. However, if I turn on matching ( iparm[12] = 1 ), these matrices solve fine.

 

Another strange thing: the matrices that fail will solve fine if I use only one MPI process ( mpirun -np 1 ./myapp ). The error only appears with mpirun -np 2 ./myapp.

 

I would prefer not to turn on matching in my application, since it is time consuming for large matrices. For matrices with more than a million equations, it often takes longer than the factorization itself.

 

I have created an example that reproduces the issue, which you can download from here or see the attached files:

https://calculix.feacluster.com/intel/matching_error.tar

 

mpiicc -g -DMKL_ILP64 -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_ilp64 -liomp5 -lpthread -lm -ldl cluster_solver_matching.c

// runs fine:
mpirun --check-mpi -np 1 ./a.out

// errors out:
mpirun --check-mpi -np 2 ./a.out

Uncomment the following line in the source and re-run; both cases ( -np 1 and -np 2 ) will then pass:

//    iparm[12] = 1; /* Switch on Maximum Weighted Matching algorithm (default for non-symmetric) */

 

 


4 Replies
Kirill_V_Intel
Employee

Hi!

 

Thanks for reporting the problem and providing a reproducer! I confirm the issue. In fact, it is present in older releases too.

There is an explanation for why enabling matching helps. Enabling matching is internally bound to quite a different code path, so I'm fairly sure it is not a positive effect of matching itself but rather of taking a different code path for other things.

The same reason also explains why the solver is so much slower with matching enabled.

 

There is a rather dirty workaround: instead of enabling matching, you can distribute your input matrices across the MPI processes with a nonzero overlap in terms of rows.

I then expect the execution to take the same code path without the bug, but without matching being used. Hopefully that avoids the drastic performance difference.

 

Best,
Kirill

segmentation_fault
New Contributor I

Thanks for the detailed background on the issue! For now, I will just have users set an environment variable to turn on matching if cluster_sparse_solver crashes. See example code:

 

    const char *env = getenv("PARDISO_MPI_MATCHING");

    if ( env ) {
        int PARDISO_MPI_MATCHING = atoi( env );
        if ( PARDISO_MPI_MATCHING == 1 ) { iparm[12] = 1; } /* enable matching */
    } // endif

 

I think trying to distribute the matrix across the different ranks would be quite complicated and messy. It may also add some time, offsetting the penalty from turning on matching. But I will keep it in mind if there is a quick and easy way to do it.

Gennady_F_Intel
Moderator

The issue is confirmed and escalated.


Gennady_F_Intel
Moderator

This thread is closing. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community-only. Once the original issue is fixed, we will update this thread accordingly.

thanks,

Gennady


