I am testing cluster_sparse_solver on a suite of 500+ small matrices (< 1000 equations) from the civil, mechanical, and electrical engineering areas. The good news is that 95% of the matrices solve OK. Unfortunately, 5% error out with segmentation faults. However, if I turn on matching (iparm[12] = 1), then these matrices solve OK.
Another strange thing is that the matrices that fail will solve OK if I use only one MPI process (mpirun -np 1 ./myapp). The error only appears with mpirun -np 2 ./myapp.
I would prefer not to turn on matching in my application, since it is time-consuming for large matrices. It often takes longer than the factorization itself for matrices with more than a million equations.
I have created an example that reproduces the issue which you can download from here or see the attached files:
https://calculix.feacluster.com/intel/matching_error.tar
mpiicc -g -DMKL_ILP64 -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_ilp64 -liomp5 -lpthread -lm -ldl cluster_solver_matching.c
// will run ok:
mpirun --check-mpi -np 1 ./a.out
// will error out:
mpirun --check-mpi -np 2 ./a.out
Uncomment this line and re-run (both cases will pass, i.e. -np 1 and -np 2):
// iparm[12] = 1; /* Switch on Maximum Weighted Matching algorithm (default for non-symmetric) */
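For reference, here is a minimal sketch of the kind of setup involved (not the attached reproducer): a hypothetical sketch.c that solves a small, made-up real symmetric indefinite matrix (mtype = -2) with 0-based indexing and centralized input, just to show where the iparm[12] toggle sits. Compile it with the same command as above (plus -o sketch) and run it with mpirun -np 1 or -np 2.
/* sketch.c -- minimal cluster_sparse_solver call (illustration only) */
#include <stdio.h>
#include <mpi.h>
#include "mkl_cluster_sparse_solver.h"
int main(void)
{
    /* 5x5 real symmetric indefinite matrix, upper triangle, 0-based CSR */
    MKL_INT n = 5;
    MKL_INT ia[6] = { 0, 3, 5, 7, 8, 9 };
    MKL_INT ja[9] = { 0, 1, 3,   1, 2,   2, 4,   3,   4 };
    double  a[9]  = { 2.0, -1.0, 1.0,   2.0, -1.0,   2.0, 1.0,   -3.0,   2.0 };
    double  b[5]  = { 1.0, 1.0, 1.0, 1.0, 1.0 }, x[5], ddum = 0.0;
    void   *pt[64] = { 0 };              /* solver internal memory pointer */
    MKL_INT iparm[64] = { 0 };
    MKL_INT maxfct = 1, mnum = 1, mtype = -2, nrhs = 1, msglvl = 1;
    MKL_INT phase, error = 0, idum = 0;
    int     comm, rank;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    comm = MPI_Comm_c2f(MPI_COMM_WORLD); /* solver expects the Fortran handle */
    iparm[0]  = 1;    /* do not use solver defaults */
    iparm[1]  = 2;    /* METIS fill-in reordering */
    iparm[7]  = 2;    /* max iterative refinement steps */
    iparm[9]  = 8;    /* pivot perturbation 1e-8 (symmetric indefinite) */
    iparm[26] = 1;    /* check input matrix for correctness */
    iparm[34] = 1;    /* 0-based (C-style) indexing */
    iparm[39] = 0;    /* centralized input on the master process */
    /* iparm[12] = 1; */ /* toggle Maximum Weighted Matching here */
    phase = 13;       /* analysis + factorization + solve */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                          &idum, &nrhs, iparm, &msglvl, b, x, &comm, &error);
    if (rank == 0) printf("phase 13 returned error = %lld\n", (long long)error);
    phase = -1;       /* release internal memory */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, &ddum, ia, ja,
                          &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &comm, &error);
    MPI_Finalize();
    return (int)error;
}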
Hi!
Thanks for reporting the problem and providing a reproducer! I confirm the issue. In fact, it is present in older releases too.
There is an explanation for why enabling matching helps. Enabling matching internally takes a quite different code path, so I am fairly sure it is not a positive effect of matching itself but rather of taking a different code path for other things.
The fact that it is so much slower with matching can be explained by the same thing.
A somewhat dirty workaround: instead of enabling matching, you can distribute your input matrices across the MPI processes with a nonzero overlap of rows. A minimal sketch of such a split follows below.
I would then expect the execution to take the same code path without hitting the bug, but without using matching, so hopefully without the drastic performance difference.
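To illustrate the idea, here is a minimal sketch of an overlapping row split, assuming the distributed CSR input format of cluster_sparse_solver selected with iparm[39] = 1 (C, 0-based indexing, so iparm[40]/iparm[41] hold each rank's first and last row). The n = 1000 and the one-row overlap are arbitrary examples; in the distributed assembled format, entries that appear on more than one process are summed, so the values in overlapping rows have to be split accordingly (check the documentation for the exact semantics and RHS handling).
/* split_sketch.c -- compute an overlapping row range per MPI rank; each rank
 * would then pass only its rows (local ia/ja/a, global column indices) and set
 * iparm[39] = 1, iparm[40] = first, iparm[41] = last (0-based, iparm[34] = 1) */
#include <stdio.h>
#include <mpi.h>
int main(void)
{
    long long n = 1000;                  /* global number of equations (example) */
    int rank, size;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* even block split, then extend every block except the first one row
     * backwards so that neighbouring domains overlap by one row */
    long long chunk = (n + size - 1) / size;
    long long first = (long long)rank * chunk;
    long long last  = first + chunk - 1;
    if (last > n - 1) last = n - 1;
    if (rank > 0)     first -= 1;        /* one-row overlap with previous rank */
    printf("rank %d: rows %lld..%lld (iparm[40]=%lld, iparm[41]=%lld)\n",
           rank, first, last, first, last);
    MPI_Finalize();
    return 0;
}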
Best,
Kirill
Thanks for the detailed background on the issue! For now, I will just have users set an environment variable to turn on matching if cluster_sparse_solver crashes. See example code:
char *env = getenv("PARDISO_MPI_MATCHING");   /* needs <stdlib.h> for getenv()/atoi() */
if ( env != NULL ) {
    int PARDISO_MPI_MATCHING = atoi ( env );
    if ( PARDISO_MPI_MATCHING == 1 ) { iparm[12] = 1; }   /* switch matching on */
}
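For example, an affected user could then rerun with the variable set, along the lines of:
PARDISO_MPI_MATCHING=1 mpirun -np 2 ./myapp
(Depending on the launcher, the variable may need to be forwarded explicitly, e.g. with Intel MPI's -genv PARDISO_MPI_MATCHING 1.)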
I think distributing the matrix to the different ranks would be quite complicated and messy. It may also add some time, thereby offsetting the penalty of turning on matching. But I will keep it in mind if there is a quick and easy way to do it.
The issue is confirmed and escalated.
This thread is being closed. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only. When the original issue is fixed, we will update this thread accordingly.
thanks,
Gennady
