cluster_sparse_solver causes segmentation fault in MKL 11.3

Chaowen_G_ · ‎09-23-2015

Hi:

My environment: linux64, mpicxx for MVAPICH2 version 2.0b, icpc version 13.1.3 (gcc version 4.7.0 compatibility). To avoid confusion with the MKL library bundled with icpc 13.1.3, I put MKL 11.3 in /home/intel.

I use the following command:

mpic++ cluster_sparse_solverc/source/cl_solver_unsym_c.c -Wl,-rpath=/home/intel/mkl/lib/intel64 -Wl,-rpath=/home/intel/compiler/lib/intel64 -L/home/intel/mkl/lib/intel64 -L/home/intel/compiler/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lmkl_blacs_intelmpi_lp64 -liomp5

to compile, and the resulting program crashes with a segmentation fault when run. With MKL 11.2.4 the same example runs correctly. So is this a bug in MKL 11.3?

MariaZh · ‎09-23-2015

Hi,

Can you please provide us with a reproducer for your case, so that we can investigate it more carefully?

Best regards,
Maria.

Gennady_F_Intel · ‎09-23-2015

Chaowen, thanks for the case. We will check and let you know the results. I do see that you are using a beta version of MVAPICH2 (2.0b) and a fairly old version of the compiler; we validated and checked this functionality with the latest version of icc.

Chaowen_G_ · ‎09-24-2015

I used one of the examples shipped with MKL: I extracted mkl/examples/examples_cluster.tgz and used cluster_sparse_solverc/source/cl_solver_unsym_c.c as the source code.

I use the following commands to run it:

export MV2_DEBUG_SHOW_BACKTRACE=1

export MV2_DEBUG_CORESIZE=unlimited

mpiexec -n 2 ./a.out

It prints:

=== CPARDISO: solving a real nonsymmetric system ===
1-based array indexing is turned ON
CPARDISO double precision computation is turned ON
METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.000024 s
Time spent in reordering of the initial matrix (reorder)         : 0.000355 s
Time spent in symbolic factorization (symbfct)                   : 0.000255 s
Time spent in data preparations for factorization (parlist)      : 0.000010 s
Time spent in allocation of internal data structures (malloc)    : 0.000415 s
Time spent in additional calculations                            : 0.000037 s
Total time spent                                                 : 0.001096 s

Statistics:
===========
Parallel Direct Factorization is running on 2 MPI and 1 OpenMP per MPI process

< Linear system Ax = b >
             number of equations:           5
             number of non-zeros in A:      13
             number of non-zeros in A (%): 52.000000

number of right-hand sides: 1

< Factors L and U >
             number of columns for each panel: 128
             number of independent subgraphs: 0
< Preprocessing with state of the art partitioning metis>
             number of supernodes:                    2
             size of largest supernode:               4
             number of non-zeros in L:                19
             number of non-zeros in U:                2
             number of non-zeros in L+U:              21

Reordering completed ... [compute-1-1.local:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[compute-1-1.local:mpi_rank_0][print_backtrace]   0: /lustre/work/prog/mvapich2-2.0b-composer/lib/libmpich.so.10(print_backtrace+0x17) [0x2aabf5e0cfe7]
[compute-1-1.local:mpi_rank_0][print_backtrace]   1: /lustre/work/prog/mvapich2-2.0b-composer/lib/libmpich.so.10(error_sighandler+0x5a) [0x2aabf5e0cfca]
[compute-1-1.local:mpi_rank_0][print_backtrace]   2: /lib64/libc.so.6() [0x36b8e32920]

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

That means phase 1 (reordering) completes, but phase 2 then causes a segmentation fault.

Gennady_F_Intel · ‎09-28-2015

Chaowen, I checked how this example works with Intel MPI instead of MVAPICH2 (which is actually not officially supported by MKL; you can find the list of supported MPI versions in the Release Notes). The test passed and the results obtained were correct.
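
Because two MKL versions are installed on this machine (the one bundled with icpc 13.1.3 and MKL 11.3 under /home/intel), one thing that may be worth ruling out is that ./a.out resolves a mix of libraries at run time. The following is only a minimal shell sketch, not an official Intel diagnostic: the helper name filter_libs and its default pattern are made up for illustration, and it does nothing more than filter the output of the standard ldd tool.

```shell
# Hypothetical helper: keep only the lines of `ldd` output that mention
# the libraries of interest (MKL, the OpenMP runtime, the MPI library).
# An optional first argument overrides the default extended regex.
filter_libs() {
    pattern="${1:-mkl|iomp5|mpi}"
    grep -E "$pattern"
}

# Usage, with the binary built in this thread:
#   ldd ./a.out | filter_libs
```

If the MKL lines do not all point into /home/intel/mkl/lib/intel64, or libiomp5 does not come from /home/intel/compiler/lib/intel64, the rpath settings from the link command are probably not taking effect as intended. Running the same check through mpiexec should show what the ranks resolve on the compute nodes.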