Beginner

Cluster Sparse Solve Numerical Factorization Segmentation Fault

While attempting to run the 2018 cluster sparse solver examples, I hit a segmentation fault immediately after "Percentage of computed non-zeros for LL^T factorization" reached completion.

I compiled manually with both 64-bit and 32-bit integers, using linker options generated by the MKL Link Line Advisor tool. The bundled makefile fails due to errors during execution.

  • GNU compiler
  • MPICH2
  • OpenMP threading
  • System: Intel(R) Xeon(R) CPU running 64-bit Linux

Compile Line:

mpicc -g -L${MKLROOT}/lib/intel64 -o cl_solvr_unsym_complex cl_solver_unsym_complex_c.c \
 -Wl,--no-as-needed -Wl,--start-group \
 ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a \
 ${MKLROOT}/lib/intel64/libmkl_intel_thread.a \
 ${MKLROOT}/lib/intel64/libmkl_core.a \
 ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a \
 ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a \
 -Wl,--end-group -liomp5 -lpthread -lm -ldl -DMKL_ILP64 -m64 -I${MKLROOT}/include

I ran under valgrind and gdb with little success at uncovering the issue. Below is the message I got from the valgrind run.

Percentage of computed non-zeros for LL^T factorization
 15 %  95 %  100 %
==22595== Jump to the invalid address stated on the next line
==22595==    at 0x0: ???
==22595==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==22595==
==22595==
==22595== Process terminating with default action of signal 11 (SIGSEGV)
==22595==  Bad permissions for mapped region at address 0x0
==22595==    at 0x0: ???
==22595==

This post, https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/303093, is the closest I could find to my issue. In light of Michael Chuvelev's response, I re-checked my compile and link lines and ended up with the same results. As a last attempt before posting here, I linked all relevant libraries under MKL's library folder that use the correct integer width. That did not work either.

Is there a fix in the works for this sort of error, or am I missing a critical step?


Hi Taylor,

Can you provide a reproducer so we can check the issue on our side?

Thanks,

Alex

Beginner

Alex,

Of course. The source code is in the example zip files in Intel's MKL install directory, specifically examples_cluster_c.tgz. Files cl_solver_unsym_distr_c.c and cl_solver_unsym_complex_c.c are the two I have run. The same fault occurs with the corresponding files in examples_cluster_f.tgz.

Thanks,

Taylor

Moderator

I checked this case on my side (RHEL 7, ILP64 mode):

$ make 
mpicc -g -L/opt/intel/compilers_and_libraries_2018.0.128/linux/mkl/lib/intel64  cl_solver_unsym_complex_c.c \
 -Wl,--no-as-needed -Wl,--start-group \
 /opt/intel/compilers_and_libraries_2018.0.128/linux/mkl/lib/intel64/libmkl_intel_ilp64.a \
 /opt/intel/compilers_and_libraries_2018.0.128/linux/mkl/lib/intel64/libmkl_intel_thread.a \
 /opt/intel/compilers_and_libraries_2018.0.128/linux/mkl/lib/intel64/libmkl_core.a \
 /opt/intel/compilers_and_libraries_2018.0.128/linux/mkl/lib/intel64/libmkl_blacs_intelmpi_ilp64.a \
 /opt/intel/compilers_and_libraries_2018.0.128/linux/mkl/lib/intel64/libmkl_scalapack_ilp64.a \
 -Wl,--end-group -liomp5 -lpthread -lm -ldl -DMKL_ILP64 -m64 -I/opt/intel/compilers_and_libraries_2018.0.128/linux/mkl/include

 

$ mpicc -v
mpigcc for the Intel(R) MPI Library 2018 for Linux*

$ mpiexec -n 2 ./a.out

Here is part of the output, trimmed for brevity...

=== CPARDISO: solving a complex nonsymmetric system ===

1-based array indexing is turned ON
CPARDISO double precision computation is turned ON
METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON

..........

...........


The solution of the system is: 
 x [0] =  0.174768  0.021177
 x [1] = -0.176471 -0.294118
 x [2] =  0.049322  0.029598
 x [3] =  0.042981 -0.031409
 x [4] = -0.120859 -0.170860
 x [5] = -0.369347 -0.000861
 x [6] =  0.091610  0.125362
 x [7] =  0.223941  0.139428
Relative residual = 5.551115e-17

 TEST PASSED
Moderator

mpiexec -version
Intel(R) MPI Library for Linux* OS, Version 2018 Build 20170713 (id: 17594)

Beginner

Gennady, 

Thank you for checking the case on your end. Below are the versions I am working with.

$ mpiexec -version
HYDRA build details:
    Version:                                 3.0.4
    Release Date:                            Wed Apr 24 10:08:10 CDT 2013
    CC:                              gcc
    CXX:                             g++
    F77:                             ifort
    F90:                             ifort
$ mpicc -v
mpicc for MPICH version 3.0.4
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
$ gcc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)

Here is the runtime output I get.

$ mpiexec -n 2 cl_solvr_unsym_complex

=== CPARDISO: solving a complex nonsymmetric system ===
1-based array indexing is turned ON
CPARDISO double precision computation is turned ON
METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON


Summary: ( reordering phase )
================

Statistics:
===========
Parallel Direct Factorization is running on 2 MPI and 32 OpenMP per MPI process

< Linear system Ax = b >
             number of equations:           8
             number of non-zeros in A:      20
             number of non-zeros in A (%): 31.250000

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs:  0
< Preprocessing with state of the art partitioning metis>
             number of supernodes:                    5
             size of largest supernode:               4
             number of non-zeros in L:                27
             number of non-zeros in U:                7
             number of non-zeros in L+U:              34

Reordering completed ...
Percentage of computed non-zeros for LL^T factorization
 95 %  100 %

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Thank you for your time,

Taylor


Hi Taylor,

Can you run

cat /proc/cpuinfo

to check which processor you use? I ran the test on my side and it works correctly:

HYDRA build details:
    Version:                                 3.1.4
mpicc -O2 -g -qopenmp -D__cpardiso__ -qopenmp -DMKL_ILP64 -I/nfs/pdx/proj/mkl/MKLQA/mkl_release/mkl2018_20170720/__release_lnx/mkl/include -o cpardiso.exe ./cl_solver_unsym_complex_c.c -Wl,--no-as-needed -Wl,--start-group /nfs/pdx/proj/mkl/MKLQA/mkl_release/mkl2018_20170720/__release_lnx/mkl/lib/intel64/libmkl_intel_ilp64.a /nfs/pdx/proj/mkl/MKLQA/mkl_release/mkl2018_20170720/__release_lnx/mkl/lib/intel64/libmkl_intel_thread.a /nfs/pdx/proj/mkl/MKLQA/mkl_release/mkl2018_20170720/__release_lnx/mkl/lib/intel64/libmkl_core.a /nfs/pdx/proj/mkl/MKLQA/mkl_release/mkl2018_20170720/__release_lnx/mkl/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -lm -lpthread -qopenmp -lifcore
export OMP_NUM_THREADS=16; export KMP_AFFINITY=compact,granularity=fine; mpiexec -n 2 ./cpardiso.exe

=== CPARDISO: solving a complex nonsymmetric system ===
1-based array indexing is turned ON
CPARDISO double precision computation is turned ON
METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON


Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.000005 s
Time spent in reordering of the initial matrix (reorder)         : 0.004953 s
Time spent in symbolic factorization (symbfct)                   : 0.004758 s
Time spent in data preparations for factorization (parlist)      : 0.000001 s
Time spent in allocation of internal data structures (malloc)    : 0.000111 s
Time spent in additional calculations                            : 0.001195 s
Total time spent                                                 : 0.011023 s

Statistics:
===========
Parallel Direct Factorization is running on 2 MPI and 16 OpenMP per MPI process

< Linear system Ax = b >
             number of equations:           8
             number of non-zeros in A:      20
             number of non-zeros in A (%): 31.250000

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs:  0
< Preprocessing with state of the art partitioning metis>
             number of supernodes:                    5
             size of largest supernode:               4
             number of non-zeros in L:                27
             number of non-zeros in U:                7
             number of non-zeros in L+U:              34

Reordering completed ...
Percentage of computed non-zeros for LL^T factorization
 95 %  100 %

=== CPARDISO: solving a complex nonsymmetric system ===
Single-level factorization algorithm is turned ON


Summary: ( factorization phase )
================

Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct)                        : 0.450965 s
Time spent in allocation of internal data structures (malloc)    : 0.000016 s
Time spent in additional calculations                            : 0.000002 s
Total time spent                                                 : 0.450983 s

Statistics:
===========
Parallel Direct Factorization is running on 2 MPI and 16 OpenMP per MPI process

< Linear system Ax = b >
             number of equations:           8
             number of non-zeros in A:      20
             number of non-zeros in A (%): 31.250000

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs:  0
< Preprocessing with state of the art partitioning metis>
             number of supernodes:                    5
             size of largest supernode:               4
             number of non-zeros in L:                27
             number of non-zeros in U:                7
             number of non-zeros in L+U:              34
             gflop   for the numerical factorization: 0.000000

             gflop/s for the numerical factorization: 0.000001


Factorization completed ...
Solving system...
=== CPARDISO: solving a complex nonsymmetric system ===


Summary: ( solution phase )
================

Times:
======
Time spent in direct solver at solve step (solve)                : 0.326002 s
Time spent in additional calculations                            : 0.670013 s
Total time spent                                                 : 0.996015 s

Statistics:
===========
Parallel Direct Factorization is running on 2 MPI and 16 OpenMP per MPI process

< Linear system Ax = b >
             number of equations:           8
             number of non-zeros in A:      20
             number of non-zeros in A (%): 31.250000

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs:  0
< Preprocessing with state of the art partitioning metis>
             number of supernodes:                    5
             size of largest supernode:               4
             number of non-zeros in L:                27
             number of non-zeros in U:                7
             number of non-zeros in L+U:              34
             gflop   for the numerical factorization: 0.000000

             gflop/s for the numerical factorization: 0.000001


The solution of the system is:
 x [0] =  0.174768  0.021177
 x [1] = -0.176471 -0.294118
 x [2] =  0.049322  0.029598
 x [3] =  0.042981 -0.031409
 x [4] = -0.120859 -0.170860
 x [5] = -0.369347 -0.000861
 x [6] =  0.091610  0.125362
 x [7] =  0.223941  0.139428
Relative residual = 7.343435e-17

 TEST PASSED

Beginner

Alex,

processor       : 0                                                                                                
vendor_id       : GenuineIntel                                                                                         
cpu family      : 6                                                                                                    
model           : 45                                                                                                   
model name      : Intel(R) Xeon(R) CPU E5-4650L 0 @ 2.60GHz                                                            
stepping        : 7                                                                                                    
microcode       : 0x710                                                                                                
cpu MHz         : 1291.164                                                                                             
cache size      : 20480 KB                                                                                             
physical id     : 0                                                                                                    
siblings        : 16                                                                                                   
core id         : 0                                                                                                    
cpu cores       : 8                                                                                                    
apicid          : 0                                                                                                    
initial apicid  : 0                                                                                                    
fpu             : yes                                                                                                  
fpu_exception   : yes                                                                                                  
cpuid level     : 13                                                                                                   
wp              : yes                                                                                                  
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat pln pts dtherm tpr_shadow vnmi flexpriority ept vpid xsaveopt
bogomips        : 5199.65                                                                                              
clflush size    : 64                                                                                                   
cache_alignment : 64                                                                                                   
address sizes   : 46 bits physical, 48 bits virtual                                                                    

Thanks,

Taylor

Moderator

Hmm, this is Sandy Bridge, but I checked the same case on Ivy Bridge and see no problems.

$ lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                20
On-line CPU(s) list:   0-19
Thread(s) per core:    1
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
Stepping:              4
CPU MHz:               1699.140
CPU max MHz:           3600.0000
CPU min MHz:           1200.0000
BogoMIPS:              5586.45
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-9
NUMA node1 CPU(s):     10-19
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm arat pln pts

 

Beginner

Gennady and Alex,

Have either of you run into a case where the executable compiles and links, but crashes when it tries to jump to a NULL address? If so, do you know what causes the bad address and how to fix the issue?

-Taylor
