Community
cancel
Showing results for 
Search instead for 
Did you mean: 
Oleg_S_
Beginner
187 Views

CLUSTER_SPARSE_SOLVER segmentation fault

Hi all!

I am getting forrtl: severe (174): SIGSEGV, segmentation fault occurred during factorization step of CPARDISO. It happens when i try to solve Nonsymmetric system of ~ 8.5 million equations. It usually occurs when factorization is complete at about 80%. Everything is fine if number of equations is ~ 1 million.

My setup phase is

    NRHS = 1
    MAXFCT = 1
    MNUM = 1
    IPARM(1) = 1 ! NO SOLVER DEFAULT
    IPARM(2) = 3 ! FILL-IN REORDERING FROM METIS
    IPARM(4) = 0 ! NO ITERATIVE-DIRECT ALGORITHM
    IPARM(6) = 0 ! =0 SOLUTION ON THE FIRST N COMPONENTS OF X
    IPARM(8) = 2 ! NUMBERS OF ITERATIVE REFINEMENT STEPS
    IPARM(10) = 13 ! PERTURB THE PIVOT ELEMENTS WITH 1E-13
    IPARM(11) = 1 ! USE NONSYMMETRIC PERMUTATION AND SCALING MPS
    IPARM(13) = 1 ! MAXIMUM WEIGHTED MATCHING ALGORITHM IS SWITCHED-ON (DEFAULT FOR NON-SYMMETRIC)
    IPARM(35) = 1 ! ZERO BASE INDEXING
    ERROR = 0 ! INITIALIZE ERROR FLAG
    MSGLVL = 0 ! PRINT STATISTICAL INFORMATION
    MTYPE = 11 ! REAL UNSYMMETRIC

I use

mpiifort for the Intel(R) MPI Library 2018 Update 1 for Linux*
Copyright(C) 2003-2017, Intel Corporation.  All rights reserved.
ifort version 18.0.1

My compilation line is

mpiifort -o mpi -O3 -I${MKLROOT}/include -qopenmp MPI_3D.f90  -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_lp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl

I tried to run on cluster on 8 nodes.  Each node has 2x Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz and 768Gb of RAM.

I also leave here link to my matrix in CSR3 format. Attention: unzipped files are more than 2 Gb.

https://drive.google.com/file/d/1HOeg-eF0iAlSx533AOi91pfCA6YMFHNd/view?usp=sharing

0 Kudos
12 Replies
Kirill_V_Intel
Employee
187 Views

Hello Oleg!

A quick question, are you sure that the factors for your matrix fit into 32-bit integers? The matrix seems to be quite huge. If you're unsure about that, can you check and share with us the output from iparm(15)-iparm(18) after phase 1? You'll need to initialize iparm(18) < 0 to get the report for the number of nonzeros in the factors.

If the number of nonzeros in the factors exceed the limit of 32-bit integers, you'll need to switch to the ILP64 mode (64-bit integers) which would change your compilation and link lines, following the MKL Link Line Advisor (https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/).

Thanks,
Kirill

 

Oleg_S_
Beginner
187 Views

Hello Kirill!

Thanks for quick response! Here is some summary about my matrix:

< Linear system Ax = b >
             number of equations:           8470528
             number of non-zeros in A:      68307067
             number of non-zeros in A (%): 0.000095

             number of right-hand sides:    1

And here is something strange when i run phase = 11 and use different number of MPI processes. iparm(18) always return negative value and iparm(15)-iparm(17) are different:

np = 1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs:  0
             number of supernodes:                    4396346
             size of largest supernode:               48712
             number of non-zeros in L:                22704680039
             number of non-zeros in U:                22536224271
             number of non-zeros in L+U:              45240904310
MY RANK IS 0 IPARM(15) = 18771844
MY RANK IS 0 IPARM(16) = 13991103
MY RANK IS 0 IPARM(17) = 355963495
MY RANK IS 0 IPARM(18) = -2003735946

np = 2

< Factors L and U >
             number of columns for each panel: 128
             number of independent subgraphs:  0
             number of supernodes:                    4382494
             size of largest supernode:               48712
             number of non-zeros in L:                23015897339
             number of non-zeros in U:                22776795225
             number of non-zeros in L+U:              45792692564
MY RANK IS 0 IPARM(15) = 17464877
MY RANK IS 0 IPARM(16) = 13395041
MY RANK IS 0 IPARM(17) = 181358623
MY RANK IS 0 IPARM(18) = -1451947692
MY RANK IS 1 IPARM(15) = 0
MY RANK IS 1 IPARM(16) = 0
MY RANK IS 1 IPARM(17) = 0
MY RANK IS 1 IPARM(18) = -1

np = 4

< Factors L and U >
             number of columns for each panel: 128
             number of independent subgraphs:  0
             number of supernodes:                    4389074
             size of largest supernode:               48712
             number of non-zeros in L:                22824983785
             number of non-zeros in U:                22584778823
             number of non-zeros in L+U:              45409762608
MY RANK IS 0 IPARM(15) = 17376019
MY RANK IS 0 IPARM(16) = 13355027
MY RANK IS 0 IPARM(17) = 91915182
MY RANK IS 0 IPARM(18) = -1834877648
MY RANK IS 1 IPARM(15) = 0
MY RANK IS 1 IPARM(16) = 0
MY RANK IS 1 IPARM(17) = 0
MY RANK IS 1 IPARM(18) = -1
MY RANK IS 3 IPARM(15) = 0
MY RANK IS 3 IPARM(16) = 0
MY RANK IS 3 IPARM(17) = 0
MY RANK IS 3 IPARM(18) = -1
MY RANK IS 2 IPARM(15) = 0
MY RANK IS 2 IPARM(16) = 0
MY RANK IS 2 IPARM(17) = 0
MY RANK IS 2 IPARM(18) = -1

 

Oleg_S_
Beginner
187 Views

Some update. I tried to use 64 bit version. MSGLVL = 1 showed that factorization was at 100%, then i got another error:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
mpi                0000000000449C3D  Unknown               Unknown  Unknown
libpthread-2.17.s  00007F43A8EBC5D0  Unknown               Unknown  Unknown
libmkl_core.so     00007F43A9827C68  mkl_serv_free         Unknown  Unknown
libmkl_intel_thre  00007F43AD121B95  mkl_pds_blkl_cpar     Unknown  Unknown
libmkl_core.so     00007F43AACFB6EC  mkl_pds_factorize     Unknown  Unknown
libmkl_core.so     00007F43AACF5D77  mkl_pds_do_all_cp     Unknown  Unknown
libmkl_core.so     00007F43AACE7A04  mkl_pds_cpardiso_     Unknown  Unknown
libmkl_core.so     00007F43AACE51DD  mkl_pds_cluster_s     Unknown  Unknown
mpi                000000000043187A  Unknown               Unknown  Unknown
mpi                0000000000439211  Unknown               Unknown  Unknown
mpi                000000000040395E  Unknown               Unknown  Unknown
libc-2.17.so       00007F43A73C63D5  __libc_start_main     Unknown  Unknown
mpi                0000000000403869  Unknown               Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
mpi                0000000000449C3D  Unknown               Unknown  Unknown
libpthread-2.17.s  00007FA6B53B25D0  Unknown               Unknown  Unknown
libmkl_core.so     00007FA6B5D1DC68  mkl_serv_free         Unknown  Unknown
libmkl_intel_thre  00007FA6B9617B95  mkl_pds_blkl_cpar     Unknown  Unknown
libmkl_core.so     00007FA6B71E4C13  mkl_pds_factorize     Unknown  Unknown
libmkl_core.so     00007FA6B71DAFD2  mkl_pds_cluster_s     Unknown  Unknown
mpi                000000000043187A  Unknown               Unknown  Unknown
mpi                0000000000439211  Unknown               Unknown  Unknown
mpi                000000000040395E  Unknown               Unknown  Unknown
libc-2.17.so       00007FA6B38BC3D5  __libc_start_main     Unknown  Unknown
mpi                0000000000403869  Unknown               Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
mpi                0000000000449C3D  Unknown               Unknown  Unknown
libpthread-2.17.s  00007FA804C485D0  Unknown               Unknown  Unknown
libmkl_core.so     00007FA8055B3C68  mkl_serv_free         Unknown  Unknown
libmkl_intel_thre  00007FA808EADB95  mkl_pds_blkl_cpar     Unknown  Unknown
libmkl_core.so     00007FA806A7AC13  mkl_pds_factorize     Unknown  Unknown
libmkl_core.so     00007FA806A70FD2  mkl_pds_cluster_s     Unknown  Unknown
mpi                000000000043187A  Unknown               Unknown  Unknown
mpi                0000000000439211  Unknown               Unknown  Unknown
mpi                000000000040395E  Unknown               Unknown  Unknown
libc-2.17.so       00007FA8031523D5  __libc_start_main     Unknown  Unknown
mpi                0000000000403869  Unknown               Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
mpi                0000000000449C3D  Unknown               Unknown  Unknown
libpthread-2.17.s  00007FE2983905D0  Unknown               Unknown  Unknown
libmkl_core.so     00007FE298CFBC68  mkl_serv_free         Unknown  Unknown
libmkl_intel_thre  00007FE29C5F5B95  mkl_pds_blkl_cpar     Unknown  Unknown
libmkl_core.so     00007FE29A1C2C13  mkl_pds_factorize     Unknown  Unknown
libmkl_core.so     00007FE29A1B8FD2  mkl_pds_cluster_s     Unknown  Unknown
mpi                000000000043187A  Unknown               Unknown  Unknown
mpi                0000000000439211  Unknown               Unknown  Unknown
mpi                000000000040395E  Unknown               Unknown  Unknown
libc-2.17.so       00007FE29689A3D5  __libc_start_main     Unknown  Unknown
mpi                0000000000403869  Unknown               Unknown  Unknown

 

I compiled this with

mpiifort -o mpi -O0 -g -mkl=parallel  MPI_3D.f90 -lmkl_blacs_intelmpi_ilp64 -liomp5 -lpthread -lm -ldl

At static linking almost everything is unknown.

Kirill_V_Intel
Employee
187 Views

Hello again!

1) I'm afraid I still don't like your link line. The option "-mkl=parallel" links against the LP64 libraries so in your last try you actually had a mix of lp64 (from -mkl=parallel) and ilp64 libraries (blacs library which you have).

Thus, I strongly recommend that you stop using "-mkl" option and instead try to follow our Link line Advisor:
https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/.

If you check there, it says that you need to add -i8 flag to your compilation line and use the link line with options
-Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl

2) It's ok that for different number of processes you see different amount of memory required in iparm, some memory allocation depend on the number of processes.

3) As I see, the #nnz in the factors for your matrix do exceed the 32-bit integers. But our sparse solver should have handled that without ILP64 mode. However, now I start thinking whether your input matrix fit into the limits. Can you tell us also whether your input matrix is distributed among MPIs (iparm(40)=?)? What is the total number of nonzeros in the matrix A and the local numbers (for each process), for the configuration which crashed?

Summarizing that: could you try the ilp64 mode with correct compilation and link lines and could you also give us some more information about the matrix?

I hope this helps!

Best,
Kirill

Oleg_S_
Beginner
187 Views

Hello Kirill!

Thank you for helping me. 

1) Now compilation goes static:

mpiifort -o mpi -O3 -i8 -I${MKLROOT}/include -qopenmp MPI_3D.f90  -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl

output finishes at:

Percentage of computed non-zeros for LL^T factorization
 1 %  2 %  3 %  4 %  5 %  6 %  11 %  12 %  13 %  14 %  15 %  19 %  21 %  22 %  23 %  24 %  29 %  30 %  31 %  32 %  34 %  38 %  39 %  40 %  41 %  44 %  46 %  47 %  48 %  49 %  50 %  51 %  53 %  58 %  59 %  60 %  61 %  62 %  63 %  64 %  65 %  66 %  67 %  68 %  69 %  70 %  73 %  75 %  77 %  78 %  80 %  83 %  85 %  88 %  89 %  90 %  91 %  92 %  93 %  94 %  95 %  96 %  97 %  98 %  99 %  100 % 
[mpiexec@node79] Exit codes: [node79] 44544
[node80] 0
[node81] 0
[node82] 0

and errors are:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
mpi                0000000004A9CC2D  Unknown               Unknown  Unknown
libpthread-2.17.s  00007F93A09555D0  Unknown               Unknown  Unknown
mpi                000000000045F988  Unknown               Unknown  Unknown
mpi                00000000008D61B5  Unknown               Unknown  Unknown
mpi                000000000059F7AC  Unknown               Unknown  Unknown
mpi                0000000000504417  Unknown               Unknown  Unknown
mpi                000000000048A1B4  Unknown               Unknown  Unknown
mpi                000000000042D09D  Unknown               Unknown  Unknown
mpi                0000000000424F68  Unknown               Unknown  Unknown
mpi                0000000000407F0C  Unknown               Unknown  Unknown
mpi                0000000000406C7C  Unknown               Unknown  Unknown
mpi                000000000040681E  Unknown               Unknown  Unknown
libc-2.17.so       00007F939EC2A3D5  __libc_start_main     Unknown  Unknown
mpi                0000000000406729  Unknown               Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
mpi                0000000004A9CC2D  Unknown               Unknown  Unknown
libpthread-2.17.s  00007F97DAD685D0  Unknown               Unknown  Unknown
mpi                000000000045F988  Unknown               Unknown  Unknown
mpi                00000000008D61B5  Unknown               Unknown  Unknown
mpi                00000000004886F3  Unknown               Unknown  Unknown
mpi                000000000042CE92  Unknown               Unknown  Unknown
mpi                0000000000424F68  Unknown               Unknown  Unknown
mpi                0000000000407F0C  Unknown               Unknown  Unknown
mpi                0000000000406C7C  Unknown               Unknown  Unknown
mpi                000000000040681E  Unknown               Unknown  Unknown
libc-2.17.so       00007F97D903D3D5  __libc_start_main     Unknown  Unknown
mpi                0000000000406729  Unknown               Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
mpi                0000000004A9CC2D  Unknown               Unknown  Unknown
libpthread-2.17.s  00007F7BE27605D0  Unknown               Unknown  Unknown
mpi                000000000045F988  Unknown               Unknown  Unknown
mpi                00000000008D61B5  Unknown               Unknown  Unknown
mpi                00000000004886F3  Unknown               Unknown  Unknown
mpi                000000000042CE92  Unknown               Unknown  Unknown
mpi                0000000000424F68  Unknown               Unknown  Unknown
mpi                0000000000407F0C  Unknown               Unknown  Unknown
mpi                0000000000406C7C  Unknown               Unknown  Unknown
mpi                000000000040681E  Unknown               Unknown  Unknown
libc-2.17.so       00007F7BE0A353D5  __libc_start_main     Unknown  Unknown
mpi                0000000000406729  Unknown               Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
mpi                0000000004A9CC2D  Unknown               Unknown  Unknown
libpthread-2.17.s  00007F0D5EDD55D0  Unknown               Unknown  Unknown
mpi                000000000045F988  Unknown               Unknown  Unknown
mpi                00000000008D61B5  Unknown               Unknown  Unknown
mpi                00000000004886F3  Unknown               Unknown  Unknown
mpi                000000000042CE92  Unknown               Unknown  Unknown
mpi                0000000000424F68  Unknown               Unknown  Unknown
mpi                0000000000407F0C  Unknown               Unknown  Unknown
mpi                0000000000406C7C  Unknown               Unknown  Unknown
mpi                000000000040681E  Unknown               Unknown  Unknown
libc-2.17.so       00007F0D5D0AA3D5  __libc_start_main     Unknown  Unknown
mpi                0000000000406729  Unknown               Unknown  Unknown

 

2) It is strange that in post 3 there are different numbers of non zeroes in L,U and L+U matrixes.

3) iparm(40) = 0, matrix is stored at rank 0. I already showed you some info about my matrix:

< Linear system Ax = b >
             number of equations:           8470528
             number of non-zeros in A:      68307067
             number of non-zeros in A (%): 0.000095

             number of right-hand sides:    1

I can add that is ill conditioned, it has a lot of zeroes on main diagonal. The portrait of matrix of a lower size is:

matr3dd.jpg

 

Gennady_F_Intel
Moderator
187 Views

Oleg, it looks like the real issue. You may submit the ticket to the Intel Online service center where you may share all your input ( reproducer, input matrix and others details ) regard to this problem.

Gennady_F_Intel
Moderator
187 Views

OSC - https://supporttickets.intel.com/?lang=en-US

and here is how to create the access: https://software.intel.com/en-us/articles/how-to-create-a-support-request-at-online-service-center

Oleg_S_
Beginner
187 Views

Well, this is disappointing.

Oleg_S_
Beginner
187 Views

Our records indicate that you do not have a supported product associated with your account. You need a supported product in order to qualify for Priority Support. As such, this ticket is being closed.

Gennady_F_Intel
Moderator
187 Views

it means you don't have a valid COM license. You may give us the reproducer and input matrix for your case from this thread or just communicate with me via email if you don't like to share all details onto the open forum.

Oleg_S_
Beginner
187 Views

Hello, Gennady

I didn't find your email. Can you share it with me?

Gennady_F_Intel
Moderator
187 Views

yes. i will