Solved: cluster_sparse_solver_64

Pan11 · ‎04-26-2022

I am finding that cluster_sparse_solver_64 fails to compute the correct solution when the number of nonzeros exceeds 2147483647. Attached is a modification of the cl_solver_sym_f.f example to demonstrate the issue. The example is modified to use a full symmetric matrix of size n. For n=65000 (nnz=2112532500) the solver succeeds. For n=67500 (nnz=2278158750) it fails. The tests pass or fail in the same way regardless of mpisize.
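A quick sanity check of these figures, as a sketch only: it assumes the modified example stores one triangle of the dense symmetric matrix with the diagonal included, which reproduces the nnz reported for n=67500. The 2147483647 threshold is the largest signed 32-bit integer, so the sketch also shows what the count would look like if it were truncated to 32 bits somewhere. That is one plausible failure mode, not something confirmed in this thread, and the helper names are illustrative.

```python
# Sketch: nonzero count of a dense symmetric matrix stored as one triangle
# (diagonal included), compared with the signed 32-bit integer limit.
INT32_MAX = 2**31 - 1  # 2147483647


def triangle_nnz(n: int) -> int:
    """Number of entries in one triangle, diagonal included, of an n x n matrix."""
    return n * (n + 1) // 2


def wrap_int32(value: int) -> int:
    """Value that results if the count is truncated to a signed 32-bit integer."""
    return (value + 2**31) % 2**32 - 2**31


for n in (60000, 65000, 67500):
    nnz = triangle_nnz(n)
    # Columns: n, nnz, whether nnz exceeds the 32-bit limit, 32-bit wrapped value
    print(n, nnz, nnz > INT32_MAX, wrap_int32(nnz))
```

Only the n=67500 case crosses the limit, and its wrapped value is negative.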

The tests require about 200 GB of RAM (if run on a single node) and take several hours to run.

Thanks

Khang_N_Intel · ‎05-23-2022

Hi Pan,

I saw that you filed a ticket for this same issue in the online service center.

I will communicate with you in the online service center.

This thread will be closed.

Best regards,

Khang

VidyalathaB_Intel · ‎04-27-2022

Hi Michaleris,

Thanks for reaching out to us.

Could you please let us know the MKL version you are working with?

We suggest trying the new oneMKL 2022.0, in case you are using an older version, to see if it helps.

If the issue persists with the latest MKL, please send us the steps to reproduce it (the commands to compile and run) so that we can check it on our end.

Regards,

Vidya.

Pan11 · ‎04-27-2022

Hi Vidya, thanks for following up.

I have used version 2020.4.304 with the compile options below:

mpiifort -O4 -fpp -qopenmp -c cl_solver_sym_f.f

mpiifort -L/opt/intel/mkl/lib/intel64 cl_solver_sym_f.o -Wl,--start-group "/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64"/libmkl_blacs_intelmpi_lp64.a "/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64"/libmkl_intel_lp64.a "/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64"/libmkl_core.a "/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64"/libmkl_intel_thread.a -Wl,--end-group -L "/opt/intel/compilers_and_libraries/linux/mkl/../compiler/lib/intel64" -liomp5 -mt_mpi -lm -o pdstest

The test was run on a Dell 7920 running the latest version of Red Hat 8, with:

mpirun -ppn 2 pdstest

Will compile again with oneMKL 2022.0 and report in a day or two.

Thanks, Pan

VidyalathaB_Intel · ‎04-27-2022

Hi Pan,

>>Will compile again with oneMKL 2022.0 and report in a day or two

Sure. You can get oneMKL 2022 by downloading the oneAPI Base Toolkit, and the latest compilers by downloading the oneAPI HPC Toolkit.

Here are the download links:

oneAPI Base Toolkit:

https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html

oneAPI HPC Toolkit:

https://www.intel.com/content/www/us/en/developer/tools/oneapi/hpc-toolkit-download.html

This time, you can compare against the cl_solver_sym_f example that ships with the new version and see if there are any changes. Additionally, you can use the Link Line Advisor to get the recommended libraries for your particular use case.

Here is the link:

https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html#gs.yqjzoe

Regards,

Vidya.

Pan11 · ‎04-27-2022

Thanks Vidya,

Just compiled and ran with oneMKL 2022.0. It crashed with the following message:

n= 67500
nnz= 2278158750
n= 67500
nnz= 2278158750
Abort(2169359) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(453).........................: MPI_Bcast(buf=0xa109880, count=38564, MPI_LONG_LONG_INT, root=1, comm=comm=0x84000005) failed
PMPI_Bcast(438).........................:
MPIDI_Bcast_intra_composition_delta(603):
MPIDI_POSIX_mpi_bcast(131)..............:
MPIR_Bcast_intra_binomial(133)..........: message sizes do not match across processes in the collective routine: Received 151600 but expected 308512
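The sizes in this error stack can be cross-checked with a little arithmetic. This is a sketch that assumes MPI_LONG_LONG_INT is 8 bytes and that the "Received" and "expected" figures are byte counts; the variable names are illustrative.

```python
# Sketch: relate the element count in the MPI_Bcast call to the reported sizes.
LONG_LONG_BYTES = 8  # assumed size of MPI_LONG_LONG_INT on this platform

bcast_count = 38564      # count argument shown in the error stack
expected_bytes = 308512  # "expected" figure from the error message
received_bytes = 151600  # "Received" figure from the error message

# The expected size is exactly count * 8 bytes.
print(bcast_count * LONG_LONG_BYTES == expected_bytes)

# The received size corresponds to a smaller whole number of 8-byte elements.
received_elements, leftover = divmod(received_bytes, LONG_LONG_BYTES)
print(received_elements, leftover)
```

Under these assumptions the ranks disagree on how many 64-bit elements are being broadcast, rather than on the element type.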

Pan11 · ‎04-27-2022

With oneAPI, even the original cl_solver_sym_f.f fails, even when running one process:

*** Error in PARDISO memory allocation: FACT_ADR, size to allocate: 141659056 bytes
The local (internal) PARDISO version is : 176
Minimum degree algorithm at reorder step is turned ON
Time spent in symbolic factorization (symbfct) :
Total time spent :

Parallel METIS algorithm at reorder step is turned ON
=== (null): solving a Hermitian indefinite system ===

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
stest 000000000867BCBA Unknown Unknown Unknown
libpthread-2.28.s 00007F1691A88C20 Unknown Unknown Unknown
libc-2.28.so 00007F169180E767 Unknown Unknown Unknown
libc-2.28.so 00007F16917030AF _IO_vfprintf Unknown Unknown
libc-2.28.so 00007F169172A784 vsnprintf Unknown Unknown
stest 00000000004B38BE Unknown Unknown Unknown
stest 00000000004A4593 Unknown Unknown Unknown
stest 000000000043D0E9 Unknown Unknown Unknown
stest 000000000042E14A Unknown Unknown Unknown
stest 000000000040D19B Unknown Unknown Unknown
stest 0000000000407EB5 Unknown Unknown Unknown
stest 0000000000406229 Unknown Unknown Unknown
stest 0000000000406022 Unknown Unknown Unknown
libc-2.28.so 00007F16916D4493 __libc_start_main Unknown Unknown
stest 0000000000405F2E Unknown Unknown Unknown

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 48948 RUNNING AT panopt
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

Pan11 · ‎04-27-2022

Vidya, please ignore the previous message. I had linked to the old libraries. Linking to the new ones does not crash. I will come back tomorrow with the results of the test.

Thanks, Pan

Pan11 · ‎04-27-2022

Confirming that with oneAPI the test fails when the number of nonzeros exceeds 2147483647, the same as with the older versions. You should be able to replicate this at your end.

VidyalathaB_Intel · ‎04-28-2022

Hi Pan,

Thanks for letting us know.

We are working on your issue. We will get back to you soon with an update.

Regards,

Vidya.

Pan11 · ‎04-29-2022

Hi Vidya,

Thanks for looking into this. One more thing to add: for n=65000, nnz=2112532500, the test compiled with oneAPI results in a segfault during phase 22. With the old compiler it succeeds. So the solver got worse with oneAPI: now it even crashes for nnz less than 2147483647.

Regards, Pan

Pan11 · ‎05-09-2022

Hi Vidya, any progress on this? Have you been able to replicate the issue at your end? Thanks, Pan

Khang_N_Intel · ‎05-10-2022

Hi Pan,

I am trying to find a system with 200GB of RAM to see if I can reproduce the issue that you mentioned.

If I want to test the code on a cluster, are there any specific cluster requirements for the code to run?

I also noticed the following:

1) You link against the 32-bit-integer (LP64) versions of the libraries, libmkl_intel_lp64.a and libmkl_blacs_intelmpi_lp64.a.

Let's switch to the 64-bit-integer (ILP64) versions instead: libmkl_intel_ilp64.a and libmkl_blacs_intelmpi_ilp64.a.

2) There seems to be a problem with the stack size. Could you try increasing the stack size with the ulimit command?

Best,

Khang

Pan11 · ‎05-11-2022

Hi Khang:

It should not be so hard to find a machine with 200 GB of RAM. cluster_sparse_solver_64 is intended for solving very large systems that could require terabytes of RAM across several nodes. I set up the test to use a full matrix, which requires the smallest memory consumption, so that one can easily replicate the issues. You can try n=60000, which results in nnz=1800030000 and sadly still crashes with the oneAPI compiler.

The oneAPI version is so bad that it crashes when running one process on a single node. I tested the older version on a small cluster running 10G TCP.

libmkl_intel_ilp64.a is not acceptable as it would slow down the rest of my code and further increase memory consumption.

The tests were already performed with ulimit unlimited

Thanks, Pan

Khang_N_Intel · ‎05-13-2022

Hi Pan,

I just want to let you know that I am waiting for permission to access the cluster system in order to confirm your issue.

In the meantime, can you tell me the version number of the Intel MPI that you are using?

Also, do you still see this same issue when changing the number of ranks to greater than 2?

Best,

Khang

Pan11 · ‎05-13-2022

which mpirun
/opt/intel/oneapi/mpi/2021.5.1/bin/mpirun

which mpiifort
/opt/intel/oneapi/mpi/2021.5.1/bin/mpiifort

Waiting for results from mpirun -ppn 4.

Nevertheless, segfault with 1 or 2 ranks is still not acceptable.

Can you put me in contact with the folks who maintain and develop pardiso/cluster_sparse_solver? I could provide them with quite a lot of feedback on the state of these routines.

Thanks, Pan

Pan11 · ‎05-14-2022

[pan@panopt v4]$ mpirun -ppn 4 pdstest
rank= 2 n= 60000
rank= 2 nnz= 1800030000
rank= 3 n= 60000
rank= 3 nnz= 1800030000
rank= 1 n= 60000
rank= 1 nnz= 1800030000
rank= 0 n= 60000
rank= 0 nnz= 1800030000
Memory allocated on phase 11 on Rank # 0 107523.4479 MB
Memory allocated on phase 11 on Rank # 1 77333.9227 MB
Memory allocated on phase 11 on Rank # 2 72119.4802 MB
Memory allocated on phase 11 on Rank # 3 66969.6388 MB
Reordering completed ...
Number of non-zeros in L on Rank # 0 1368775758
Number of non-zeros in U on Rank # 0 1
Number of non-zeros in L on Rank # 1 341990777
Number of non-zeros in U on Rank # 1 1
Number of non-zeros in L on Rank # 2 68154240
Number of non-zeros in U on Rank # 2 1
Number of non-zeros in L on Rank # 3 22996740
Number of non-zeros in U on Rank # 3 1
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
pdstest 00000000070967BA Unknown Unknown Unknown
libpthread-2.28.s 00007F3838BE0C20 Unknown Unknown Unknown
pdstest 0000000000C5B2C3 Unknown Unknown Unknown
libiomp5.so 00007F38388ECBB3 __kmp_invoke_micr Unknown Unknown
libiomp5.so 00007F3838868903 Unknown Unknown Unknown
libiomp5.so 00007F3838867912 Unknown Unknown Unknown
libiomp5.so 00007F38388ED83C Unknown Unknown Unknown
libpthread-2.28.s 00007F3838BD617A Unknown Unknown Unknown
libc-2.28.so 00007F383616DDF3 clone Unknown Unknown

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 396051 RUNNING AT panopt
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 396052 RUNNING AT panopt
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 396053 RUNNING AT panopt
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
[pan@panopt v4]$

Khang_N_Intel · ‎05-18-2022

Hi Pan,

I just tested your code with the latest version of oneMKL, 2022.1, and was able to confirm the issue.

I ran the code with 2 ranks and it crashed with the error: "Bad termination of one of your application..."

The developer will look into this issue.

I will let you know what we find out about this issue.

Best regards,

Khang

Pan11 · ‎05-19-2022

Hi Khang, thanks for confirming.

I would appreciate it if you could forward the following to the developer:

1) The segfault is a new issue that came up with the oneMKL versions of mpirun.

2) You can run my code as a single process without mpirun to avoid the segfault, and you will find that cluster_sparse_solver_64 fails to provide the correct solution for nnz >= 2278158750:

pdstest
rank= 0 n= 67500
rank= 0 nnz= 2278158750
Memory allocated on phase 11 174106.3752 MB
Reordering completed ...
Number of non-zeros in L 2280281613
Number of non-zeros in U 1
Memory allocated on phase 22 193103.5310 MB

Percentage of computed non-zeros for LL^T factorization
13 % 21 % 28 % 36 % 42 % 49 % 55 % 61 % 66 % 71 % 76 % 80 % 83 % 87 % 90 % 92 % 95 % 96 % 98 % 99 % 100 %
Factorization completed ...
Solve completed ...
The solution of the system is
Relative residual = 3.849001794597505E-003
Error: residual is too high!

TEST FAILED
1

3) You can change the code to call pardiso_64 instead of cluster_sparse_solver_64, and you will find that pardiso_64 does indeed provide the correct solution for nnz >= 2278158750:

pdst
rank= 0 n= 67500
rank= 0 nnz= 2278158750

=== PARDISO: solving a symmetric indefinite system ===
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 48.290804 s
Time spent in reordering of the initial matrix (reorder) : 167.526839 s
Time spent in symbolic factorization (symbfct) : 19.645896 s
Time spent in data preparations for factorization (parlist) : 0.041810 s
Time spent in allocation of internal data structures (malloc) : 0.066780 s
Time spent in additional calculations : 109.997327 s
Total time spent : 345.569456 s

Statistics:
===========
Parallel Direct Factorization is running on 47 OpenMP

< Linear system Ax = b >
number of equations: 67500
number of non-zeros in A: 2278158750
number of non-zeros in A (%): 50.000741

number of right-hand sides: 1

< Factors L and U >
number of columns for each panel: 64
number of independent subgraphs: 0
number of supernodes: 1055
size of largest supernode: 67500
number of non-zeros in L: 2280284560
number of non-zeros in U: 1
number of non-zeros in L+U: 2280284561
Reordering completed ...
=== PARDISO is running in In-Core mode, because iparam(60)=0 ===

Percentage of computed non-zeros for LL^T factorization
4 % 8 % 9 % 18 % 25 % 33 % 40 % 46 % 52 % 58 % 63 % 69 % 73 % 77 % 81 % 85 % 88 % 91 % 93 % 95 % 97 % 98 % 99 % 100 %

=== PARDISO: solving a symmetric indefinite system ===
Single-level factorization algorithm is turned ON

Summary: ( factorization phase )
================

Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 104.567792 s
Time spent in allocation of internal data structures (malloc) : 0.000041 s
Time spent in additional calculations : 0.000001 s
Total time spent : 104.567834 s

Statistics:
===========
Parallel Direct Factorization is running on 47 OpenMP

< Linear system Ax = b >
number of equations: 67500
number of non-zeros in A: 2278158750
number of non-zeros in A (%): 50.000741

number of right-hand sides: 1

< Factors L and U >
number of columns for each panel: 64
number of independent subgraphs: 0
number of supernodes: 1055
size of largest supernode: 67500
number of non-zeros in L: 2280284560
number of non-zeros in U: 1
number of non-zeros in L+U: 2280284561
gflop for the numerical factorization: 102661.226374

gflop/s for the numerical factorization: 981.767179

Factorization completed ...

=== PARDISO: solving a symmetric indefinite system ===

Summary: ( solution phase )
================

Times:
======
Time spent in direct solver at solve step (solve) : 4.053454 s
Time spent in additional calculations : 10.894469 s
Total time spent : 14.947923 s

Statistics:
===========
Parallel Direct Factorization is running on 47 OpenMP

< Linear system Ax = b >
number of equations: 67500
number of non-zeros in A: 2278158750
number of non-zeros in A (%): 50.000741

number of right-hand sides: 1

< Factors L and U >
number of columns for each panel: 64
number of independent subgraphs: 0
number of supernodes: 1055
size of largest supernode: 67500
number of non-zeros in L: 2280284560
number of non-zeros in U: 1
number of non-zeros in L+U: 2280284561
gflop for the numerical factorization: 102661.226374

gflop/s for the numerical factorization: 981.767179

Solve completed ...
The solution of the system is
Relative residual = 0.000000000000000E+000

TEST PASSED
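For reference, the pass/fail check in both runs is a relative residual. Here is a minimal sketch of such a check, assuming the residual is ||Ax - b|| / ||b|| and that A is held as its upper triangle in CSR form; it uses 0-based indexing, unlike the Fortran example. The function names and the 3x3 matrix are illustrative and are not taken from cl_solver_sym_f.f.

```python
import math


def sym_csr_matvec(n, ia, ja, a, x):
    """Return A @ x for a symmetric A stored as its upper triangle in 0-based CSR."""
    y = [0.0] * n
    for i in range(n):
        for k in range(ia[i], ia[i + 1]):
            j = ja[k]
            y[i] += a[k] * x[j]
            if j != i:
                # Mirror the off-diagonal entry into the lower triangle.
                y[j] += a[k] * x[i]
    return y


def relative_residual(n, ia, ja, a, x, b):
    """Return ||A x - b||_2 / ||b||_2."""
    ax = sym_csr_matvec(n, ia, ja, a, x)
    r_norm = math.sqrt(sum((axi - bi) ** 2 for axi, bi in zip(ax, b)))
    b_norm = math.sqrt(sum(bi * bi for bi in b))
    return r_norm / b_norm


# Upper triangle of A = [[4, 1, 0], [1, 3, 1], [0, 1, 2]]; b = A @ [1, 1, 1].
ia = [0, 2, 4, 5]
ja = [0, 1, 1, 2, 2]
a = [4.0, 1.0, 3.0, 1.0, 2.0]
b = [5.0, 5.0, 3.0]

print(relative_residual(3, ia, ja, a, [1.0, 1.0, 1.0], b))  # exact solution
print(relative_residual(3, ia, ja, a, [1.0, 1.0, 1.1], b))  # perturbed solution
```

A residual of order 1e-3, as in the cluster_sparse_solver_64 run, points to a wrong solution rather than rounding error, which is why the test reports a failure.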

Khang_N_Intel · ‎05-20-2022

Hi Pan,

Thank you for providing additional information about this issue!

Yes, the developers are aware of this latest info.

Best regards,

Khang

Khang_N_Intel · ‎05-20-2022

Hi Pan,

Looking at items 2 and 3 in your message, it seems like you set the number of ranks to 1. Is that correct?

I am wondering if the code would exhibit the same behavior if you set the rank to greater than 1.

Thanks,

Khang

Pan11 · ‎05-20-2022

Hi Khang,

I ran items 2 and 3 without mpirun, using just the executable produced by the compile/link process. The reasons are the following:

I have already tested mpirun with 2 or more ranks, and they all fail for large nnz.
For item 2, I wanted to bypass mpirun, which causes the segfault, and check whether the solver still fails to provide the correct solution, as I had found with older versions of MKL.
For item 3, pardiso does not use MPI, so more ranks would be irrelevant.

As I mentioned before, I would love to get in contact with the developers, as I have done extensive testing with both the pardiso and cluster_sparse_solver routines and could provide more info that may be relevant. For example, if you use the two-level factorization algorithm with pardiso_64 (Fortran: iparm(24)=1), it also fails to provide the correct solution for nnz >= 2278158750, in the same way as cluster_sparse_solver_64.

Please, let me know if you need my contact info.

Regards,

Pan