I am finding that cluster_sparse_solver_64 fails to compute the correct solution when the number of non-zeros exceeds 2147483647. Attached is a modification of the cl_solver_sym_f.f example to demonstrate the issue. The example is modified to use a full symmetric matrix of size n. For n=65000 (nnz=2114576615) the solver succeeds. For n=67500 (nnz=2278158750), it fails. The tests fail/pass regardless of MPI size.

The tests require about 200 GB of RAM (if run on a single node) and take several hours to run.
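For reference, the nnz counts in question sit on either side of the 32-bit signed-integer limit (2147483647). A quick sanity check, assuming nnz = n(n+1)/2 for the stored triangle of a full symmetric matrix (this matches the n=67500 figure; the exact count in the modified example may differ slightly):

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2147483647

def nnz_full_symmetric(n: int) -> int:
    """Non-zeros in one triangle (including the diagonal) of a full n x n symmetric matrix."""
    return n * (n + 1) // 2

for n in (65000, 67500):
    nnz = nnz_full_symmetric(n)
    # value a 32-bit signed counter would silently wrap to
    wrapped = ctypes.c_int32(nnz).value
    print(f"n={n}: nnz={nnz}, exceeds INT32_MAX: {nnz > INT32_MAX}, wraps to {wrapped}")
```

For n=67500 the count (2278158750) overflows a 32-bit signed integer and would wrap to a negative value, which is consistent with the failure appearing only above 2147483647.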

Thanks

Hi Pan,

I saw that you filed a ticket for this same issue in the online service center.

I will communicate with you in the online service center.

This thread will be closed.

Best regards,

Khang

Hi Michaleris,

Thanks for reaching out to us.

Could you please let us know the MKL version you are working with?

We suggest you try the new oneMKL 2022.0, in case you are using an older version, and see if it helps.

Please get back to us if the issue still persists even with the latest MKL, along with the steps to reproduce the issue (commands to compile and run), so that we can check it from our end.

Regards,

Vidya.

Hi Vidya, thanks for following up.

I have used version 2020.4.304 with the compile options below:

mpiifort -O4 -fpp -qopenmp -c cl_solver_sym_f.f

mpiifort -L/opt/intel/mkl/lib/intel64 cl_solver_sym_f.o -Wl,--start-group "/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64"/libmkl_blacs_intelmpi_lp64.a "/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64"/libmkl_intel_lp64.a "/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64"/libmkl_core.a "/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64"/libmkl_intel_thread.a -Wl,--end-group -L "/opt/intel/compilers_and_libraries/linux/mkl/../compiler/lib/intel64" -liomp5 -mt_mpi -lm -o pdstest

The test was run on a Dell 7920 running the latest version of Red Hat 8, with:

mpirun -ppn 2 pdstest

Will compile again with oneMKL 2022.0 and report in a day or two.

Thanks, Pan

Hi Pan,

>>*Will compile again with oneMKL 2022.0 and report in a day or two*

Yeah, sure. You can download the oneAPI Base Toolkit, which includes oneMKL 2022, and get the latest compilers by downloading the oneAPI HPC Toolkit.

Here are the links to download

oneAPI Base Toolkit:

https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html

oneAPI HPC Toolkit:

https://www.intel.com/content/www/us/en/developer/tools/oneapi/hpc-toolkit-download.html

This time you can compare with the cl_solver_sym_f example and see if there are any changes. Additionally, you can make use of the Link Line Advisor to get the recommended libraries for your particular use case.

Here is the link:

Regards,

Vidya.

Thanks Vidya,

Just compiled and ran with *oneMKL 2022.0*. It crashed with the following message:

n= 67500

nnz= 2278158750

n= 67500

nnz= 2278158750

Abort(2169359) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Bcast: Other MPI error, error stack:

PMPI_Bcast(453).........................: MPI_Bcast(buf=0xa109880, count=38564, MPI_LONG_LONG_INT, root=1, comm=comm=0x84000005) failed

PMPI_Bcast(438).........................:

MPIDI_Bcast_intra_composition_delta(603):

MPIDI_POSIX_mpi_bcast(131)..............:

MPIR_Bcast_intra_binomial(133)..........: message sizes do not match across processes in the collective routine: Received 151600 but expected 308512

With oneAPI even the original cl_solver_sym_f.f fails, even when running a single process:

*** Error in PARDISO memory allocation: FACT_ADR, size to allocate: 141659056 bytes

The local (internal) PARDISO version is : 176

Minimum degree algorithm at reorder step is turned ON

Time spent in symbolic factorization (symbfct) :

Total time spent :

Parallel METIS algorithm at reorder step is turned ON

=== (null): solving a Hermitian indefinite system ===

forrtl: severe (174): SIGSEGV, segmentation fault occurred

Image PC Routine Line Source

stest 000000000867BCBA Unknown Unknown Unknown

libpthread-2.28.s 00007F1691A88C20 Unknown Unknown Unknown

libc-2.28.so 00007F169180E767 Unknown Unknown Unknown

libc-2.28.so 00007F16917030AF _IO_vfprintf Unknown Unknown

libc-2.28.so 00007F169172A784 vsnprintf Unknown Unknown

stest 00000000004B38BE Unknown Unknown Unknown

stest 00000000004A4593 Unknown Unknown Unknown

stest 000000000043D0E9 Unknown Unknown Unknown

stest 000000000042E14A Unknown Unknown Unknown

stest 000000000040D19B Unknown Unknown Unknown

stest 0000000000407EB5 Unknown Unknown Unknown

stest 0000000000406229 Unknown Unknown Unknown

stest 0000000000406022 Unknown Unknown Unknown

libc-2.28.so 00007F16916D4493 __libc_start_main Unknown Unknown

stest 0000000000405F2E Unknown Unknown Unknown

===================================================================================

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

= RANK 1 PID 48948 RUNNING AT panopt

= KILLED BY SIGNAL: 9 (Killed)

===================================================================================

Vidya, please ignore my previous message. I had linked to the old libraries before. Linking to the new ones does not crash. I will come back tomorrow with the results of the test.

Thanks, Pan

Confirming that with oneAPI the test fails for non-zero counts exceeding 2147483647, similarly to the older versions. You should be able to replicate this at your end.

Hi Pan,

Thanks for letting us know.

We are working on your issue. We will get back to you soon with an update.

Regards,

Vidya.

Hi Vidya,

Thanks for looking into this. One more thing to add: for n=65000, nnz=2112532500, the test compiled with oneAPI results in a segfault during phase 22. With the old compiler it succeeds. So the solver got worse with oneAPI; now it even crashes for nnz less than 2147483647.

Regards, Pan

Hi Vidya, any progress on this? Have you been able to replicate the issue at your end? Thanks, Pan

Hi Pan,

I am trying to find a system with 200GB of RAM to see if I can reproduce the issue that you mentioned.

If I test the code on a cluster, are there any specific cluster requirements for the code to run?

I also noticed the following:

1) You link against the 32-bit integer (LP64) versions of the libraries, libmkl_intel_lp64.a and libmkl_blacs_intelmpi_lp64.a.

Let's switch to the 64-bit integer (ILP64) versions instead, using libmkl_intel_ilp64.a and libmkl_blacs_intelmpi_ilp64.a.
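For illustration only, an ILP64 variant of the link line posted earlier in the thread might look like the following (a sketch, not a verified command; it assumes MKLROOT is set and adds -i8 so that default Fortran integers are 64-bit):

```
mpiifort -O3 -fpp -qopenmp -i8 -c cl_solver_sym_f.f
mpiifort cl_solver_sym_f.o \
  -Wl,--start-group \
    ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a \
    ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a \
    ${MKLROOT}/lib/intel64/libmkl_core.a \
    ${MKLROOT}/lib/intel64/libmkl_intel_thread.a \
  -Wl,--end-group -liomp5 -lpthread -lm -ldl -o pdstest
```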

2) It seems like there is a problem with the stack size. Why don't you increase the stack size using the ulimit command?
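For example, one way to raise the limits before launching (a sketch assuming bash; OMP_STACKSIZE sizes the OpenMP thread stacks, which are separate from the main stack):

```
ulimit -s unlimited          # main-thread stack limit for the current session
export OMP_STACKSIZE=512M    # per-thread stack for the MKL/OpenMP threads
mpirun -ppn 2 ./pdstest
```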

Best,

Khang

Hi Khang:

It should not be so hard to find a machine with 200 GB of RAM. cluster_sparse_solver_64 is intended for solving very large systems that could require terabytes of RAM across several nodes. I set up the test to use a full matrix, which requires the smallest memory consumption, so that one can easily replicate the issue. You can try n=60000, which results in nnz=1800030000 and still, sadly, crashes with the oneAPI compiler.

The oneAPI version is so bad that it crashes running one process on a single node. I tested the older version on a small cluster running 10G TCP.

libmkl_intel_ilp64.a is not acceptable, as it would slow down the rest of my code and further increase memory consumption.

The tests were already performed with ulimit set to unlimited.

Thanks, Pan

Hi Pan,

I just want to let you know that I am waiting for permission to access the cluster system in order to confirm your issue.

In the meantime, can you tell me the version number of the Intel MPI that you are using?

Also, do you still see this same issue when running with more than 2 ranks?

Best,

Khang

which mpirun

/opt/intel/oneapi/mpi/2021.5.1/bin/mpirun

which mpiifort

/opt/intel/oneapi/mpi/2021.5.1/bin/mpiifort

Waiting for the results of mpirun -ppn 4.

Nevertheless, a segfault with 1 or 2 ranks is still not acceptable.

Can you put me in contact with the folks who maintain/develop pardiso/cluster_sparse_solver? I could provide them quite a lot of feedback on the state of these routines.

Thanks, Pan

[pan@panopt v4]$ mpirun -ppn 4 pdstest

rank= 2 n= 60000

rank= 2 nnz= 1800030000

rank= 3 n= 60000

rank= 3 nnz= 1800030000

rank= 1 n= 60000

rank= 1 nnz= 1800030000

rank= 0 n= 60000

rank= 0 nnz= 1800030000

Memory allocated on phase 11 on Rank # 0 107523.4479 MB

Memory allocated on phase 11 on Rank # 1 77333.9227 MB

Memory allocated on phase 11 on Rank # 2 72119.4802 MB

Memory allocated on phase 11 on Rank # 3 66969.6388 MB

Reordering completed ...

Number of non-zeros in L on Rank # 0 1368775758

Number of non-zeros in U on Rank # 0 1

Number of non-zeros in L on Rank # 1 341990777

Number of non-zeros in U on Rank # 1 1

Number of non-zeros in L on Rank # 2 68154240

Number of non-zeros in U on Rank # 2 1

Number of non-zeros in L on Rank # 3 22996740

Number of non-zeros in U on Rank # 3 1

forrtl: severe (174): SIGSEGV, segmentation fault occurred

Image PC Routine Line Source

pdstest 00000000070967BA Unknown Unknown Unknown

libpthread-2.28.s 00007F3838BE0C20 Unknown Unknown Unknown

pdstest 0000000000C5B2C3 Unknown Unknown Unknown

libiomp5.so 00007F38388ECBB3 __kmp_invoke_micr Unknown Unknown

libiomp5.so 00007F3838868903 Unknown Unknown Unknown

libiomp5.so 00007F3838867912 Unknown Unknown Unknown

libiomp5.so 00007F38388ED83C Unknown Unknown Unknown

libpthread-2.28.s 00007F3838BD617A Unknown Unknown Unknown

libc-2.28.so 00007F383616DDF3 clone Unknown Unknown

===================================================================================

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

= RANK 1 PID 396051 RUNNING AT panopt

= KILLED BY SIGNAL: 9 (Killed)

===================================================================================

===================================================================================

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

= RANK 2 PID 396052 RUNNING AT panopt

= KILLED BY SIGNAL: 9 (Killed)

===================================================================================

===================================================================================

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

= RANK 3 PID 396053 RUNNING AT panopt

= KILLED BY SIGNAL: 9 (Killed)

===================================================================================

[pan@panopt v4]$

Hi Pan,

I just tested your code with the latest version of oneMKL, 2022.1, and was able to confirm the issue.

I ran the code with 2 ranks and it crashed with the error: "Bad termination of one of your application..."

The developer will look into this issue.

I will let you know what we find out about this issue.

Best regards,

Khang

Hi Khang, thanks for confirming.

I would appreciate it if you could forward the following to the developer:

**1) The segfault is a new issue that came up with the oneMKL versions of mpirun.**

**2) You can run my code as a single process without mpirun to avoid the segfault, and you will find that cluster_sparse_solver_64 fails to provide the correct solution for nnz > 2278158750:**

pdstest

rank= 0 n= 67500
rank= 0 nnz= 2278158750
Memory allocated on phase 11 174106.3752 MB
Reordering completed ...
Number of non-zeros in L 2280281613
Number of non-zeros in U 1
Memory allocated on phase 22 193103.5310 MB

Percentage of computed non-zeros for LL^T factorization
13 % 21 % 28 % 36 % 42 % 49 % 55 % 61 % 66 % 71 % 76 % 80 % 83 % 87 % 90 % 92 % 95 % 96 % 98 % 99 % 100 %
Factorization completed ...
Solve completed ...
The solution of the system is
Relative residual = 3.849001794597505E-003
Error: residual is too high!
TEST FAILED
1
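For context, the TEST FAILED verdict above comes from a relative-residual check. A minimal pure-Python sketch of that kind of check (using a tiny 2x2 system for illustration, not the actual example code):

```python
import math

def matvec(A, x):
    """Dense matrix-vector product A @ x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def norm(v):
    """Euclidean norm."""
    return math.sqrt(sum(vi * vi for vi in v))

def relative_residual(A, x, b):
    """||A x - b|| / ||b||, the quantity the example prints."""
    r = [axi - bi for axi, bi in zip(matvec(A, x), b)]
    return norm(r) / norm(b)

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = [1.0 / 11.0, 7.0 / 11.0]  # exact solution of A x = b
print(relative_residual(A, x, b))  # near zero for a correct solve
```

The example declares failure when this residual exceeds its tolerance, which is what happens in the run above.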

**3) You can change the code to call pardiso_64 instead of cluster_sparse_solver_64, and then find that pardiso_64 does indeed provide the correct solution for nnz > 2278158750:**

pdst
rank= 0 n= 67500
rank= 0 nnz= 2278158750

=== PARDISO: solving a symmetric indefinite system ===
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 48.290804 s
Time spent in reordering of the initial matrix (reorder) : 167.526839 s
Time spent in symbolic factorization (symbfct) : 19.645896 s
Time spent in data preparations for factorization (parlist) : 0.041810 s
Time spent in allocation of internal data structures (malloc) : 0.066780 s
Time spent in additional calculations : 109.997327 s
Total time spent : 345.569456 s

Statistics:
===========
Parallel Direct Factorization is running on 47 OpenMP

< Linear system Ax = b >
number of equations: 67500
number of non-zeros in A: 2278158750
number of non-zeros in A (%): 50.000741

number of right-hand sides: 1

< Factors L and U >
number of columns for each panel: 64
number of independent subgraphs: 0
number of supernodes: 1055
size of largest supernode: 67500
number of non-zeros in L: 2280284560
number of non-zeros in U: 1
number of non-zeros in L+U: 2280284561
Reordering completed ...
=== PARDISO is running in In-Core mode, because iparam(60)=0 ===

Percentage of computed non-zeros for LL^T factorization
4 % 8 % 9 % 18 % 25 % 33 % 40 % 46 % 52 % 58 % 63 % 69 % 73 % 77 % 81 % 85 % 88 % 91 % 93 % 95 % 97 % 98 % 99 % 100 %

=== PARDISO: solving a symmetric indefinite system ===
Single-level factorization algorithm is turned ON

Summary: ( factorization phase )
================

Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 104.567792 s
Time spent in allocation of internal data structures (malloc) : 0.000041 s
Time spent in additional calculations : 0.000001 s
Total time spent : 104.567834 s

Statistics:
===========
Parallel Direct Factorization is running on 47 OpenMP

< Linear system Ax = b >
number of equations: 67500
number of non-zeros in A: 2278158750
number of non-zeros in A (%): 50.000741

number of right-hand sides: 1

< Factors L and U >
number of columns for each panel: 64
number of independent subgraphs: 0
number of supernodes: 1055
size of largest supernode: 67500
number of non-zeros in L: 2280284560
number of non-zeros in U: 1
number of non-zeros in L+U: 2280284561
gflop for the numerical factorization: 102661.226374
gflop/s for the numerical factorization: 981.767179

Factorization completed ...

=== PARDISO: solving a symmetric indefinite system ===

Summary: ( solution phase )
================

Times:
======
Time spent in direct solver at solve step (solve) : 4.053454 s
Time spent in additional calculations : 10.894469 s
Total time spent : 14.947923 s

Statistics:
===========
Parallel Direct Factorization is running on 47 OpenMP

< Linear system Ax = b >
number of equations: 67500
number of non-zeros in A: 2278158750
number of non-zeros in A (%): 50.000741

number of right-hand sides: 1

< Factors L and U >
number of columns for each panel: 64
number of independent subgraphs: 0
number of supernodes: 1055
size of largest supernode: 67500
number of non-zeros in L: 2280284560
number of non-zeros in U: 1
number of non-zeros in L+U: 2280284561
gflop for the numerical factorization: 102661.226374
gflop/s for the numerical factorization: 981.767179

Solve completed ...
The solution of the system is
Relative residual = 0.000000000000000E+000
TEST PASSED

Hi Pan,

Thank you for providing additional information about this issue!

Yes, the developers are aware of this latest info.

Best regards,

Khang

Hi Pan,

Looking at items 2 and 3 in your message, it seems like you set the number of ranks to 1. Is that correct?

I am wondering if the code would exhibit the same behavior if you set the number of ranks to greater than 1.

Thanks,

Khang

Hi Khang,

I ran items 2 and 3 without mpirun, just the executable from the compilation/linking process. The reasons are the following:

- I have already tested mpirun with 2 or more ranks, and they all fail for large nnz
- for item 2, I wanted to bypass mpirun, which causes the segfault, and check whether the solver still fails to provide the correct solution, as I had tested with older versions of MKL
- for item 3, pardiso does not use MPI, so more ranks would be irrelevant.

As I mentioned before, I would love to get in contact with the developers, as I have done extensive testing with both the pardiso and cluster_sparse_solver routines and could provide more info that may be relevant. For example, for pardiso_64, if you use the two-level factorization algorithm (Fortran: iparm(24)=1), it also fails to provide the correct solution for nnz > 2278158750, the same way as cluster_sparse_solver_64.

Please, let me know if you need my contact info.

Regards,

Pan
