Intel® oneAPI Math Kernel Library

Re-use cluster pardiso factorization crashed

Letian_W_
Beginner

I'm trying to use cluster PARDISO to solve large systems of equations with a complex Hermitian matrix. I have many (>1000) right-hand sides but the matrix A is the same for all of them, so I run the phase 11 and 22 factorization first and then want to reuse that factorization to solve for each right-hand side. I know PARDISO can solve for all right-hand sides in one call, but that ran into memory problems because my matrix is very large. So I run phases 11 and 22 first, then read each right-hand side from a binary file and call phase 33 in a do loop. For the first 20 or so right-hand sides the code runs fine and appears to compute the correct solution, but then the program crashes. I suspect a memory leak.
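For reference, a middle ground between one phase 33 call for all right-hand sides and one call per right-hand side is to solve in fixed-size batches. The sketch below only plans the index ranges and does not call the solver; the helper name `rhs_batches`, the total of 1000 and the batch size of 64 are illustrative assumptions, and each planned batch would correspond to one phase 33 call with `nrhs` set to the batch count.

```python
def rhs_batches(n_rhs, batch_size):
    """Yield (first_index, count) pairs that cover n_rhs right-hand sides.

    Hypothetical helper: it only plans the batches, it does not solve anything.
    """
    if n_rhs < 0 or batch_size <= 0:
        raise ValueError("n_rhs must be >= 0 and batch_size must be > 0")
    start = 0
    while start < n_rhs:
        # The last batch may be smaller than batch_size.
        count = min(batch_size, n_rhs - start)
        yield start, count
        start += count


# Illustrative values (assumptions): 1000 right-hand sides, 64 per batch.
batches = list(rhs_batches(1000, 64))
print(len(batches), batches[0], batches[-1])
```

Whether a given batch size fits in memory depends on the machine, so the batch size would have to be found by experiment.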

Any suggestions or ideas are welcome. Here is the error message; I ran the code on 4 nodes. Line 171 is where I call PARDISO in phase 33. Thanks.

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source            
z2d.exe            0000000005A33DC5  Unknown               Unknown  Unknown
z2d.exe            0000000005A319E7  Unknown               Unknown  Unknown
z2d.exe            00000000059EDD14  Unknown               Unknown  Unknown
z2d.exe            00000000059EDB26  Unknown               Unknown  Unknown
z2d.exe            00000000059B0B36  Unknown               Unknown  Unknown
z2d.exe            00000000059B411E  Unknown               Unknown  Unknown
libpthread.so.0    00000036B220F710  Unknown               Unknown  Unknown
z2d.exe            00000000059B4040  Unknown               Unknown  Unknown
libpthread.so.0    00000036B220F710  Unknown               Unknown  Unknown
libmpi.so.12       00002B23E8A8BEE0  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source            
z2d.exe            0000000005A33DC5  Unknown               Unknown  Unknown
z2d.exe            0000000005A319E7  Unknown               Unknown  Unknown
z2d.exe            00000000059EDD14  Unknown               Unknown  Unknown
z2d.exe            00000000059EDB26  Unknown               Unknown  Unknown
z2d.exe            00000000059B0B36  Unknown               Unknown  Unknown
z2d.exe            00000000059B411E  Unknown               Unknown  Unknown
libpthread.so.0    0000003A8BC0F710  Unknown               Unknown  Unknown
libmpi.so.12       00002B21F3490FB2  Unknown               Unknown  Unknown
libmpi.so.12       00002B21F32D6FBC  Unknown               Unknown  Unknown
libmpi.so.12       00002B21F33E2E39  Unknown               Unknown  Unknown
libmpi.so.12       00002B21F33E347A  Unknown               Unknown  Unknown
libmpi.so.12       00002B21F32C2788  Unknown               Unknown  Unknown
libmpi.so.12       00002B21F32C100A  Unknown               Unknown  Unknown
libmpi.so.12       00002B21F32C02CF  Unknown               Unknown  Unknown
libmpi.so.12       00002B21F32C3A2B  Unknown               Unknown  Unknown
libmpi.so.12       00002B21F32C343E  Unknown               Unknown  Unknown
z2d.exe            00000000015DCDE2  Unknown               Unknown  Unknown
z2d.exe            00000000005D3CEB  Unknown               Unknown  Unknown
z2d.exe            0000000000557699  Unknown               Unknown  Unknown
z2d.exe            0000000000553999  MAIN__                    171  z2d_1by1.f90
z2d.exe            0000000000552C1E  Unknown               Unknown  Unknown
libc.so.6          0000003A8B81ED1D  Unknown               Unknown  Unknown
z2d.exe            0000000000552B29  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source            
z2d.exe            0000000005A33DC5  Unknown               Unknown  Unknown
z2d.exe            0000000005A319E7  Unknown               Unknown  Unknown
z2d.exe            00000000059EDD14  Unknown               Unknown  Unknown
z2d.exe            00000000059EDB26  Unknown               Unknown  Unknown
z2d.exe            00000000059B0B36  Unknown               Unknown  Unknown
z2d.exe            00000000059B411E  Unknown               Unknown  Unknown
libpthread.so.0    0000003B9840F710  Unknown               Unknown  Unknown
libmpi.so.12       00002B5AFB7954A0  Unknown               Unknown  Unknown
libmpi.so.12       00002B5AFB8C5FD0  Unknown               Unknown  Unknown
libmpi.so.12       00002B5AFB70BFBC  Unknown               Unknown  Unknown
libmpi.so.12       00002B5AFB817E39  Unknown               Unknown  Unknown
libmpi.so.12       00002B5AFB81847A  Unknown               Unknown  Unknown
libmpi.so.12       00002B5AFB6F7788  Unknown               Unknown  Unknown
libmpi.so.12       00002B5AFB6F600A  Unknown               Unknown  Unknown
libmpi.so.12       00002B5AFB6F52CF  Unknown               Unknown  Unknown
libmpi.so.12       00002B5AFB6F8A2B  Unknown               Unknown  Unknown
libmpi.so.12       00002B5AFB6F843E  Unknown               Unknown  Unknown
z2d.exe            00000000015DCDE2  Unknown               Unknown  Unknown
z2d.exe            00000000005D3CEB  Unknown               Unknown  Unknown
z2d.exe            0000000000557699  Unknown               Unknown  Unknown
z2d.exe            0000000000553999  MAIN__                    171  z2d_1by1.f90
z2d.exe            0000000000552C1E  Unknown               Unknown  Unknown
libc.so.6          0000003B9801ED1D  Unknown               Unknown  Unknown
z2d.exe            0000000000552B29  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source            
z2d.exe            0000000005A33DC5  Unknown               Unknown  Unknown
z2d.exe            0000000005A319E7  Unknown               Unknown  Unknown
z2d.exe            00000000059EDD14  Unknown               Unknown  Unknown
z2d.exe            00000000059EDB26  Unknown               Unknown  Unknown
z2d.exe            00000000059B0B36  Unknown               Unknown  Unknown
z2d.exe            00000000059B411E  Unknown               Unknown  Unknown
libpthread.so.0    000000382FC0F710  Unknown               Unknown  Unknown
z2d.exe            00000000059B4040  Unknown               Unknown  Unknown
libpthread.so.0    000000382FC0F710  Unknown               Unknown  Unknown
libmpi.so.12       00002AAD1016F800  Unknown               Unknown  Unknown
libmpi.so.12       00002AAD0FFB114B  Unknown               Unknown  Unknown
libmpi.so.12       00002AAD100BCE39  Unknown               Unknown  Unknown
libmpi.so.12       00002AAD100BCB32  Unknown               Unknown  Unknown
libmpi.so.12       00002AAD0FF962F9  Unknown               Unknown  Unknown
libmpi.so.12       00002AAD0FF95D5D  Unknown               Unknown  Unknown
libmpi.so.12       00002AAD0FF95BDC  Unknown               Unknown  Unknown
libmpi.so.12       00002AAD0FF95B0C  Unknown               Unknown  Unknown
libmpi.so.12       00002AAD0FF97932  Unknown               Unknown  Unknown
z2d.exe            00000000015DCCB9  Unknown               Unknown  Unknown
z2d.exe            00000000005C59D5  Unknown               Unknown  Unknown
z2d.exe            0000000000557A80  Unknown               Unknown  Unknown
z2d.exe            0000000000553999  MAIN__                    171  z2d_1by1.f90
z2d.exe            0000000000552C1E  Unknown               Unknown  Unknown
libc.so.6          000000382F81ED1D  Unknown               Unknown  Unknown
z2d.exe            0000000000552B29  Unknown               Unknown  Unknown

Gennady_F_Intel
Moderator

Letian, are you talking about version 11.3.3 of MKL? How can we reproduce this case? Can you provide an example?

Letian_W_
Beginner

Gennady,

Please find the attached example. Here is what I tried:

  1. Used one right-hand side: got the correct solution.

  2. Put an N x 63 matrix into B and solved with 63 right-hand sides: also got the correct solutions.

  3. Put an N x 1000 matrix into B: the system ran out of memory during back substitution and the program crashed.

  4. As an alternative, I did the substitution one right-hand side at a time in a do loop; the program then crashed after a number of solutions. Attached is the test code with which I could reproduce the problem on the GE Global Research Linux cluster:

    1. Z2d_1by1_demo.f90: source code; generates a large sparse matrix (3.5M x 3.5M), runs phase=11, then phase=22, then 2000 phase=33 solves in a do loop

    2. Z2d.out: output file; the run stopped at the 245th right-hand side

    3. Use_script.stderr: error messages

    4. Rank?????.error: the ERROR value returned by cluster_pardiso after each iteration; every returned ERROR was 0, but the code crashed at iteration 245

    5. I ran this code with 4 MPI processes and 20 OpenMP threads per process
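To put the memory behaviour of steps 1 to 3 in numbers, here is a rough estimate of the size of a dense right-hand-side array for the 3.5M x 3.5M case. It assumes double-precision complex entries (16 bytes each), which the post does not state, and it counts only one dense N x nrhs array; the solution array needs the same amount again, and the solver's internal workspace is not included.

```python
BYTES_PER_ENTRY = 16  # assumption: double-precision complex (2 x 8 bytes)


def dense_rhs_bytes(n_rows, n_rhs, bytes_per_entry=BYTES_PER_ENTRY):
    """Bytes needed to store one dense n_rows x n_rhs array."""
    return n_rows * n_rhs * bytes_per_entry


N = 3_500_000  # matrix dimension from the test case (3.5M x 3.5M)

for n_rhs in (1, 63, 1000):
    gb = dense_rhs_bytes(N, n_rhs) / 1e9
    print(f"nrhs={n_rhs:5d}: {gb:8.3f} GB per array")
```

Under these assumptions B alone takes about 56 GB for 1000 right-hand sides, compared with about 3.5 GB for 63.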

 

Please test this on the Intel side and see whether you can reproduce the error. If there is a problem in my code, please let me know.

 

By the way, this is the timing output I see during the substitution phase:

Times:

======

Time spent in direct solver at solve step (solve)                : 1.819070 s

Time spent in additional calculations                            : 16.224561 s

Total time spent                                                 : 18.043631 s

 

The additional calculations seem to take much more time than the solver itself, which makes the back substitution inefficient. What are these additional calculations?
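As a quick check on the numbers above, the share of each phase 33 call spent in "additional calculations" can be worked out directly from the reported times. The projection to 2000 solves assumes every call takes the same time as the one reported, which is only an assumption.

```python
# Times reported by the solver for one phase 33 call, in seconds.
solve_s = 1.819070
extra_s = 16.224561
total_s = 18.043631

# Fraction of the call spent outside the direct solve step.
extra_share = extra_s / total_s

# Assumption: all 2000 solves in the loop cost the same as this one.
n_solves = 2000
projected_hours = total_s * n_solves / 3600

print(f"additional calculations: {extra_share:.1%} of each call")
print(f"projected time for {n_solves} solves: {projected_hours:.1f} h")
```

So roughly nine tenths of each call is spent in the additional calculations, and the full loop would take on the order of ten hours at this rate.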

 

Thanks.

 

Letian

 
