Intel® oneAPI Math Kernel Library

Direct Sparse Solver for Clusters Crash when using MPI Nested Dissection Algorithm

William_D_2
Beginner
792 Views

I have a code that calls the Direct Sparse Solver for Clusters interface. I get an error when I run it with the option for the MPI-based nested dissection. Documentation can be found here: https://software.intel.com/en-us/mkl-developer-reference-c-cluster-sparse-solver-iparm-parameter

When I have iparam[1] = 3, everything works fine.

When I set iparam[1] = 10, I get no errors, no warnings, and no output, even with msglvl set to 1. I assume this is because the system is crashing really hard.

I am using the 64-bit interface of the solver, and not using the MPI-based dissection is not an option (my matrix is 12 billion by 12 billion and has 50 billion non-zero elements). I am using the latest version of the MKL cluster library.
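For readers following along, here is a minimal sketch of the kind of parameter setup described above. It does not call MKL. The helper names `setup_iparm` and `needs_64bit`, and the `first_row`/`last_row` arguments, are hypothetical. The meanings attached to iparm[34], iparm[39], iparm[40] and iparm[41] are my reading of the iparm reference linked above and should be checked against it.

```c
#include <stdio.h>

/* 64-bit integer type standing in for the solver's 64-bit integer. */
typedef long long idx_t;

/* Returns 1 if a non-zero count does not fit in a signed 32-bit integer. */
int needs_64bit(idx_t nnz)
{
    return nnz > 2147483647LL;
}

/* Hypothetical helper: fill a 64-entry parameter array for a matrix that
   is distributed by rows and reordered with MPI-based nested dissection.
   first_row and last_row are the rows owned by the calling MPI process. */
void setup_iparm(idx_t iparm[64], idx_t first_row, idx_t last_row)
{
    for (int i = 0; i < 64; ++i) {
        iparm[i] = 0;
    }
    iparm[0]  = 1;          /* use the values below, not solver defaults   */
    iparm[1]  = 10;         /* MPI version of nested dissection            */
    iparm[34] = 1;          /* assumption: zero-based (C-style) indexing   */
    iparm[39] = 1;          /* assumption: nonzero = distributed input;    */
                            /* see the iparm reference for the variants    */
    iparm[40] = first_row;  /* assumption: first row of this process       */
    iparm[41] = last_row;   /* assumption: last row of this process        */
}
```

With 50 billion non-zeros, `needs_64bit` returns 1, which is why the 64-bit interface is required here.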

I just spent two weeks modifying the code to remove overlaps in the matrix elements to use this feature.
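As an illustration of what "no overlaps" means in practice, the sketch below splits n rows into contiguous, non-overlapping blocks, one per MPI process. This is only one possible partitioning, not necessarily the one used in the original code. The names `row_range` and `owned_rows` are hypothetical.

```c
#include <stdio.h>

typedef long long idx_t;

/* Inclusive, zero-based range of rows owned by one process. */
typedef struct {
    idx_t first;
    idx_t last;
} row_range;

/* Block partition: the first (n % nprocs) ranks get one extra row.
   Ranges of different ranks never intersect and together cover 0..n-1.
   If nprocs > n, the trailing ranks get an empty range (last < first). */
row_range owned_rows(idx_t n, int nprocs, int rank)
{
    idx_t base = n / nprocs;
    idx_t rem  = n % nprocs;
    row_range r;
    r.first = (idx_t)rank * base + (rank < rem ? (idx_t)rank : rem);
    r.last  = r.first + base + (rank < rem ? 1 : 0) - 1;
    return r;
}
```

For a 48x48 matrix on four processes this gives rows 0-11, 12-23, 24-35 and 36-47.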

What is going wrong?

Edit #1: Changed iparam[39] to iparam[1].

 

8 Replies
Gennady_F_Intel
Moderator

>> When I set it to 10, I get no errors, no warnings, no output when msglvl is set to one. I assume this is because the system is crashing really hard.

Do you mean when iparm[39] == 1 or 0?

William_D_2
Beginner

Gennady,

Please see the revised question. I meant iparam[1] = 10;

Gennady_F_Intel
Moderator

If you have split the input matrix properly across the MPI processes, then this looks like a bug in the MPI version of the nested dissection algorithm. How many MPI processes did you run, and how much RAM does each node have?

 

William_D_2
Beginner

I am currently testing this code on my local machine. It has a single 4-core CPU and 32 GB of memory in total. I am running four MPI processes on a deliberately tiny problem (a 48x48 matrix), just to feel out the algorithm.

I was able to make some modifications to my other code, and I am now getting an error value of -2. I know this means there is not enough memory, but that makes no sense: I currently have 21 GB available, and the rest of the data takes no more than 1 GB in my trial runs.
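A small sketch of how the returned error value could be turned into a readable message while debugging. The function name `solver_error_text` is hypothetical. The thread itself only confirms the meaning of -2; the other entries are my recollection of the solver's documented error codes and should be verified against the reference.

```c
#include <stdio.h>

/* Hypothetical helper: map the solver's error value to a short message.
   Only -2 is confirmed in this thread; the rest are assumptions taken
   from the documented error list and should be double-checked. */
const char *solver_error_text(long long error)
{
    switch (error) {
    case 0:  return "no error";
    case -1: return "input inconsistent";
    case -2: return "not enough memory";
    case -3: return "reordering problem";
    case -4: return "zero pivot or numerical factorization problem";
    case -5: return "unclassified (internal) error";
    default: return "other error, see the solver documentation";
    }
}
```

Printing `solver_error_text(error)` after each phase makes it easier to see which phase first reports the failure.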

William_D_2
Beginner

This out-of-memory error occurs even when I have a single MPI process and iparam[1] = 10. With a single MPI process, the solver is supposed to fall back to one of the other partitioning algorithms, but it is not doing that. So the failure must be in a memory allocation that happens by default whenever the flag is set.
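Until the fallback behaves as expected, one possible workaround is to pick the reordering value explicitly from the number of MPI processes. The sketch below is an assumption about how that could look, not a confirmed fix. The name `choose_reordering` is hypothetical, and `nprocs` would come from `MPI_Comm_size` in a real program.

```c
#include <stdio.h>

/* Values for iparm[1] as used in this thread. */
enum {
    REORDER_OPENMP_ND = 3,   /* the setting reported to work       */
    REORDER_MPI_ND    = 10   /* MPI version of nested dissection   */
};

/* Hypothetical helper: request the MPI-based algorithm only when there
   is more than one MPI process, otherwise use the working setting. */
long long choose_reordering(int nprocs)
{
    return nprocs > 1 ? REORDER_MPI_ND : REORDER_OPENMP_ND;
}
```

This only sidesteps the single-process case; it does not explain the -2 error with four processes.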

William_D_2
Beginner

@Gennady

Is there any hope of this bug being fixed in the 2018 builds? I am using this library as part of my dissertation project and would like to know if I need to shift to a different library to resolve the problem.

Gennady_F_Intel
Moderator

The problem is that we couldn't reproduce this behavior on our side with the latest MKL 2018 beta u1 or with MKL 2017.u3.

Could you give us a reproducer so that we can check the problem on our side?

William_D_2
Beginner

Gennady,

Sorry for the late reply; I was making sure that no other bugs in my code were causing the problem. I found some other inconsistencies, but after fixing them I came back to the same conclusion. Attached are a couple of codes that reproduce the bugs I saw. They are modeled on my own code, with only superficial changes needed to simplify the interface. There is also a README with some more relevant details.

Let me know what you guys think/see.
