Direct Sparse Solver for Clusters Crash when using MPI Nested Dissection Algorithm

William_D_2 · ‎06-10-2017

I have a code that calls the the Direct Sparse Solver for Clusters Interface.I have an error when I run it using the option for the MPI based nested dissection. Documentation can be found here: https://software.intel.com/en-us/mkl-developer-reference-c-cluster-sparse-solver-iparm-parameter

When i have iparam[1] = 3 everything works fine.

When I set it iparam[1] = 10, I get no errors, no warnings, no output when msglvl is set to one. I assume this is because the system is crashing really hard.

I am using the 64-bit interface of the solver and not using the MPI-based dissection is not an option (my matrix has 50 billion non-zero elements and is 12 billion by 12 billion). I am using the Latest version of the MKL cluster library.

I just spent two weeks modifying the code to remove overlaps in the matrix elements to use this feature.

What Is going wrong?

Edit #1: Changed iparam[39] to iparam[1].

Gennady_F_Intel · ‎06-11-2017

>> When I set it to 10, I get no errors, no warnings, no output when msglvl is set to one. I assume this is because the system is crashing really hard.

do you mean when iparm[39] == 1 or 0?

William_D_2 · ‎06-11-2017

Gennady,

Please See Revised question. I meant iparam[1] = 10;

Gennady_F_Intel · ‎06-11-2017

In the case if you properly split the input matrix across MPI processes then it look like the bug in MPI version of nested dissection algorithm. How many MPI processes did you run? and what is the size of RAM on each of nodes?

William_D_2 · ‎06-11-2017

So i am currently testing this code on my local machine. It is a single CPU (4-core) is 32 GB of memory total. I am running four MPI processes and specifically an extremely small problem size (48x48 matrix). This is just trying to feel out the algorithm.

I was able to make some modifications to my other code and I am now getting an error value of -2. I know this means there is not enough memory, but it makes no sense. I have 21 GB available currently and the rest of the data takes no more than 1 GB, for my trial runs.

William_D_2 · ‎06-11-2017

This out of memory error occurs even when I have a single MPI process and iparam[1] =10. When running a single MPI process it is supposed to failsafe to one of the other partitioning algorithms. It is not doing that however. So this must be in a memory allocation that is happening by default if the flag is set.

William_D_2 · ‎06-19-2017

@Gennady

Is there any hope of this bug being fixed in the 2018 builds? I am using this library as part of my dissertation project and would like to know if I need to shift to a different library to resolve the problem.

Gennady_F_Intel · ‎06-19-2017

the problem is we couldn't reproduce this behavior on our side with the latest MKL 2018 beta u1 and MKL 2017.u3.

Could you give us the reproducer to check the problem on our side?

William_D_2 · ‎07-07-2017

Gennady,

Sorry for the long reply I was making sure there were other bugs in my code that were not problems. i found some other inconsistencies but fixed those and came back to the same conclusion. Attached is a couple of codes that reproduce the bugs I saw. It is formatted after my own code with only superficial changes due to the need to simplify the interface. Also there is a README with some more relevant details.

Let me know what your guys think/see.