Hello, I'm adding here the details of a support ticket I opened last month, to check whether anyone in the community has come across this issue. We use the MKL parallel cluster solver together with Intel MPI in our HPC software (called FDS). The software has to solve a Poisson equation thousands of times using the solve phase of the MKL cluster solver. We have noticed that memory use grows as the cluster solver is called repeatedly, eventually leading to a catastrophic out-of-memory error in MPI.
I isolated the repeated use of the MKL cluster solver in a standalone program, completely separate from our software, and still see the memory use increase.
Please try following the instructions in the README file in this tarball to compile the code and run the case, and see if your memory use increases (it takes a few hours of runtime). I have verified this behavior on two Linux clusters, with Centos 6 and 7 and Intel Parallel Studio versions 2018, 2019 and the latest 2020.
I would really appreciate any help on this.
Hi Gennady, thank you for taking an interest! On both clusters I used a submission script that fits the 8 MPI processes in one node (one cluster has 8 physical cores per node and the other 12). This is the example (torque) for burn (12-core nodes):
#PBS -N test_glmat
#PBS -W umask=0022
#PBS -e /home4/mnv/FIREMODELS_FORK/CLUSTER_SPARSE_SOLVER_TEST/test/test_glmat.err
#PBS -o /home4/mnv/FIREMODELS_FORK/CLUSTER_SPARSE_SOLVER_TEST/test/test_glmat.log
#PBS -l nodes=1:ppn=8
#PBS -l walltime=999:0:0
module load null modules torque-maui intel/19u4
echo " Directory: `pwd`"
echo " Host: `hostname`"
/opt/intel19/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpiexec -np 8 /home4/mnv/FIREMODELS_FORK/CLUSTER_SPARSE_SOLVER_TEST/test/css_test
And here is an example submission of the test for blaze, our other cluster with 8 cores per node (SLURM):
#SBATCH -J test_glmat
#SBATCH -e /home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST/test/test_glmat.err
#SBATCH -o /home/mnv/FireModels_fork/CLUSTER_SPARSE_SOLVER_TEST/test/test_glmat.log
#SBATCH -p batch
#SBATCH -n 8
#SBATCH -t 99-99:99:99
echo " Input file: test_glmat.fds"
echo " Directory: `pwd`"
echo " Host: `hostname`"
mpirun -n 8 YOUR_DIR/CLUSTER_SPARSE_SOLVER_TEST/test/css_test
Running on a single workstation should give the same outcome. I'm trying to understand whether any combination of memory flags or routine calls would take care of the leak I'm seeing, but I haven't been successful so far.
BTW, this is how it crashes in both cases (what it writes to the .err file, or screen):
NSOLVES = 166800
NSOLVES = 166900
NSOLVES = 167000
NSOLVES = 167100
NSOLVES = 167200
NSOLVES = 167300
NSOLVES = 167400
NSOLVES = 167500
NSOLVES = 167600
NSOLVES = 167700
Abort(606162959) on node 6 (rank 6 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(499).....: MPI_Comm_split(comm=0xc4000012, color=1, key=0, new_comm=0x7ffcc8948b30) failed
MPIR_Info_alloc(61)......: Out of memory (unable to allocate a 'MPI_Info')
So far I have run short experiments (10K iterations, with MKL 2020) and see that the memory consumed by the program stays the same. I used the vmstat utility to track the process. We will run the whole benchmark (250K iterations), which will take significant time. I will keep this thread updated.
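For anyone repeating this measurement, a per-process view can complement vmstat, which reports system-wide figures. The sketch below is not what was used in the experiments above. It assumes a Linux /proc filesystem, and the function names rss_kb and log_rss are made up for illustration.

```shell
# rss_kb FILE: print the VmRSS value (kB) from a /proc/<pid>/status-style file.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "$1"
}

# log_rss PID SECONDS: print "<epoch seconds> <rss kB>" every SECONDS
# until /proc/PID/status is no longer readable (i.e. the process has exited).
log_rss() {
    while [ -r "/proc/$1/status" ]; do
        printf '%s %s\n' "$(date +%s)" "$(rss_kb "/proc/$1/status")"
        sleep "$2"
    done
}
```

For example, `log_rss 12345 60 > rss_rank.log` would sample one MPI rank with the (hypothetical) PID 12345 once a minute. A second column that keeps growing while NSOLVES advances would be consistent with the leak described in this thread.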
If possible, instead of calling mkl_free_buffers in your solving loop, could you try calling PARDISO with phase = -1 and tell us if you still observe the memory leak? I hope this will help our investigation. If you want, you can call mkl_free_buffers, but only after the very last call to MKL routines (i.e., not inside the loop).
Hi Kirill, thank you for your interest. I've tried with and without the mkl_free_buffers call within the solve loop, with the same outcome. It doesn't seem to make any difference. Now, about calling the cluster_sparse_solver with phase -1: wouldn't that release the stored factorization, so that I could no longer keep calling the solve phase within the loop?
My other question is, have you been able to reproduce the behavior?
I ran these experiments and see the same problem that Marcos reported:
NSOLVES = 166000
NSOLVES = 167000
Abort(471945231) on node 6 (rank 6 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(499).....: MPI_Comm_split(comm=0xc4000012, color=1, key=0, new_comm=0x7ffc47d65630) failed
Hi Gennady, thank you for checking this. It is interesting that the error happens at the same instance, even though we are running the case with different hardware.
Let's see if the issue is escalated.
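To compare where two runs stop, the last solve count and the first abort line can be pulled out of each .err file. This is only a sketch: it assumes the log format shown in this thread ("NSOLVES = <n>" lines followed by an "Abort(" line), and the function names last_nsolves and first_abort are invented for the example.

```shell
# last_nsolves FILE: print the last "NSOLVES = <n>" count found in FILE.
last_nsolves() {
    awk '$1 == "NSOLVES" && $2 == "=" { n = $3 } END { if (n != "") print n }' "$1"
}

# first_abort FILE: print the first line starting with "Abort(", if any.
first_abort() {
    awk '/^Abort\(/ { print; exit }' "$1"
}
```

Running `last_nsolves test_glmat.err` on the logs from both clusters gives the two counts to compare side by side.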
Please check version 2020 Update 1 of both MKL and MPI.
I checked the example you shared with that version and see that the test passes.
NSOLVES = 249940
NSOLVES = 249950
NSOLVES = 249960
NSOLVES = 249970
NSOLVES = 249980
NSOLVES = 249990
NSOLVES = 250000
Just adding to what Gennady said, for clarification: the issue (as far as our suggestion goes) is related to MPI and not the Cluster Sparse Solver. So, if for any reason you don't want to use a newer MKL, using a newer MPI should fix the problem already.
Hi Gennady and Kirill, thank you for your help. I tested the sample case in one of our clusters with intel 2020 update 1 and it also passed.
We tried installing update 1 on another cluster that runs Centos 6, and we are having library issues (glibc_2.14 is missing). It seems the latest suite will not work on Centos 6. Is there a workaround for this?
Thank you very much,
First, as this page says (https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-2020-system-requirements), MKL 2020 and later officially support Centos versions no older than 7.x. So a good solution would be to upgrade the OS on the cluster nodes.
Second, you can try updating the glibc and gnu-utils packages (or getting a newer version locally) and see whether this fixes the problem. Unfortunately, I cannot give more specific advice or a workaround here.
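Before and after such an update, it may help to confirm which glibc a node actually provides. The following is a minimal sketch, assuming a glibc-based Linux where `getconf GNU_LIBC_VERSION` is available. The helper name version_ge is made up for this example, and the 2.14 threshold is taken from the error message quoted above, not from the official requirements.

```shell
# version_ge A B: succeed if dotted version A is >= B, comparing numerically
# field by field (missing fields count as 0).
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            if ((x[i] + 0) > (y[i] + 0)) exit 0
            if ((x[i] + 0) < (y[i] + 0)) exit 1
        }
        exit 0
    }'
}

required=2.14   # assumption: taken from the "glibc_2.14 is missing" message
# On glibc-based systems this prints something like "glibc 2.17".
found=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{ print $2 }')

if [ -z "$found" ]; then
    echo "could not determine the glibc version"
elif version_ge "$found" "$required"; then
    echo "glibc $found is at least $required"
else
    echo "glibc $found is older than $required"
fi
```

The comparison is numeric per field, so 2.9 is correctly treated as older than 2.14, which a plain string comparison would get wrong.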
Thank you Kirill, we will upgrade to Centos 7, once we are able to return to our physical workspace.
I am trying to build the MPI wrapper for MKL on my Mac workstation, which runs Mac OSX Catalina and Open MPI 4.0.2 (provided by Homebrew). When I execute the command to make the custom BLACS, I get the result in the attached figure. It seems some variables in mklmpi have been deprecated in MPI 3.0?
Please let me know if I should start a different thread in the forum.
Marcos, in general, starting a new thread would be better, since it makes the issues easier to track. Regarding the MPI macros problem: it seems you are using one of the latest versions of Open MPI, 4.0.2, which MKL has not validated at this time. Here is the link to the MKL system requirements for your reference.
Here is the link to the Open MPI FAQ: https://www.open-mpi.org/faq/?category=mpi-removed#mpi-1-mpi-lb-ub where you see this problem has been discussed. We hope that helps.
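To see which lines of a wrapper source are affected, one option is to search it for names that were removed from the MPI standard in MPI-3.0, such as the MPI_LB and MPI_UB mentioned in the FAQ link. This is a hedged sketch: the function name find_removed_mpi1 is hypothetical, the name list is a partial selection and not the complete set, and the file to scan is whatever source your build actually compiles.

```shell
# find_removed_mpi1 FILE...: list lines that use names removed in MPI-3.0.
# The name list is a partial, illustrative selection.
find_removed_mpi1() {
    grep -nwE 'MPI_(LB|UB|Address|Type_extent|Type_lb|Type_ub|Type_struct|Type_hindexed|Type_hvector|Errhandler_create|Errhandler_get|Errhandler_set)' "$@"
}
```

Each hit is printed with its line number. The -w flag restricts matches to whole words, so current calls such as MPI_Type_create_struct are not reported.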