My cluster has 24 CPUs per node with 256 GB RAM and InfiniBand. We have MPICH, MVAPICH2, Open MPI, and Intel MPI all installed.
I studied the example cl_solver_sym_sp_0_based_c.c in cluster_sparse_solverc/source. I compiled it using:
make libintel64 example=cl_solver_sym_sp_0_based_c
It runs fine. However, the matrix is too small to look at performance, so I modified the example to read a 3,000,000 x 3,000,000 matrix from a text file. When I run it without any MPI, using just:
./cl_solver_sym_sp_0_based_c
It solves quickly, factoring the matrix in 30 seconds. A 'top' command shows the CPU usage go to 2400%.
If I run mpirun or mpiexec -np 24 ./cl_solver_sym_sp_0_based_c, then the factorization takes nearly 10X longer! A 'top' shows each process using 100% CPU.
I think I am doing something wrong with mpirun/mpiexec? I would expect it to give the same factorization times as just running it directly. I also tried playing around with the OMP_NUM_THREADS variable, but nothing seemed to improve the factorization times. Here is some output of my shell history (a sketch of the launch setups I was comparing follows it):
926 mpiexec -np2 /cl_solver_sym_sp_0_based_c.exe
927 mpiexec -np 2 ./cl_solver_sym_sp_0_based_c.exe
928 module avail
929 module lad mvapich2-2.1rc2-intel-16.0
930 module load mvapich2-2.1rc2-intel-16.0
931 mpiexec -np 2 ./cl_solver_sym_sp_0_based_c.exe
932 mpdboot
933 mpiexec -np 2 ./cl_solver_sym_sp_0_based_c.exe
934 export OMP_NUM_THREADS=1
935 mpiexec -np 12 ./cl_solver_sym_sp_0_based_c.exe
936 export OMP_NUM_THREADS=24
937 mpiexec -np 1 ./cl_solver_sym_sp_0_based_c.exe
938 mpirun -V
939 mpirun -np 1 ./cl_solver_sym_sp_0_based_c.exe
940 export OMP_NUM_THREADS=4
941 mpirun -np 6 ./cl_solver_sym_sp_0_based_c.exe
942 export OMP_NUM_THREADS=6
943 mpirun -np 4 ./cl_solver_sym_sp_0_based_c.exe
944 mpiexec -np 4 ./cl_solver_sym_sp_0_based_c.exe
945 mpiexec -np 1 ./cl_solver_sym_sp_0_based_c.exe
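For reference, here is a minimal sketch of the launch setups being compared above, assuming bash, a single 24-core node, and the .exe name from the history; module and launcher names will vary per system. cluster_sparse_solver is a hybrid MPI+OpenMP solver, which is consistent with the single-process run showing ~2400% in top while each of the 24 MPI processes only showed 100%.
# single process, all parallelism from OpenMP threads (the ~2400% CPU case)
export OMP_NUM_THREADS=24
./cl_solver_sym_sp_0_based_c.exe
# roughly what the "-np 24" run behaved like: 24 processes, ~one core each
export OMP_NUM_THREADS=1
mpiexec -np 24 ./cl_solver_sym_sp_0_based_c.exe
# hybrid setup usually intended on one 24-core node: one rank, 24 threads
export OMP_NUM_THREADS=24
mpirun -np 1 ./cl_solver_sym_sp_0_based_c.exe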
An example is worth a thousand words, so here are my example files!
cl_solver_sym_sp_0_based_c.c - Edit all the occurrences of *.txt to the path where the files are on your system.
https://www.dropbox.com/s/ndkzi9zojxuh1xo/cl_solver_sym_sp_0_based_c.c?dl=0
ia, ja, a, and b data in text files:
https://www.dropbox.com/s/3dkhbillyso03kc/ia_ja_a_b_data.tar.gz?dl=0
Curious what kind of performance improvement you get when running with MPI on 12, 24, 48, and 72 CPUs!
Hi Ferris.
That's really strange behaviour. Can I ask you to set msglvl to 1 and provide the output here?
Thanks,
Alex
I am attaching the output for the non-MPI run with msglvl=1. Today when I try to run with MPI I am getting errors like:
[hussaf@cforge200 cluster_sparse_solverc]$ mpiexec -np 12 ./cl_solver_sym_sp_0_based_c.exe > out.txt
[cforge200:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
Reordering completed ... rank 1 in job 2 cforge200_35175 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
[hussaf@cforge200 cluster_sparse_solverc]$ module load mvapich2-2.1rc2-intel-16.0
[hussaf@cforge200 cluster_sparse_solverc]$ mpirun -V
Intel(R) MPI Library for Linux* OS, Version 5.1.3 Build 20160120 (build id: 14053)
Copyright (C) 2003-2016, Intel Corporation. All rights reserved.
[hussaf@cforge200 cluster_sparse_solverc]$ mpiexec -V
Intel(R) MPI Library for Linux* OS, 64-bit applications, Version 5.1.3 Build 20160120
Copyright (C) 2003-2015 Intel Corporation. All rights reserved.
A little progress. If I do:
mpirun -np 1 ./cl_solver_sym_sp_0_based_c.exe
Then it completes in a similar time to the non-MPI run (./cl_solver_sym_sp_0_based_c.exe). It does appear to be using 24 threads.
Now I want to test this on two hosts. So my hostfile looks like:
cforge200:24
cforge201:24
When I execute:
mpirun -np 2 -hostfile /home/hussaf/intel/cluster_sparse_solverc/hostfile ./cl_solver_sym_sp_0_based_c.exe
It runs everything on one node, creating two MPI processes on cforge200. The solve time is the same as in the previous cases. How can I get it to run on two hosts using all 48 CPUs?
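One quick sanity check before timing the solver is to launch a trivial command through the same launcher and host file and see where the processes actually land. This is only a sketch, assuming the mpirun currently on the PATH accepts the same options:
# print the host each rank runs on; both cforge200 and cforge201 should appear
mpirun -np 2 -hostfile /home/hussaf/intel/cluster_sparse_solverc/hostfile hostname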
I made some more progress. Instead of -hostfile, I had to use -machinefile. So my command is:
mpirun -np 2 -env OMP_NUM_THREADS=24 -machinefile ./hostfile ./cl_solver_sym_sp_0_based_c.exe
I am attaching the output of this run with msglvl=1. As you can see, it takes nearly 8X longer to solve than when just run on one node with no MPI! Any suggestions for how to debug further?
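Since several MPI stacks are installed side by side, one thing worth checking is whether the launcher and the MPI library the executable was linked against actually match. A small sketch, assuming a Linux shell; the grep pattern is only illustrative:
# which launcher is first on the PATH, and which MPI does it report?
which mpirun mpiexec
mpirun -V
# which MPI shared libraries does the binary resolve to at run time?
ldd ./cl_solver_sym_sp_0_based_c.exe | grep -i mpi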
No idea. I have seen such results on clusters with a poor network, but you wrote that InfiniBand is used. I am currently away from my cluster, but I will download and run your test case tomorrow when I am back in the office and check the results on my side, OK?
Thanks,
Alex
I figured out my issue. I was using the mpirun from mvapich2-2.1rc2-intel-16.0. When I used the Intel MPI mpirun, the problem solved quickly. I am now facing a new issue where I can only solve on 1 or 2 compute nodes; if I try to use 3 or more compute nodes, I get an error. I will start a new thread on that to avoid confusion!
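For anyone who hits the same mismatch: with both MVAPICH2 and Intel MPI installed, it is easy to launch with one stack while the binary expects the other. A minimal sketch of putting the Intel MPI launcher first on the PATH; the mpivars.sh path is an assumption and depends on where Intel MPI 5.1 / Parallel Studio 2016 is installed on your system:
# set up the Intel MPI environment (path is an assumption, adjust to your install)
source /opt/intel/compilers_and_libraries_2016/linux/mpi/intel64/bin/mpivars.sh
which mpirun   # should now point into .../mpi/intel64/bin
# one rank per host, 24 OpenMP threads per rank, using the two-host machine file
mpirun -np 2 -machinefile ./hostfile -genv OMP_NUM_THREADS 24 ./cl_solver_sym_sp_0_based_c.exe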
Hi,
Is the matrix the same?
Thanks,
Alex
Yes, the matrix is the same. I will start a new forum post that describes the issue and how to reproduce it.
We see the problem with the current version of MKL 11.3.3, but it has been fixed in the next update (update 4), which we are planning to release soon. We will keep you updated when this release happens.
MKL 11.3 update 4 was released last week. You may check whether the problem is fixed on your side. Thanks.