Intel® oneAPI Math Kernel Library

Pardiso solver much slower when using MPI?

Ferris_H_
Beginner

My cluster has 24 CPUs per node with 256 GB of RAM and InfiniBand. We have MPICH, MVAPICH2, Open MPI, and Intel MPI all installed.

I studied the example cl_solver_sym_sp_0_based_c.c in cluster_sparse_solverc/source. I compiled it using:

make libintel64 example=cl_solver_sym_sp_0_based_c

It runs fine. However, that matrix is too small to look at performance, so I modified the example to read a 3,000,000 × 3,000,000 matrix from a text file. When I run it without any MPI, using just:

./cl_solver_sym_sp_0_based_c

It solves quickly, factoring the matrix in 30 seconds. A 'top' command shows the CPU usage go to 2400%.

If I instead do mpirun or mpiexec -np 24 ./cl_solver_sym_sp_0_based_c, then the factorization takes nearly 10X longer! A "top" shows each process using 100% CPU.

I think I am doing something wrong with mpirun/mpiexec. I would expect it to give the same factorization times as just running the executable directly. I also tried playing around with the OMP_NUM_THREADS variable, but nothing seemed to improve the factorization times. Here is some output from my shell history (a small per-rank diagnostic sketch follows the listing):

  926  mpiexec -np2 /cl_solver_sym_sp_0_based_c.exe
  927  mpiexec -np 2 ./cl_solver_sym_sp_0_based_c.exe
  928  module avail
  929  module lad mvapich2-2.1rc2-intel-16.0
  930  module load mvapich2-2.1rc2-intel-16.0
  931  mpiexec -np 2 ./cl_solver_sym_sp_0_based_c.exe
  932  mpdboot
  933  mpiexec -np 2 ./cl_solver_sym_sp_0_based_c.exe
  934  export OMP_NUM_THREADS=1
  935  mpiexec -np 12 ./cl_solver_sym_sp_0_based_c.exe
  936  export OMP_NUM_THREADS=24
  937  mpiexec -np 1 ./cl_solver_sym_sp_0_based_c.exe
  938  mpirun -V
  939  mpirun -np 1 ./cl_solver_sym_sp_0_based_c.exe
  940  export OMP_NUM_THREADS=4
  941  mpirun -np 6 ./cl_solver_sym_sp_0_based_c.exe
  942  export OMP_NUM_THREADS=6
  943  mpirun -np 4 ./cl_solver_sym_sp_0_based_c.exe
  944  mpiexec -np 4 ./cl_solver_sym_sp_0_based_c.exe
  945  mpiexec -np 1 ./cl_solver_sym_sp_0_based_c.exe
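Before tuning OMP_NUM_THREADS further, it may help to confirm what each MPI rank actually sees. Below is a minimal diagnostic sketch (a hypothetical helper, rank_check.c, not part of the MKL examples; the mpiicc/-mkl build line is an assumption about an Intel MPI + MKL setup) that prints each rank's host and its MKL/OpenMP thread counts:

    /* rank_check.c - hypothetical diagnostic: report, for every MPI rank,
     * which host it runs on and how many threads MKL/OpenMP will use.
     * Build (assumption):  mpiicc -qopenmp -mkl rank_check.c -o rank_check
     */
    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>
    #include <mkl.h>

    int main(int argc, char **argv)
    {
        int rank = 0, size = 0, len = 0;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        /* One line per rank: where it runs and how many threads it can use. */
        printf("rank %d of %d on %s: mkl_get_max_threads()=%d, omp_get_max_threads()=%d\n",
               rank, size, host, mkl_get_max_threads(), omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }

Launching this the same way as the solver (for example, mpiexec -np 24 ./rank_check) should show whether the run ends up as 24 single-threaded ranks or a few ranks with 24 threads each, which is exactly the difference between the 100% and 2400% readings in top.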

Ferris_H_
Beginner

An example is worth a thousand words, so here are my example files!

cl_solver_sym_sp_0_based_c.c - Edit all the occurrences of *.txt to the path where the files are on your system

https://www.dropbox.com/s/ndkzi9zojxuh1xo/cl_solver_sym_sp_0_based_c.c?dl=0

ia, ja, a, and b data in text files:

https://www.dropbox.com/s/3dkhbillyso03kc/ia_ja_a_b_data.tar.gz?dl=0

Curious what kind of performance improvement you get when running with MPI on 12, 24, 48, and 72 CPUs!
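For anyone who would rather not download the full modified source: conceptually, loading the CSR arrays (ia, ja, a) and the right-hand side b from the text files only needs something along the lines of the sketch below. This is a simplified illustration, not the exact code in the attached file; it assumes whitespace-separated numbers and int (LP64) indices, so check the actual .c file above for the real paths and format.

    #include <stdio.h>
    #include <stdlib.h>

    /* Read 'count' whitespace-separated doubles from a text file (sketch). */
    static double *read_doubles(const char *path, long count)
    {
        FILE *f = fopen(path, "r");
        double *v = malloc(count * sizeof(*v));
        long i;
        if (!f || !v) { perror(path); exit(1); }
        for (i = 0; i < count; i++)
            if (fscanf(f, "%lf", &v[i]) != 1) { fprintf(stderr, "short read: %s\n", path); exit(1); }
        fclose(f);
        return v;
    }

    /* Same idea for the integer index arrays ia and ja. */
    static int *read_ints(const char *path, long count)
    {
        FILE *f = fopen(path, "r");
        int *v = malloc(count * sizeof(*v));
        long i;
        if (!f || !v) { perror(path); exit(1); }
        for (i = 0; i < count; i++)
            if (fscanf(f, "%d", &v[i]) != 1) { fprintf(stderr, "short read: %s\n", path); exit(1); }
        fclose(f);
        return v;
    }

With n the matrix order and nnz the number of stored nonzeros, the example would then use ia = read_ints("ia.txt", n + 1), ja = read_ints("ja.txt", nnz), a = read_doubles("a.txt", nnz), and b = read_doubles("b.txt", n) in place of the hard-coded arrays (the file names here are placeholders).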

Alexander_K_Intel2

Hi Ferris.

That's really strange behaviour. Can I ask you to set msglvl to 1 and provide the output here?
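For reference, msglvl is the message-level argument passed straight through to cluster_sparse_solver, so a fragment along these lines is all that should be needed (the variable names are assumed to match the stock example):

    MKL_INT msglvl = 1;  /* 1 = print solver statistics to stdout, 0 = silent */
    /* ... existing setup of pt, iparm, the matrix arrays, comm, etc. ... */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n,
                          a, ia, ja, perm, &nrhs, iparm, &msglvl,
                          b, x, &comm, &error);

With msglvl set to 1, the solver prints per-phase statistics (reordering, factorization, solve times and memory usage), which should make it easier to see where the MPI run loses time.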

Thanks,

Alex

Ferris_H_
Beginner

I am attaching the output for the non-MPI run with msglvl=1. Today, when I try to run with MPI, I am getting errors like:

[hussaf@cforge200 cluster_sparse_solverc]$ mpiexec -np 12 ./cl_solver_sym_sp_0_based_c.exe > out.txt
[cforge200:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)

Reordering completed ... rank 1 in job 2  cforge200_35175   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9

[hussaf@cforge200 cluster_sparse_solverc]$ module load mvapich2-2.1rc2-intel-16.0
[hussaf@cforge200 cluster_sparse_solverc]$ mpirun -V
Intel(R) MPI Library for Linux* OS, Version 5.1.3 Build 20160120 (build id: 14053)
Copyright (C) 2003-2016, Intel Corporation. All rights reserved.
[hussaf@cforge200 cluster_sparse_solverc]$ mpiexec -V
Intel(R) MPI Library for Linux* OS, 64-bit applications, Version 5.1.3  Build 20160120
Copyright (C) 2003-2015 Intel Corporation.  All rights reserved.

Ferris_H_
Beginner

A little progress. If I do:

mpirun -np 1 ./cl_solver_sym_sp_0_based_c.exe

Then it completes in a similar time to the non-MPI run (./cl_solver_sym_sp_0_based_c.exe). It does appear to be using 24 threads.

Now I want to test this on two hosts. So my hostfile looks like:

cforge200:24
cforge201:24

When I execute:

 mpirun -np 2 -hostfile /home/hussaf/intel/cluster_sparse_solverc/hostfile ./cl_solver_sym_sp_0_based_c.exe

It runs everything on one execution node, creating two MPI processes on cforge200. The solve time is the same as in the previous cases. How can I get it to run on two hosts using all 48 CPUs?

Ferris_H_
Beginner

I made some more progress. Instead of -hostfile, I had to use -machinefile. So my command is:

mpirun -np 2 -env OMP_NUM_THREADS=24 -machinefile ./hostfile ./cl_solver_sym_sp_0_based_c.exe

I am attaching the output of this run with msglvl=1. As you can see, the solve takes nearly 8X longer than when it is run on one node with no MPI! Any suggestions for how to debug further?

Alexander_K_Intel2

No idea. I have seen such results on clusters with a poor network, but you wrote that InfiniBand is used. I am currently away from my cluster, but I will download and run your test case tomorrow when I am back in the office and check the results on my side, OK?

Thanks,

Alex

Ferris_H_
Beginner

I figured out my issue. I was using the mpirun from mvapich2-2.1rc2-intel-16.0. When I used the Intel MPI mpirun, the problem solved quickly. I am now facing a new issue where I can only solve on 1 or 2 compute nodes; if I try to use 3 or more compute nodes, I get an error. I will start a new thread on that to avoid confusion!

Alexander_K_Intel2

Hi,

Is the matrix the same?

Thanks,

Alex

Ferris_H_
Beginner

Yes, the matrix is the same. I will start a new forum post that describes the issue and how to reproduce it.

Gennady_F_Intel
Moderator

We see the problem with the current version of MKL 11.3.3, but it has been fixed in the next release (11.3 update 4), which we are planning to release soon. We will keep you updated when that release happens.

Gennady_F_Intel
Moderator

MKL 11.3 update 4 was released last week. You may check whether the problem is fixed on your side. Thanks.
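If it is unclear which MKL build an application actually picks up after installing the update, a tiny standalone check like the one below can confirm it (a hypothetical mkl_ver.c, built with something like icc -mkl mkl_ver.c -o mkl_ver):

    #include <stdio.h>
    #include <mkl.h>

    int main(void)
    {
        char buf[198];                         /* MKL version strings fit comfortably here */
        mkl_get_version_string(buf, (int)sizeof(buf));
        printf("%s\n", buf);                   /* reports the MKL version/update level */
        return 0;
    }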
