Intel® oneAPI Math Kernel Library

Pardiso solver much slower when using MPI?

Ferris_H_
Beginner

My cluster has 24 CPUs per node, 256 GB of RAM, and InfiniBand. We have MPICH, MVAPICH2, Open MPI, and Intel MPI all installed.

I studied the example cl_solver_sym_sp_0_based_c.c in cluster_sparse_solverc/source. I compiled it using:

make libintel64 example=cl_solver_sym_sp_0_based_c

It runs fine. However, the matrix is too small to gauge performance, so I modified the example to read a 3,000,000 x 3,000,000 matrix from a text file. When I run it without any MPI, using just:

./cl_solver_sym_sp_0_based_c

It solves quickly, factoring the matrix in about 30 seconds. A 'top' command shows the process's CPU usage reach 2400%.

If I instead run mpirun or mpiexec -np 24 ./cl_solver_sym_sp_0_based_c, the factorization takes nearly 10x longer! A 'top' shows each process using only 100% CPU.

I think I am doing something wrong with mpirun/mpiexec. I would expect roughly the same factorization time as running the executable directly. I also tried playing with the OMP_NUM_THREADS variable, but nothing improved the factorization times. Here is some of my shell history, followed by a sketch of the layout I was expecting:

  926  mpiexec -np2 /cl_solver_sym_sp_0_based_c.exe
  927  mpiexec -np 2 ./cl_solver_sym_sp_0_based_c.exe
  928  module avail
  929  module lad mvapich2-2.1rc2-intel-16.0
  930  module load mvapich2-2.1rc2-intel-16.0
  931  mpiexec -np 2 ./cl_solver_sym_sp_0_based_c.exe
  932  mpdboot
  933  mpiexec -np 2 ./cl_solver_sym_sp_0_based_c.exe
  934  export OMP_NUM_THREADS=1
  935  mpiexec -np 12 ./cl_solver_sym_sp_0_based_c.exe
  936  export OMP_NUM_THREADS=24
  937  mpiexec -np 1 ./cl_solver_sym_sp_0_based_c.exe
  938  mpirun -V
  939  mpirun -np 1 ./cl_solver_sym_sp_0_based_c.exe
  940  export OMP_NUM_THREADS=4
  941  mpirun -np 6 ./cl_solver_sym_sp_0_based_c.exe
  942  export OMP_NUM_THREADS=6
  943  mpirun -np 4 ./cl_solver_sym_sp_0_based_c.exe
  944  mpiexec -np 4 ./cl_solver_sym_sp_0_based_c.exe
  945  mpiexec -np 1 ./cl_solver_sym_sp_0_based_c.exe
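
For clarity, here is a sketch of the layout I was expecting to perform well, as opposed to what -np 24 actually gives me. This is only my understanding of a hybrid MPI + OpenMP launch, not a verified recipe, and the exact behaviour may differ with the MPI in use.

export OMP_NUM_THREADS=24                        # let MKL/OpenMP use all 24 cores of the node
mpirun -np 1 ./cl_solver_sym_sp_0_based_c.exe    # 1 rank on the node, the threads do the work

# whereas with mpiexec -np 24 I see 24 ranks, each sitting at about 100% CPU in top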

11 Replies
Ferris_H_
Beginner

An example is worth a thousand words, so here are my example files!

cl_solver_sym_sp_0_based_c.c - edit all the occurrences of *.txt to the paths where the files are on your system

https://www.dropbox.com/s/ndkzi9zojxuh1xo/cl_solver_sym_sp_0_based_c.c?dl=0

ia, ja, a, and b data in text files:

https://www.dropbox.com/s/3dkhbillyso03kc/ia_ja_a_b_data.tar.gz?dl=0

I am curious what kind of performance improvement you get when running with MPI on 12, 24, 48, and 72 CPUs!

Alexander_K_Intel2

Hi Ferris.

That's really strange behaviour. Can I ask you to set msglvl to 1 and provide the output here?

Thanks,

Alex

Ferris_H_
Beginner

I am attaching the output for the non-MPI run with msglvl=1. Today, when I try to run with MPI, I am getting errors like:

[hussaf@cforge200 cluster_sparse_solverc]$ mpiexec -np 12 ./cl_solver_sym_sp_0_based_c.exe > out.txt
[cforge200:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)

Reordering completed ... rank 1 in job 2  cforge200_35175   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9

[hussaf@cforge200 cluster_sparse_solverc]$ module load mvapich2-2.1rc2-intel-16.0
[hussaf@cforge200 cluster_sparse_solverc]$ mpirun -V
Intel(R) MPI Library for Linux* OS, Version 5.1.3 Build 20160120 (build id: 14053)
Copyright (C) 2003-2016, Intel Corporation. All rights reserved.
[hussaf@cforge200 cluster_sparse_solverc]$ mpiexec -V
Intel(R) MPI Library for Linux* OS, 64-bit applications, Version 5.1.3  Build 20160120
Copyright (C) 2003-2015 Intel Corporation.  All rights reserved.

Ferris_H_
Beginner

A little progress. If I do:

mpirun -np 1 ./cl_solver_sym_sp_0_based_c.exe

Then it completes in a similar time to the non-MPI run (./cl_solver_sym_sp_0_based_c.exe). It does appear to be using 24 threads.

Now I want to test this on two hosts. So my hostfile looks like:

cforge200:24
cforge201:24

When I execute:

 mpirun -np 2 -hostfile /home/hussaf/intel/cluster_sparse_solverc/hostfile ./cl_solver_sym_sp_0_based_c.exe

It runs everything on one execution node, creating two MPI processes on cforge200. The solve time is the same as in the previous cases. How can I get it to run on two hosts using all 48 CPUs? Below is a sketch of the layout I am trying to get.
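
For what it is worth, the layout I am after is one rank per host with 24 OpenMP threads each. I am assuming here that the Intel MPI (Hydra) launcher accepts -ppn to limit ranks per node and -f for the host list; those flags are my assumption, not something I have verified on this cluster.

export OMP_NUM_THREADS=24    # fill each node with threads
# 2 ranks total, at most 1 rank per host, hosts read from the file
mpirun -np 2 -ppn 1 -f /home/hussaf/intel/cluster_sparse_solverc/hostfile ./cl_solver_sym_sp_0_based_c.exe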

Ferris_H_
Beginner

I made some more progress. Instead of -hostfile, I had to use -machinefile. So my command is:

mpirun -np 2 -env OMP_NUM_THREADS=24 -machinefile ./hostfile ./cl_solver_sym_sp_0_based_c.exe

I am attaching the output of this run with msglvl=1. As you can see, it takes nearly 8x longer to solve than when run on a single node with no MPI! Any suggestions for how to debug further? One thing I plan to check is sketched below.
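
One thing I still plan to check, assuming the Intel MPI runtime really is the one launching the job, is its I_MPI_DEBUG and I_MPI_FABRICS settings; they should show whether the two ranks are talking over InfiniBand or silently falling back to something slower. Treat the exact fabric value as an assumption for this MPI version.

export I_MPI_DEBUG=5           # print rank placement, pinning, and fabric selection
export I_MPI_FABRICS=shm:dapl  # shared memory within a node, DAPL (InfiniBand) between nodes
mpirun -np 2 -env OMP_NUM_THREADS=24 -machinefile ./hostfile ./cl_solver_sym_sp_0_based_c.exe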

Alexander_K_Intel2

No idea. I have seen results like this on clusters with a poor network, but you wrote that InfiniBand is used. I am currently away from my cluster, but I will download and run your test case tomorrow, when I am back in the office, and check the results on my side, OK?

Thanks,

Alex

Ferris_H_
Beginner

I figured out my issue. I was using the mpirun from mvapich2-2.1rc2-intel-16.0. When I used the Intel MPI mpirun, the problem solved quickly. I am now facing a new issue where I can only solve on 1 or 2 compute nodes; if I try to use 3 or more compute nodes, I get an error. I will start a new thread on that to avoid confusion! A sketch of how I now select the Intel launcher is below.
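
For reference, this is roughly how I now make sure the Intel MPI launcher is the one being picked up. The mpivars.sh path below is just where it happens to live on my system, so treat it as an assumption and adjust it to your install.

module unload mvapich2-2.1rc2-intel-16.0   # drop MVAPICH2 so its mpirun is not first in PATH
# load the Intel MPI environment (the install path varies per system)
source /opt/intel/compilers_and_libraries_2016/linux/mpi/intel64/bin/mpivars.sh
which mpirun    # confirm which launcher will actually run
mpirun -V       # should report the Intel(R) MPI Library version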

Alexander_K_Intel2

Hi,

Is the matrix the same?

Thanks,

Alex

Ferris_H_
Beginner

Yes, the matrix is the same. I will start a new forum post that describes the issue and how to reproduce it.

Gennady_F_Intel
Moderator

We see the problem with the current version of MKL 11.3.3, but it has been fixed for the next update, 11.3 update 4, which we are planning to release soon. We will keep you updated when that release happens.

Gennady_F_Intel
Moderator

MKL 11.3 update 4 was released last week. You may try to check whether the problem is fixed on your side. Thanks.
