My cluster has 24 CPUs per node with 256 GB RAM and InfiniBand. We have MPICH, MVAPICH2, Open MPI, and Intel MPI all installed.
I studied the example cl_solver_sym_sp_0_based_c.c in cluster_sparse_solverc/source. I compiled it using:
make libintel64 example=cl_solver_sym_sp_0_based_c
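For reference, that make target boils down to a link line roughly like the one below (a sketch assuming the Intel compilers, Intel MPI, and LP64 interfaces; note that the BLACS library has to match the MPI you actually launch with):

mpiicc cl_solver_sym_sp_0_based_c.c -o cl_solver_sym_sp_0_based_c.exe \
    -I${MKLROOT}/include -L${MKLROOT}/lib/intel64 \
    -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core \
    -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm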
It runs fine. However, the matrix is too small to measure performance, so I modified the example to read in a 3,000,000 x 3,000,000 matrix from text files. When I run it without any MPI, using just:

./cl_solver_sym_sp_0_based_c.exe

it solves quickly and factors the matrix in 30 seconds. A 'top' command shows the CPU % go to 2400%, i.e., all 24 cores busy.
If I instead do mpirun or mpiexec -np 24 ./cl_solver_sym_sp_0_based_c, the factorization takes nearly 10X longer! A "top" shows each process using only 100% CPU.
I think I am doing something wrong with mpirun/mpiexec? I would expect it to give the same factorization times as running it directly. I also tried playing around with the OMP_NUM_THREADS variable, but nothing seemed to improve the factorization times. Here is some output of my history:
926 mpiexec -np2 /cl_solver_sym_sp_0_based_c.exe
927 mpiexec -np 2 ./cl_solver_sym_sp_0_based_c.exe
928 module avail
929 module lad mvapich2-2.1rc2-intel-16.0
930 module load mvapich2-2.1rc2-intel-16.0
931 mpiexec -np 2 ./cl_solver_sym_sp_0_based_c.exe
933 mpiexec -np 2 ./cl_solver_sym_sp_0_based_c.exe
934 export OMP_NUM_THREADS=1
935 mpiexec -np 12 ./cl_solver_sym_sp_0_based_c.exe
936 export OMP_NUM_THREADS=24
937 mpiexec -np 1 ./cl_solver_sym_sp_0_based_c.exe
938 mpirun -V
939 mpirun -np 1 ./cl_solver_sym_sp_0_based_c.exe
940 export OMP_NUM_THREADS=4
941 mpirun -np 6 ./cl_solver_sym_sp_0_based_c.exe
942 export OMP_NUM_THREADS=6
943 mpirun -np 4 ./cl_solver_sym_sp_0_based_c.exe
944 mpiexec -np 4 ./cl_solver_sym_sp_0_based_c.exe
945 mpiexec -np 1 ./cl_solver_sym_sp_0_based_c.exe
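To see what each of these runs is actually doing, a per-rank printout like the sketch below helps (this is illustrative, not from the attached file; mkl_get_max_threads() comes from mkl.h, and it should be called after MPI_Init()):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>
#include "mkl.h"

/* Print, for each MPI rank, which host it landed on and how many
   OpenMP/MKL threads it believes it can use. */
static void report_hybrid_layout(void)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("rank %d/%d on %s: omp threads=%d, mkl threads=%d\n",
           rank, size, host, omp_get_max_threads(), mkl_get_max_threads());
    fflush(stdout);
}

With -np 24 and OMP_NUM_THREADS unset, each rank may still report 24 threads, oversubscribing the node; and MVAPICH2 pins each rank to a single core by default (MV2_ENABLE_AFFINITY=1), which would confine all of a rank's threads to that one core and match the 100% CPU per process seen in top.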
An example is worth a thousand words, so here are my example files!
cl_solver_sym_sp_0_based_c.c - Edit all the occurrences of *.txt to the paths where the files live on your system (a minimal reader sketch follows below)
ia, ja, a, and b data in text files:
Curious what kind of performance improvement you get when running with MPI on 12, 24, 48, and 72 CPUs!
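If you would rather write your own reader than use my attachments, a minimal version could look like the sketch below (assumptions: one value per line, 0-based indexing so nnz = ia[n], single precision since the _sp example uses float, and placeholder file names; all error handling omitted):

#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"

/* Read n integers, one per line, into a newly allocated array. */
static MKL_INT *read_ints(const char *path, MKL_INT n)
{
    FILE *f = fopen(path, "r");
    MKL_INT *v = (MKL_INT *)malloc(sizeof(MKL_INT) * (size_t)n);
    long long t;
    for (MKL_INT i = 0; i < n; i++) { fscanf(f, "%lld", &t); v[i] = (MKL_INT)t; }
    fclose(f);
    return v;
}

/* Read n single-precision values, one per line. */
static float *read_floats(const char *path, MKL_INT n)
{
    FILE *f = fopen(path, "r");
    float *v = (float *)malloc(sizeof(float) * (size_t)n);
    for (MKL_INT i = 0; i < n; i++) fscanf(f, "%f", &v[i]);
    fclose(f);
    return v;
}

/* In main(), with n = 3000000 and 0-based CSR arrays: */
/*   MKL_INT *ia = read_ints("ia.txt", n + 1);   */
/*   MKL_INT *ja = read_ints("ja.txt", ia[n]);   */
/*   float   *a  = read_floats("a.txt", ia[n]);  */
/*   float   *b  = read_floats("b.txt", n);      */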
I am attaching the output for the non-MPI run with msglvl=1. Today when I try to run with MPI, I am getting errors like:
[hussaf@cforge200 cluster_sparse_solverc]$ mpiexec -np 12 ./cl_solver_sym_sp_0_based_c.exe > out.txt
[cforge200:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
Reordering completed ... rank 1 in job 2 cforge200_35175 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
[hussaf@cforge200 cluster_sparse_solverc]$ module load mvapich2-2.1rc2-intel-16.0
[hussaf@cforge200 cluster_sparse_solverc]$ mpirun -V
Intel(R) MPI Library for Linux* OS, Version 5.1.3 Build 20160120 (build id: 14053)
Copyright (C) 2003-2016, Intel Corporation. All rights reserved.
[hussaf@cforge200 cluster_sparse_solverc]$ mpiexec -V
Intel(R) MPI Library for Linux* OS, 64-bit applications, Version 5.1.3 Build 20160120
Copyright (C) 2003-2015 Intel Corporation. All rights reserved.
A little progress. If I do:
mpirun -np 1 ./cl_solver_sym_sp_0_based_c.exe
Then it completes in a similar time to the non-MPI run (./cl_solver_sym_sp_0_based_c.exe). It does appear to be using all 24 threads.
Now I want to test this on two hosts. So my hostfile looks like:
When I execute:
mpirun -np 2 -hostfile /home/hussaf/intel/cluster_sparse_solverc/hostfile ./cl_solver_sym_sp_0_based_c.exe
It runs everything on one execution node, creating two MPI processes on cforge200. The solve time is the same as in the previous cases. How can I get it to run on two hosts using all 48 CPUs?
I made some more progress. Instead of -hostfile, I had to use -machinefile. So my command is:
mpirun -np 2 -env OMP_NUM_THREADS=24 -machinefile ./hostfile ./cl_solver_sym_sp_0_based_c.exe
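In case it helps anyone else: the machinefile is just one host name per line, like this (cforge201 is a stand-in for your second node's actual name):

cforge200
cforge201

With -np 2 and this file, mpirun starts one rank per host, and -env OMP_NUM_THREADS=24 then gives each rank a full node's worth of threads, i.e. 48 cores in total.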
I am attaching the output of this run with msglvl=1. As you can see, the solve takes nearly 8X longer than when run on one node with no MPI! Any suggestions for how to debug further?
I'm not sure. I have seen results like this on clusters with poor networks, but you wrote that InfiniBand is used. I am currently away from my cluster, but I will download and run your test case tomorrow when I am back in the office and check the results on my side, OK?
I figured out my issue. I was using the mpirun from mvapich2-2.1rc2-intel-16.0. When I used the Intel MPI mpirun instead, the problem was solved and the run was fast. I am now facing a new issue where I can only solve on 1 or 2 compute nodes; if I try to use 3 or more compute nodes, I get an error. I will start a new thread on that to avoid confusion!
We see the problem with the current version of MKL 11.3.3, but this has been fixed in the next update (11.3.4), which we are planning to release soon. We will keep you updated when this release happens.