My cluster has 24 CPUs per node with 256 GB RAM and InfiniBand. We have MPICH, MVAPICH2, Open MPI, and Intel MPI all installed.
I studied the example cl_solver_sym_sp_0_based_c.c in cluster_sparse_solverc/source. I compiled it using:
make libintel64 example=cl_solver_sym_sp_0_based_c
It runs fine. However, the matrix is too small to look at performance, so I modified the example to read a 3,000,000 x 3,000,000 matrix from a text file. When I run it without any MPI, using just:
./cl_solver_sym_sp_0_based_c
It solves quickly, factoring the matrix in 30 seconds. A 'top' command shows the CPU usage go to 2400%.
If I run mpirun or mpiexec -np 24 ./cl_solver_sym_sp_0_based_c, then the factorization takes nearly 10X longer! A 'top' shows each process using 100% CPU.
I think I am doing something wrong with mpirun/mpiexec? I would expect it to give the same factorization times as just running it directly. I also tried playing around with the OMP_NUM_THREADS variable, but nothing seemed to improve the factorization times. Here is some output of my shell history (a sketch of the launch setups I was comparing follows it):
926 mpiexec -np2 /cl_solver_sym_sp_0_based_c.exe
927 mpiexec -np 2 ./cl_solver_sym_sp_0_based_c.exe
928 module avail
929 module lad mvapich2-2.1rc2-intel-16.0
930 module load mvapich2-2.1rc2-intel-16.0
931 mpiexec -np 2 ./cl_solver_sym_sp_0_based_c.exe
932 mpdboot
933 mpiexec -np 2 ./cl_solver_sym_sp_0_based_c.exe
934 export OMP_NUM_THREADS=1
935 mpiexec -np 12 ./cl_solver_sym_sp_0_based_c.exe
936 export OMP_NUM_THREADS=24
937 mpiexec -np 1 ./cl_solver_sym_sp_0_based_c.exe
938 mpirun -V
939 mpirun -np 1 ./cl_solver_sym_sp_0_based_c.exe
940 export OMP_NUM_THREADS=4
941 mpirun -np 6 ./cl_solver_sym_sp_0_based_c.exe
942 export OMP_NUM_THREADS=6
943 mpirun -np 4 ./cl_solver_sym_sp_0_based_c.exe
944 mpiexec -np 4 ./cl_solver_sym_sp_0_based_c.exe
945 mpiexec -np 1 ./cl_solver_sym_sp_0_based_c.exe
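For reference, here is a minimal sketch of the launch setups being compared above, assuming bash, a single 24-core node, and the .exe name from the history; module and launcher names will vary per system. cluster_sparse_solver is a hybrid MPI+OpenMP solver, which is consistent with the single-process run showing ~2400% in top while each of the 24 MPI processes only showed 100%.
# single process, all parallelism from OpenMP threads (the ~2400% CPU case)
export OMP_NUM_THREADS=24
./cl_solver_sym_sp_0_based_c.exe
# roughly what the "-np 24" run behaved like: 24 processes, ~one core each
export OMP_NUM_THREADS=1
mpiexec -np 24 ./cl_solver_sym_sp_0_based_c.exe
# hybrid setup usually intended on one 24-core node: one rank, 24 threads
export OMP_NUM_THREADS=24
mpirun -np 1 ./cl_solver_sym_sp_0_based_c.exe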
An example is worth a thousand words, so here are my example files!
cl_solver_sym_sp_0_based_c.c - Edit all the occurrences of *.txt to the path where the files are on your system.
https://www.dropbox.com/s/ndkzi9zojxuh1xo/cl_solver_sym_sp_0_based_c.c?dl=0
ia, ja, a, and b data in text files:
https://www.dropbox.com/s/3dkhbillyso03kc/ia_ja_a_b_data.tar.gz?dl=0
Curious what kind of performance improvement you get when running with MPI on 12, 24, 48, and 72 CPUs!
Hi Ferris.
That's really strange behaviour. Can I ask you to set msglvl to 1 and provide the output here?
Thanks,
Alex
I am attaching the output for the non-MPI run with msglvl=1. Today when I try to run with MPI I am getting errors like:
[hussaf@cforge200 cluster_sparse_solverc]$ mpiexec -np 12 ./cl_solver_sym_sp_0_based_c.exe > out.txt
[cforge200:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
Reordering completed ... rank 1 in job 2 cforge200_35175 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
[hussaf@cforge200 cluster_sparse_solverc]$ module load mvapich2-2.1rc2-intel-16.0
[hussaf@cforge200 cluster_sparse_solverc]$ mpirun -V
Intel(R) MPI Library for Linux* OS, Version 5.1.3 Build 20160120 (build id: 14053)
Copyright (C) 2003-2016, Intel Corporation. All rights reserved.
[hussaf@cforge200 cluster_sparse_solverc]$ mpiexec -V
Intel(R) MPI Library for Linux* OS, 64-bit applications, Version 5.1.3 Build 20160120
Copyright (C) 2003-2015 Intel Corporation. All rights reserved.
A little progress. If I do:
mpirun -np 1 ./cl_solver_sym_sp_0_based_c.exe
Then it completes in a similar time to the non-MPI run (./cl_solver_sym_sp_0_based_c.exe). It does appear to be using 24 threads.
Now I want to test this on two hosts. So my hostfile looks like:
cforge200:24
cforge201:24
When I execute:
mpirun -np 2 -hostfile /home/hussaf/intel/cluster_sparse_solverc/hostfile ./cl_solver_sym_sp_0_based_c.exe
It runs everything on one node, creating two MPI processes on cforge200. The solve time is the same as in the previous cases. How can I get it to run on two hosts using all 48 CPUs?
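One quick sanity check before timing the solver is to launch a trivial command through the same launcher and host file and see where the processes actually land. This is only a sketch, assuming the mpirun currently on the PATH accepts the same options:
# print the host each rank runs on; both cforge200 and cforge201 should appear
mpirun -np 2 -hostfile /home/hussaf/intel/cluster_sparse_solverc/hostfile hostname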
I made some more progress. Instead of -hostfile, I had to use -machinefile. So my command is:
mpirun -np 2 -env OMP_NUM_THREADS=24 -machinefile ./hostfile ./cl_solver_sym_sp_0_based_c.exe
I am attaching the output of this run with msglvl=1. As you can see, it takes nearly 8X longer to solve than when just run on one node with no MPI! Any suggestions for how to debug further?
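Since several MPI stacks are installed side by side, one thing worth checking is whether the launcher and the MPI library the executable was linked against actually match. A small sketch, assuming a Linux shell; the grep pattern is only illustrative:
# which launcher is first on the PATH, and which MPI does it report?
which mpirun mpiexec
mpirun -V
# which MPI shared libraries does the binary resolve to at run time?
ldd ./cl_solver_sym_sp_0_based_c.exe | grep -i mpi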
No idea. I have seen such results on clusters with a poor network, but you wrote that InfiniBand is used. I am currently away from my cluster, but I will download and run your test case tomorrow when I am back in the office and check the results on my side, OK?
Thanks,
Alex
I figured out my issue. I was using the mpirun from mvapich2-2.1rc2-intel-16.0. When I used the Intel MPI mpirun, the problem solved quickly. I am now facing a new issue where I can only solve on 1 or 2 compute nodes; if I try to use 3 or more compute nodes, I get an error. I will start a new thread on that to avoid confusion!
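For anyone who hits the same mismatch: with both MVAPICH2 and Intel MPI installed, it is easy to launch with one stack while the binary expects the other. A minimal sketch of putting the Intel MPI launcher first on the PATH; the mpivars.sh path is an assumption and depends on where Intel MPI 5.1 / Parallel Studio 2016 is installed on your system:
# set up the Intel MPI environment (path is an assumption, adjust to your install)
source /opt/intel/compilers_and_libraries_2016/linux/mpi/intel64/bin/mpivars.sh
which mpirun   # should now point into .../mpi/intel64/bin
# one rank per host, 24 OpenMP threads per rank, using the two-host machine file
mpirun -np 2 -machinefile ./hostfile -genv OMP_NUM_THREADS 24 ./cl_solver_sym_sp_0_based_c.exe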
Hi,
Is the matrix the same?
Thanks,
Alex
Yes, the matrix is the same. I will start a new forum post that describes the issue and how to reproduce it.
We see the problem with the current version of MKL 11.3.3, but it has been fixed in the next update (update 4), which we are planning to release soon. We will keep you updated when this release happens.
MKL 11.3 update 4 was released last week. You may check whether the problem is fixed on your side. Thanks.