My cluster has 24 CPUs per node, 256 GB of RAM, and InfiniBand. We have MPICH, MVAPICH2, Open MPI, and Intel MPI all installed.

I studied the example cl_solver_sym_sp_0_based_c.c in cluster_sparse_solverc/source. I compiled it using:

make libintel64 example=cl_solver_sym_sp_0_based_c

It runs fine. However, the matrix is too small to evaluate performance, so I modified the example to read a 3,000,000 × 3,000,000 matrix from a text file. When I run it without any MPI, using just:

./cl_solver_sym_sp_0_based_c

It solves quickly and factors the matrix in 30 seconds; top shows CPU usage reach 2400%.

If I instead run mpirun or mpiexec -np 24 ./cl_solver_sym_sp_0_based_c, the factorization takes nearly 10X longer! top shows each process using 100% CPU.

Am I doing something wrong with mpirun/mpiexec? I would expect it to give the same factorization times as running the executable directly. I also tried playing with the OMP_NUM_THREADS variable, but nothing seemed to improve the factorization times. Here is some of my shell history:

926 mpiexec -np2 /cl_solver_sym_sp_0_based_c.exe

927 mpiexec -np 2 ./cl_solver_sym_sp_0_based_c.exe

928 module avail

929 module lad mvapich2-2.1rc2-intel-16.0

930 module load mvapich2-2.1rc2-intel-16.0

931 mpiexec -np 2 ./cl_solver_sym_sp_0_based_c.exe

932 mpdboot

933 mpiexec -np 2 ./cl_solver_sym_sp_0_based_c.exe

934 export OMP_NUM_THREADS=1

935 mpiexec -np 12 ./cl_solver_sym_sp_0_based_c.exe

936 export OMP_NUM_THREADS=24

937 mpiexec -np 1 ./cl_solver_sym_sp_0_based_c.exe

938 mpirun -V

939 mpirun -np 1 ./cl_solver_sym_sp_0_based_c.exe

940 export OMP_NUM_THREADS=4

941 mpirun -np 6 ./cl_solver_sym_sp_0_based_c.exe

942 export OMP_NUM_THREADS=6

943 mpirun -np 4 ./cl_solver_sym_sp_0_based_c.exe

944 mpiexec -np 4 ./cl_solver_sym_sp_0_based_c.exe

945 mpiexec -np 1 ./cl_solver_sym_sp_0_based_c.exe
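
A possible explanation for the slowdown (an assumption on my part, not verified on this setup): Cluster Sparse Solver is a hybrid MPI+OpenMP code, so the usual launch is a small number of ranks (e.g. one per node) with OMP_NUM_THREADS set to the cores available to each rank, rather than one single-threaded rank per core. On a single node, that would be a sketch like:

```shell
# Single node: one MPI rank, 24 OpenMP threads per rank,
# which should match the behavior of the plain non-MPI run
export OMP_NUM_THREADS=24
mpirun -np 1 ./cl_solver_sym_sp_0_based_c.exe
```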

An example is worth a thousand words, so here are my example files!

cl_solver_sym_sp_0_based_c.c - edit all the occurrences of *.txt to the paths where the files live on your system:

https://www.dropbox.com/s/ndkzi9zojxuh1xo/cl_solver_sym_sp_0_based_c.c?dl=0

ia, ja, a, and b data in text files:

https://www.dropbox.com/s/3dkhbillyso03kc/ia_ja_a_b_data.tar.gz?dl=0

I'm curious what kind of performance improvement you get when running with MPI on 12, 24, 48, and 72 CPUs!
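
For anyone adapting the stock example the same way, the file-reading part can be sketched like this (a minimal sketch only; the exact layout of the ia/ja/a/b text files in the archive above may differ — this assumes plain whitespace-separated numbers with the array lengths known to the caller):

```c
/* Minimal sketch: load CSR arrays from whitespace-separated text files.
 * Assumes the caller knows each array's length; the real files in the
 * archive above may use a different layout. */
#include <stdio.h>

/* Returns 0 on success, -1 on open/parse failure. */
static int read_int_array(const char *path, int *dst, int n) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    for (int i = 0; i < n; i++)
        if (fscanf(f, "%d", &dst[i]) != 1) { fclose(f); return -1; }
    fclose(f);
    return 0;
}

/* Same, but for the single-precision values the _sp_ example uses. */
static int read_float_array(const char *path, float *dst, int n) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    for (int i = 0; i < n; i++)
        if (fscanf(f, "%f", &dst[i]) != 1) { fclose(f); return -1; }
    fclose(f);
    return 0;
}
```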

Hi Ferris.

That's really strange behaviour. Can I ask you to set msglvl to 1 and post the output here?

Thanks,

Alex

I am attaching the output of the non-MPI run with msglvl=1. Today, when I try to run with MPI, I get errors like:

[hussaf@cforge200 cluster_sparse_solverc]$ mpiexec -np 12 ./cl_solver_sym_sp_0_based_c.exe > out.txt

[cforge200:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)

Reordering completed ... rank 1 in job 2 cforge200_35175 caused collective abort of all ranks

exit status of rank 1: killed by signal 9

[hussaf@cforge200 cluster_sparse_solverc]$ module load mvapich2-2.1rc2-intel-16.0

[hussaf@cforge200 cluster_sparse_solverc]$ mpirun -V

Intel(R) MPI Library for Linux* OS, Version 5.1.3 Build 20160120 (build id: 14053)

Copyright (C) 2003-2016, Intel Corporation. All rights reserved.

[hussaf@cforge200 cluster_sparse_solverc]$ mpiexec -V

Intel(R) MPI Library for Linux* OS, 64-bit applications, Version 5.1.3 Build 20160120

Copyright (C) 2003-2015 Intel Corporation. All rights reserved.

A little progress. If I do:

mpirun -np 1 ./cl_solver_sym_sp_0_based_c.exe

Then it completes in a similar time to the non-MPI run (./cl_solver_sym_sp_0_based_c.exe). It does appear to be using 24 threads.

Now I want to test this on two hosts. So my hostfile looks like:

cforge200:24

cforge201:24

When I execute:

mpirun -np 2 -hostfile /home/hussaf/intel/cluster_sparse_solverc/hostfile ./cl_solver_sym_sp_0_based_c.exe

It runs everything on one node and creates both MPI processes on cforge200. The solve time is the same as in the previous cases. How can I get it to run on two hosts using all 48 CPUs?

I made some more progress. Instead of -hostfile, I had to use -machinefile. So my command is:

mpirun -np 2 -env OMP_NUM_THREADS=24 -machinefile ./hostfile ./cl_solver_sym_sp_0_based_c.exe

I am attaching the output of this run with msglvl=1. As you can see, the solve takes nearly 8X longer than when run on just one node with no MPI! Any suggestions for how to debug further?
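
One thing worth ruling out in a slow multi-node hybrid run (a suggestion, not a confirmed diagnosis for this case): if the launcher pins each MPI rank to a single core, all 24 OpenMP threads of that rank end up time-sharing one core. With Intel MPI, the pinning domain can be sized to the OpenMP team:

```shell
# Give each rank a pinning domain sized to its OpenMP team (Intel MPI only)
export I_MPI_PIN_DOMAIN=omp
mpirun -np 2 -env OMP_NUM_THREADS=24 -machinefile ./hostfile ./cl_solver_sym_sp_0_based_c.exe
```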

No idea. I have seen results like this on clusters with a poor network, but you wrote that InfiniBand is used. I am currently away from my cluster, but I will download and run your test case tomorrow when I am back in the office and check the results on my side, OK?

Thanks,

Alex

I figured out my issue: I was using the mpirun from mvapich2-2.1rc2-intel-16.0. When I used the Intel MPI mpirun, the problem solved quickly. I am now facing a new issue where I can only solve on 1 or 2 compute nodes; if I try to use 3 or more, I get an error. I will start a new thread on that to avoid confusion!
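
For future readers hitting the same mismatch, a quick way to confirm that the launcher on PATH matches the MPI library the binary was actually linked against:

```shell
# Which launcher is first on PATH, and which version it reports
which mpirun
mpirun -V
# Which MPI runtime the binary was linked against
ldd ./cl_solver_sym_sp_0_based_c.exe | grep -i mpi
```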

Hi,

Is the matrix the same?

Thanks,

Alex

Yes, the matrix is the same. I will start a new forum post that describes the issue and how to reproduce it.

We see the problem with the current version of MKL, 11.3.3, but it has been fixed in the next release, 11.3 update 4, which we are planning to ship soon. We will keep you updated when this release happens.

MKL 11.3 update 4 was released last week. You may want to check whether the problem is fixed on your side. Thanks.
