Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

performance tuning on 16-node cluster

mbkumar
Beginner
Hi,
We recently got a 16-node cluster with dual quad-core Xeon E5430 processors and 16 GB RAM per node, all connected with InfiniBand. I compiled the program we use with ifort 10.1.017, MKL 10.1.014, MVAPICH2, ScaLAPACK, and BLACS. When running performance tests, I noticed that intra-node job distribution takes more time to complete than inter-node job distribution.
See the table below for some numbers:

8jobs-1node 3:22:01
8jobs-2nodes 2:13:55
8jobs-4nodes 1:44:02
8jobs-8nodes 1:42:46

16jobs-2nodes 2:08:25
16jobs-4nodes 1:31:56
16jobs-8nodes 1:16:57
16jobs-16nodes 1:16:33

My question is: is this result to be expected? And how can I increase performance when 8 jobs are assigned to one node, i.e., 1 job per core? I am using the sequential MKL libraries.
2 Replies
crtierney42
New Contributor I

Can you clarify what the Njobs-Mnodes notation means? Are your jobs just single-processor jobs (no MPI, or each one using only one core)?

In any case, the behavior you are seeing is not unexpected. As you add more processes (jobs) to a single node, each one has to share cache and memory bandwidth. For codes that are memory bound, you will see little performance increase from the additional cores. My guess is that this is what you are seeing.
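
If you want to confirm that memory bandwidth is the limit in the 8-jobs-per-node case, one simple check is to run a bandwidth benchmark such as STREAM with 1, 2, 4, and 8 concurrent copies on a single node and watch how the per-copy rate falls off. A rough sketch, assuming the usual Fortran STREAM source (stream.f) and an ifort build; adjust the file name and flags to whatever you actually use:

    ifort -O3 stream.f -o stream            # build a serial STREAM binary

    for n in 1 2 4 8; do
        echo "== $n concurrent copies =="
        for i in `seq $n`; do
            ./stream > stream.$n.$i.out &   # start n copies at once
        done
        wait
        grep Triad stream.$n.*.out          # per-copy Triad bandwidth
    done

If the aggregate Triad number stops growing well before 8 copies, the cores are starving for memory bandwidth, and spreading the same 8 processes over more nodes (as your table shows) is the expected remedy.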

Depending on how your jobs are laid out (I don't understand the description of your data), you may be seeing conflicts between different jobs running on the same node.

It is best to try to dedicate nodes to a single job to eliminate resource contention. However, I wouldn't leave cores idle to accomplish this.

TimP
Honored Contributor III
You'd likely find more people on this forum familiar with Intel MPI than with mvapich2.
For the best performance with multiple MPI processes on a single node, you should optimize the mapping of processes to cores: keep as much inter-process communication as possible between processes that share a cache, minimize communication between processes on different sockets, and use shared-memory communication among the processes on the same node. If the total memory requirement of the processes exceeds the available RAM or cache, you will take a per-process performance loss when using all cores rather than one core per L2 cache.
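
Concretely, on these E5430 (Harpertown) nodes each quad-core package is two dual-core dies, and each die shares one L2 cache between its two cores; the kernel's logical CPU numbering is not necessarily contiguous per socket. Before choosing a mapping it is worth dumping the layout. A minimal check on a standard Linux node (the sysfs cache files may be missing on older kernels):

    # socket (physical id) and core id for every logical CPU
    grep -E "processor|physical id|core id" /proc/cpuinfo

    # which logical CPUs share each L2 cache, where the kernel exposes it
    cat /sys/devices/system/cpu/cpu*/cache/index2/shared_cpu_map
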
The version of mvapich1 I used most recently set its own process affinities; earlier versions let the user set them with taskset. Intel MPI and HP MPI have built-in affinity and communication-device options. I don't know whether mvapich2 implements shared-memory communication; earlier mpich versions did not normally support it.
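
If your mvapich2 build does not pin processes itself, one common workaround is a small wrapper script that calls taskset for each rank before the executable starts. This is only a sketch: the local-rank environment variable (MV2_COMM_WORLD_LOCAL_RANK below) and the rank-to-core mapping are assumptions you have to verify against your installation and the cpuinfo layout above.

    #!/bin/sh
    # pin_rank.sh - hypothetical per-rank pinning wrapper for an 8-core node.
    # MV2_COMM_WORLD_LOCAL_RANK is assumed to hold this process's rank on the
    # local node; check which variable your mvapich2 version actually exports.
    rank=${MV2_COMM_WORLD_LOCAL_RANK:-0}
    # Map local rank i straight to core i; renumber if /proc/cpuinfo shows that
    # consecutive core ids do not share a socket or L2.
    exec taskset -c "$rank" "$@"

Launch with the wrapper in front of the binary, e.g. mpirun -np 8 ./pin_rank.sh ./your_app. With Intel MPI no wrapper is needed: process pinning and the shared-memory/RDMA device are selected through environment variables (I_MPI_PIN and I_MPI_DEVICE, depending on the version; check the reference manual for the exact names in your release).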