We recently got a 16-node cluster with dual quad-core Xeon E5430 processors and 16 GB RAM per node, all connected with InfiniBand. I compiled the program we use with ifort 10.1.017, MKL 10.1.014, MVAPICH2, ScaLAPACK, and BLACS. When running the performance tests, I noticed that jobs distributed within a single node (intra-node) take longer to complete than jobs distributed across nodes (inter-node).
See the table below for some numbers:
My question is: is this result to be expected? And how can I increase the performance when 8 jobs are assigned to a node, i.e., 1 job/core? I am using the sequential MKL libraries.
Can you clarify what the Njobs-Mnodes notation means? Are your jobs just single-processor jobs (no MPI, or using only one core each)?
In any case, the behavior you are seeing is not unexpected. As you add more processes (jobs) to a single node, each one has to share cache and memory bandwidth. For codes that are memory bound, you will see little performance increase from the additional cores. My guess is that this is what you are seeing.
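To make the bandwidth-sharing argument concrete, here is a minimal sketch of a simple model in which each process gets the smaller of what one core can draw by itself and an equal share of the node's total memory bandwidth. The function names and the bandwidth figures (10 GB/s per node, 4 GB/s per core) are hypothetical placeholders, not measurements of your E5430 nodes, and the model ignores cache effects entirely.

```python
def per_process_bandwidth(n_procs, node_bw_gbs, core_bw_gbs):
    """Bandwidth one process sees when n_procs share a node equally.

    Assumption: a process is limited either by what a single core can
    draw (core_bw_gbs) or by its equal share of the node total.
    """
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return min(core_bw_gbs, node_bw_gbs / n_procs)


def relative_runtime(n_procs, node_bw_gbs, core_bw_gbs):
    """Runtime of a purely memory-bound job, relative to running alone."""
    alone = per_process_bandwidth(1, node_bw_gbs, core_bw_gbs)
    shared = per_process_bandwidth(n_procs, node_bw_gbs, core_bw_gbs)
    return alone / shared


if __name__ == "__main__":
    # Hypothetical placeholder values, NOT measured on any real system.
    NODE_BW_GBS = 10.0
    CORE_BW_GBS = 4.0
    for n in (1, 2, 4, 8):
        bw = per_process_bandwidth(n, NODE_BW_GBS, CORE_BW_GBS)
        slow = relative_runtime(n, NODE_BW_GBS, CORE_BW_GBS)
        print(f"{n} jobs/node: {bw:.2f} GB/s per job, {slow:.2f}x runtime")
```

With these placeholder values, each job runs at full speed up to 2 jobs per node, then takes 1.6 times as long at 4 jobs per node and 3.2 times as long at 8. Measuring the real per-node and per-core bandwidth (for example with a STREAM-style benchmark) would tell you how closely your code follows this picture.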
Depending on how your jobs are laid out (I don't understand the description of your data) you may be having conflicts between different jobs running on the same node.
It is best to try to dedicate nodes to a single job to eliminate resource contention. However, I wouldn't leave cores idle to accomplish this.