Intel® MPI Library

MPI message rate scaling with number of peers

ingen23
Beginner

Hi.

I have some MPI code in which small messages (LEN = 1-128 bytes)
are sent from one host node to several peers. When I send the messages
one per peer per iteration, like this:

for (i = 0; i < ITER_NUM; ++i)
{
    /* one small message to every peer per iteration */
    for (k = 1; k < NODES; ++k)
    {
        MPI_Isend(S_BUF, LEN, MPI_CHAR,
                  k, 0, MPI_COMM_WORLD, &reqs[nreqs++]);
    }
    /* flush once WINDOW requests are outstanding, or on the last iteration */
    if (nreqs / WINDOW > 0 || i == ITER_NUM - 1)
    {
        MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
        nreqs = 0;
    }
}

the message rate drops from 11.5 million messages/sec on a 5-node configuration (1 host and 4 peers)
to 6.5 million messages/sec on a 17-node setup (1 host and 16 peers). When I swap the loop order like this:

for (k = 1; k < NODES; ++k)
{
    /* all ITER_NUM messages go to the same peer before moving on */
    for (i = 0; i < ITER_NUM; ++i)
    {
        MPI_Isend(S_BUF, LEN, MPI_CHAR,
                  k, 0, MPI_COMM_WORLD, &reqs[nreqs++]);
    }
    /* flush once WINDOW requests are outstanding, or after the last peer */
    if (nreqs / WINDOW > 0 || k == NODES - 1)
    {
        MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
        nreqs = 0;
    }
}
it works well (stable scaling at 11.5 million messages/sec).
ITER_NUM is about 100 000, and WINDOW caps the number of outstanding requests before MPI_Waitall; around the loops there are MPI_Barrier() and time-measurement calls.
Can someone explain what causes the message-rate degradation? Please do not
recommend message coalescing. What should I try in order to improve scaling?
The eager protocol is used (in MPI), and switching to the rendezvous protocol
did not help.
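
To be concrete, the measurement around the loops looks roughly like this (a sketch only; the exact barrier/timer placement and the rate formula are illustrative assumptions, and <mpi.h> and <stdio.h> are assumed to be included):

double t_start, t_stop, rate;

MPI_Barrier(MPI_COMM_WORLD);        /* synchronize all ranks before timing */
t_start = MPI_Wtime();

/* ... the Isend/Waitall loop shown above runs here ... */

MPI_Barrier(MPI_COMM_WORLD);        /* make sure all traffic has completed */
t_stop = MPI_Wtime();

/* the host sends ITER_NUM messages to each of the (NODES - 1) peers */
rate = (double)ITER_NUM * (NODES - 1) / (t_stop - t_start);
printf("message rate: %.2f million msg/s\n", rate / 1.0e6);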

Second question: I tried another test. All nodes form pairs ((0, 1), (2, 3), ... (n - 2, n - 1)) and simple send/recv is used. When the number of node pairs grows large (256 and higher), both the message rate and the bandwidth per pair degrade significantly. At the same time, one would expect a fat tree to scale nicely in this situation. Any ideas?
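
The pair test is structured roughly as follows (a sketch under my assumptions: even rank r exchanges with rank r + 1, and PAIR_LEN / PAIR_ITERS are placeholder values, not numbers from the measurement):

/* Sketch of the pair test: even rank r sends to rank r + 1. */
#include <mpi.h>

#define PAIR_LEN   128
#define PAIR_ITERS 100000

static char pair_buf[PAIR_LEN];

void pair_test(void)
{
    int rank, peer, i;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = (rank % 2 == 0) ? rank + 1 : rank - 1;   /* pairs (0,1), (2,3), ... */

    for (i = 0; i < PAIR_ITERS; ++i) {
        if (rank % 2 == 0)
            MPI_Send(pair_buf, PAIR_LEN, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        else
            MPI_Recv(pair_buf, PAIR_LEN, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
}
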
System config:

2 x Intel Xeon X5570
InfiniBand QDR (fat tree)
Intel MPI 4.0.1
Intel C++ compiler 12.0

James_T_Intel
Moderator
Hi ingen,

What type of receive are you using? Are you using the MPD process manager, or Hydra?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
ingen23
Beginner
Hi, James.

I am using MPI_Irecv() (but it worked the same way with MPI_Recv(), too).
Hydra (mpiexec.hydra) is used.
IDZ_A_Intel
Employee
Hi ingen,

I'm trying to get some additional information on why this behavior is occurring. I believe you are seeing two effects. The change at 17 ranks is likely due to running on multiple nodes, whereas 16 ranks should fit on a single node. This requires a change from shared-memory communication to InfiniBand.
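
One quick way to confirm which case applies (not something run in this thread, just a standard verification step) is to have every rank print the host it runs on:

/* Print which host each rank runs on, to confirm whether ranks
 * share a node (shared memory) or span nodes (InfiniBand). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char host[MPI_MAX_PROCESSOR_NAME];
    int rank, len;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);
    printf("rank %d runs on %s\n", rank, host);
    MPI_Finalize();
    return 0;
}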

The second effect is possibly due to the network layer. Opening and closing a network connection takes time, and these connections may not stay open between communications. By sending one message to a process at a time, you are frequently opening and closing connections. Sending all messages to one process allows the connection to remain open.
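
If connection setup is indeed the cost, one common mitigation, shown here only as a sketch for the host rank and not as something verified on this setup, is to warm up every connection before the timed loop:

/* Warm-up sketch (host rank): touch every peer once before timing so
 * that any connection-establishment cost stays outside the measured
 * loop. S_BUF and NODES mirror the snippets above. */
int k;
MPI_Request warm[NODES];

for (k = 1; k < NODES; ++k)
    MPI_Isend(S_BUF, 1, MPI_CHAR, k, 0, MPI_COMM_WORLD, &warm[k]);
MPI_Waitall(NODES - 1, &warm[1], MPI_STATUSES_IGNORE);

/* peers must post matching receives for the warm-up message */
MPI_Barrier(MPI_COMM_WORLD);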

I still need to look into the second issue with the node pairs.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
ingen23
Beginner
Thanks for your reply, James.

Sorry, I was not explicit about the MPI process mapping: each process is on a different node (I am sure of that). There are 16 processes in total, and the first process communicates with 15 peers.

The Intel MPI reference says that I_MPI_DYNAMIC_CONNECTION is set to "off" by default when using fewer than 64 MPI processes. So I think that is not the issue here, but I believe there is something to this idea about connection management. Currently I have no access to the cluster; when I can try it, I will post the results here.