I have a cluster of 32 machines. The first 25 machines are on one rack and the remaining 7 are on a second rack.
Each rack has its own 1 Gbps Ethernet switch.
I run an MPI application that uses all 32 machines (one process per host).
When I measure the network speed between the machines with a benchmark tool such as 'iperf', there is no problem: every point-to-point connection among the 32 machines reaches the full bandwidth.
In my application (MPI_Send/MPI_Recv), each MPI process sends several 4 MB messages to the other machines, so message size should not be the problem.
However, communication between the first 25 machines and the other 7 machines is very slow (roughly 10 to 20 MB/sec).
(Communication within the first 25 machines, and within the other 7, is fast: 100 to 110 MB/sec.)
What could be causing this? Is latency killing it?
What can I do to improve the performance?
Are there any suggested optimizations?