(I attached the performance measurement program, written in C++.)
I am experiencing a performance issue with bi-directional MPI_Send/MPI_Recv operations.
The program runs two threads: one for MPI_Send and the other for MPI_Recv.
- MPI_Recv receives any data from any source.
- MPI_Send sends data to the other nodes one at a time (starting from its own rank: rank, rank+1, ..., 0, ..., rank-1)
You can compile the attached file as follows:
$ mpiicpc -O3 -m64 -std=c++11 -mt_mpi -qopenmp ./mpi-test.cpp -o mpi-test
You can test it as follows:
$ mpiexec.hydra -genv I_MPI_PERHOST 1 -genv I_MPI_FABRICS tcp -n 2 -machinefile ./machine_list /home/TESTER/mpi-test
rank --> rank BW=2060.27 [MB/sec]
rank --> rank BW=56.38 [MB/sec]
rank BW=219.21 [MB/sec]
rank BW=217.20 [MB/sec]
$ mpiexec.hydra -genv I_MPI_PERHOST 1 -genv I_MPI_FABRICS tcp -n 4 -machinefile ./machine_list /home/TESTER/mpi-test
rank --> rank BW=2050.59 [MB/sec]
rank --> rank BW=112.35 [MB/sec]
rank --> rank BW=57.19 [MB/sec]
rank --> rank BW=109.64 [MB/sec]
rank BW=218.28 [MB/sec]
rank BW=219.17 [MB/sec]
rank BW=220.75 [MB/sec]
rank BW=221.17 [MB/sec]
What I am observing is that when data transfers from rank A to rank B and from rank B to rank A occur simultaneously, the performance drops significantly (almost by half).
The cluster machines run CentOS 7 and are connected by 1 Gbps Ethernet that supports full-duplex transmission mode.
How can I resolve this issue?
- Does Intel MPI support full-duplex transmission mode between two ranks?
The related benchmark is the Multi-Biband benchmark (https://software.intel.com/en-us/node/561908).
Why can't it fully exploit the available network bandwidth (why only half of it?), even in full-duplex transmission mode?