I have a setup of Infiniband network where we are testing the performance of FDRs. I have two FDRs on the sender and one FDR on each receiver. There are two receivers. The idea is to run parallel sends from the sender on each FDR and receive it at receiver. We were trying mvapich, but mvapich stated that they clearly don't support such a network.
I was wondering if Intel MPI support such a network. And if we can do something like:
mpirun -n 2 -hosts Sender,Receiver1 -env MV2_IBA_HCA=mlx4_0 ./exec : -n 2 -hosts Sender,Receiver2 -env MV2_IBA_HCA=mlx4_1 ./exec
where mlx4_0 and mlx4_1 are the ids of the FDR cards. So I am trying to run ./exec parallel on different FDR cards to send data to both the receivers at the same time.
Is this possible using Intel MPI. If someone has a similar setup then please let me know.
This is not a supported method. The best suggestion I have is to try using OFED* multiple adapter capability. To do this, set
on all of the ranks. On the ranks on Sender, set
and on the ranks on receiver, set
I don't know if this will work, and I don't have a system to test it on. I'm asking our developers for any additional information they may have.
Technical Consulting Engineer
Intel® Cluster Tools
Ok, I stand corrected, this is a supported model. What you'll need to do is set
[plain]mpirun -n 2 -hosts Sender,Receiver1 -env I_MPI_DAPL_PROVIDER=mlx4_0 ./exec : -n 2 -hosts Sender,Receiver2 -env I_MPI_DAPL_PROVIDER=mlx4_1 ./exec[/plain]
Thanks James for your updates. I was also trying few combinations and I figured out that this command:
mpirun -genv I_MPI_DEBUG 5 -genv I_MPI_FABRICS shm:ofa -n 1 -host Sender -env I_MPI_OFA_ADAPTER_NAME mlx4_0 ./exec : -n 1 -host Sender -env I_MPI_OFA_ADAPTER_NAME mlx4_1 ./exec : -n 1 -host Receiver1 -env I_MPI_OFA_ADAPTER_NAME mlx4_0 ./exec : -n 1 -host -env I_MPI_OFA_ADAPTER_NAME mlx4_0 Receiver2 ./exec
And as you have mentioned in the comment, Intel-MPI doesn't behave exactly as mvapich. So "-n 2 -hosts Sender,Receiver1" starts 2 processes on Sender and I see no executable running on Receiver1. If you notice above command, I have one line for each node. In this way I was able to run it.