Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

RDMA in mixed Ethernet/Infiniband cluster

1010
Beginner
Hi all,

We just got a new cluster with InfiniBand.
InfiniBand is set up correctly and all tests passed (IPoIB, RDMA, ...).
We are able to run MPI jobs using I_MPI_DEVICE=rdma:ofa-v2-ib0, which seems good.
But I have something that bothers me.
The compute nodes have two configured network interfaces:
- one GigE for SGE, cluster management, and access to the users' home directories
- one IB for the high-performance interconnect and for access to a shared NFS-over-IB data zone
Obviously, RDMA is only available over InfiniBand, so when running a job against "rdma:ofa-v2-ib0" it should run over InfiniBand.
BUT (and here is my question): when doing a "netstat -tanpu", I can see that all compute processes have sockets open between compute nodes over the Ethernet IPs of the nodes!
Is this normal behavior (like a "heartbeat channel"), or should these sockets be bound to the IPoIB interface?
The machinefile is generated by SGE.
Here are the 3 ways I tried to run the jobs (rough sketches of the invocations are below):
- mpirun without specifying a machinefile: Intel MPI detects the hosts selected by SGE, using the Ethernet hostnames -> the job runs OK with RDMA, but netstat shows sockets between the Ethernet addresses
- mpirun specifying a machinefile with Ethernet hostnames: same as above
- mpirun specifying a machinefile with InfiniBand hostnames: Intel MPI startup fails because of a hostname failure or mismatch or something like that. I tested connections between nodes using their IB hostnames and everything is OK.
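For reference, here is roughly what the three invocations looked like; the process count, executable name, and machinefile names are placeholders, not my exact commands:
[bash]
# 1) No machinefile: Intel MPI picks up the host list from SGE (Ethernet hostnames)
mpirun -genv I_MPI_DEVICE rdma:ofa-v2-ib0 -np 16 ./presti_exe

# 2) Machinefile with Ethernet hostnames (mania-5, mania-6, ...)
mpirun -machinefile ./hosts.eth -genv I_MPI_DEVICE rdma:ofa-v2-ib0 -np 16 ./presti_exe

# 3) Machinefile with InfiniBand hostnames (mania-5-ib, ...) -- this one fails at startup
mpirun -machinefile ./hosts.ib -genv I_MPI_DEVICE rdma:ofa-v2-ib0 -np 16 ./presti_exe
[/bash]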
Any ideas about that?
Am I panicking for nothing?
Thanks,
Ionel
Dmitry_K_Intel2
Employee
Hi Ionel,

>Am I panicking for nothing?
Oh, no, you're not. You're panicking about everything. :-)

>BUT (and here is my question): when doing a "netstat -tanpu", I can see that all compute processes have sockets open between compute nodes over the Ethernet IPs of the nodes!
Well, that's correct behavior. Socket connections are opened for communication with the process manager and for input/output. MPI communication goes through InfiniBand. To be sure, add I_MPI_DEBUG=5 to your environment variables and you'll see details about the provider used for MPI communication.
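For example, something along these lines (the process count and executable name are just placeholders) will print the selected fabric at startup:
[bash]
# I_MPI_DEBUG=5 makes the library report the provider it selected
mpirun -genv I_MPI_DEBUG 5 -genv I_MPI_DEVICE rdma:ofa-v2-ib0 -np 16 ./your_app
# Look for startup lines such as "MPI startup(): RDMA data transfer mode"
[/bash]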


> mpirun specifying a machinefile with InfiniBand hostnames: Intel MPI startup fails
It should work as well. You just need to provide the correct hostnames in the machine file. Could you provide your machine file and command line? Maybe something is wrong there.
But to be honest, you don't need to run the mpd ring over InfiniBand. The volume of communication is too low. Latency is another question, but I'm absolutely sure you'll hardly notice any improvement.

BTW: What library version do you use?

Regards!
Dmitry
1010
Beginner
Hi Dmitry,

I understand that the mpd ring will not benefit from running over InfiniBand.
Runs with I_MPI_DEBUG=5 show that data transfers are made with RDMA:
[bash][83] MPI startup(): RDMA data transfer mode[/bash]
Rank/pid/node pinning shows the "Ethernet" hostnames:
[bash][0] 11      13210   10      mania-5
[0] 12      9664    1       mania-6[/bash]
To be honest, everything looks fine.
I just don't understand why both the python processes and the computation processes are using Ethernet between nodes:
[bash]tcp        0      0 172.31.32.23:44747          172.31.32.25:14211          ESTABLISHED 10113/presti_exe    
tcp        0      0 172.31.32.23:11123          172.31.32.23:51726          ESTABLISHED 10123/presti_exe    
tcp        0      0 127.0.0.1:46902             127.0.0.1:47506             ESTABLISHED 10099/python        
tcp        0      0 172.31.32.23:37183          172.31.32.23:33462          ESTABLISHED 10102/python        
[/bash]
(Details:
Ethernet: 172.31.0.0/16
InfiniBand: 192.168.0.0/24, not routed outside the cluster nodes, hostnames appended with "-ib")
If it uses RDMA, shouldn't I see either the IB network addresses or nothing at all?
(It's probably my mistake for not understanding correctly how process communication works.)
We are running OFED 1.5.1, packaged by Mellanox, on CentOS 5.4, with Intel MPI 4.0.
Thanks,
Ionel
Dmitry_K_Intel2
Employee
Hi Ionel,

For Intel MPI Library 4.0 it would be better to use "-genv I_MPI_FABRICS shm:dapl" or "-genv I_MPI_FABRICS shm:ofa". For the DAPL fabric you can specify the provider name with "-genv I_MPI_DAPL_PROVIDER ofa-v2-ib0".
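A minimal sketch (the process count and executable name are placeholders):
[bash]
# Intel MPI 4.0 syntax: shared memory inside a node, DAPL between nodes,
# with the DAPL provider named explicitly
mpirun -genv I_MPI_FABRICS shm:dapl -genv I_MPI_DAPL_PROVIDER ofa-v2-ib0 \
       -genv I_MPI_DEBUG 5 -np 16 ./your_app
[/bash]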

Please check your /etc/dat.conf - you may get better performance with another provider (something like an 'mlx4' entry).
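To see which providers are configured on your nodes, you can simply list the non-comment entries of that file:
[bash]
# DAPL provider names are in the first column of /etc/dat.conf
grep -v '^#' /etc/dat.conf
[/bash]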

>If it uses RDMA, shouldn't I see either the IB network addresses or nothing at all?
Are you talking about netstat? netstat will not be able to show you IB usage. InfiniBand comes with a lot of different utilities (e.g. ibdatacounters) which show various information about IB activity. IB counters may be located in /sys/class/infiniband/mthca0/device/infiniband:mthca0/ports/1/counters - but it depends on the installation.
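As a quick sanity check (assuming the usual sysfs layout; the device name and exact path vary between installations), you can snapshot the port data counters before and after a run and verify that they grow:
[bash]
# Print the transmit data counter of every IB port; compare before and after the MPI job
for c in /sys/class/infiniband/*/ports/*/counters/port_xmit_data; do
    echo "$c: $(cat "$c")"
done
[/bash]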

Usually each process opens a socket connection to mpd - that is why you see TCP activity. It doesn't mean that RDMA uses these ports. MPI uses these connections to pass control information, signals, input/output, etc.

If you need to force the mpds to use IB, you need to use the IB names in the mpd.hosts file.
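For example, a hypothetical mpd.hosts using the "-ib" hostnames from your setup would look like this:
[bash]
# mpd.hosts with IPoIB hostnames so the mpd ring itself runs over InfiniBand
mania-5-ib
mania-6-ib
[/bash]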

Please let me know if you need more information.

Regards!
Dmitry
