Hi all,
We just received a new cluster with InfiniBand.
InfiniBand is set up correctly and all tests pass (IPoIB, RDMA, ...).
We are able to run MPI jobs using I_MPI_DEVICE=rdma:ofa-v2-ib0, which seems good.
But I have something that bothers me.
The compute nodes have two configured network interfaces:
- one GigE for SGE, cluster management and access to the users' home directories
- one IB for the high-performance interconnect and for access to a shared NFSoIB data zone
Obviously, RDMA is only available over InfiniBand, so when running a job against "rdma:ofa-v2-ib0", it should run over InfiniBand.
BUT (and here is my question): when doing a "netstat -tanpu", I can see that all compute processes have sockets open between compute nodes over the Ethernet IPs of the nodes!
Is this normal behavior (like a "heartbeat channel"), or should these sockets be bound to the IPoIB interface?
The machinefile is generated by SGE.
Here are the 3 ways I tried to run the jobs (rough command lines are sketched below):
- mpirun without specifying a machinefile: Intel MPI detects the hosts selected by SGE, using the Ethernet hostnames -> the job runs OK with rdma, but netstat shows sockets between the Ethernet addresses
- mpirun with a machinefile containing the Ethernet hostnames: same as above
- mpirun with a machinefile containing the InfiniBand hostnames: Intel MPI startup fails with a hostname failure or mismatch or something like that. I tested the connection between nodes using their IB hostnames and everything is OK.
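Roughly, the command lines look like this (simplified; the process count is a placeholder and machines.eth / machines.ib are just illustrative names for the machinefiles built from the SGE host list):
[bash]# 1) no machinefile: Intel MPI picks up the hosts selected by SGE itself
mpirun -np 16 -genv I_MPI_DEVICE rdma:ofa-v2-ib0 ./presti_exe

# 2) machinefile with the Ethernet hostnames -> same behavior as 1)
mpirun -np 16 -machinefile machines.eth -genv I_MPI_DEVICE rdma:ofa-v2-ib0 ./presti_exe

# 3) machinefile with the InfiniBand hostnames -> startup fails
mpirun -np 16 -machinefile machines.ib -genv I_MPI_DEVICE rdma:ofa-v2-ib0 ./presti_exe[/bash]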
Any ideas about that?
Am I panicking over nothing?
Thanks,
Ionel
3 Replies
Hi Ionel,
>Am I panicking over nothing?
Oh, no, you're not. You're panicking about everything. :-)
>BUT (and here is my question): when doing a "netstat -tanpu", I can see that all compute processes have sockets open between compute nodes over the Ethernet IPs of the nodes!
Well, that's correct behavior. Socket connections are opened for communication with the process manager and for input/output. The MPI communication goes through InfiniBand. To be sure, add I_MPI_DEBUG=5 to your environment variables and you'll see details about the provider used for MPI communication.
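For example (the process count and executable name are placeholders):
[bash]# Print startup details, including which fabric/provider is used
mpirun -np 16 -genv I_MPI_DEBUG 5 -genv I_MPI_DEVICE rdma:ofa-v2-ib0 ./your_app
# Look for a line such as "MPI startup(): RDMA data transfer mode"[/bash]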
> mpirun with a machinefile containing the InfiniBand hostnames: Intel MPI startup fails
It should work as well. You just need to provide the correct hostnames in the machine file. Could you provide your machine file and command line? Maybe there is something wrong.
But to be honest, you don't need to run the mpd ring via InfiniBand. The volume of communication is too low. Latency is another question, but I'm quite sure you'll hardly notice any improvement.
BTW: What library version do you use?
Regards!
Dmitry
Hi Dmitry,
Rank/pid/node pinning shows the "ethernet" hostnames:
[bash][0] 11 13210 10 mania-5
[0] 12 9664 1 mania-6[/bash]
I understand that the mpd ring will not benefit from running over InfiniBand.
Runs with I_MPI_DEBUG=5 show that the data transfers are made with RDMA:
[bash][83] MPI startup(): RDMA data transfer mode[/bash]
To be honest, everything looks fine.
I just don't understand why both the python processes and the computation processes are using Ethernet between nodes:
[bash]tcp 0 0 172.31.32.23:44747 172.31.32.25:14211 ESTABLISHED 10113/presti_exe
tcp 0 0 172.31.32.23:11123 172.31.32.23:51726 ESTABLISHED 10123/presti_exe
tcp 0 0 127.0.0.1:46902 127.0.0.1:47506 ESTABLISHED 10099/python
tcp 0 0 172.31.32.23:37183 172.31.32.23:33462 ESTABLISHED 10102/python[/bash]
(Details:
Ethernet: 172.31.0.0/16
InfiniBand: 192.168.0.0/24, not routed outside the cluster nodes, hostnames appended with "-ib")
If it uses RDMA, shouldn't I see either the IB network addresses or nothing at all?
(It's probably my mistake in not understanding correctly how process communication works.)
We are running OFED 1.5.1 packaged by Mellanox on CentOS 5.4, with Intel MPI 4.0.
Thanks,
Ionel
Hi Ionel,
For the Intel MPI Library 4.0 it would be better to use "-genv I_MPI_FABRICS shm:dapl" or "-genv I_MPI_FABRICS shm:ofa". For the DAPL fabric you can specify the provider name with "-genv I_MPI_DAPL_PROVIDER ofa-v2-ib0".
Please also check your /etc/dat.conf - you may get better performance with another provider (something like 'mlnx4').
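A minimal sketch of what that could look like (the process count is a placeholder; presti_exe and ofa-v2-ib0 are taken from your earlier output):
[bash]# List the DAPL providers configured on the nodes
cat /etc/dat.conf

# Shared memory within a node, DAPL between nodes, with an explicit provider
mpirun -np 16 -genv I_MPI_FABRICS shm:dapl -genv I_MPI_DAPL_PROVIDER ofa-v2-ib0 ./presti_exe

# Or use the OFA (verbs) fabric instead of DAPL
mpirun -np 16 -genv I_MPI_FABRICS shm:ofa ./presti_exe[/bash]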
>If it uses RDMA, shouldn't I see either the IB network addresses or nothing at all?
Are you talking about netstat? netstat will not be able to show you IB usage. InfiniBand comes with a lot of different utilities (e.g. ibdatacounters) which show various information about IB activity. The IB counters may be located in /sys/class/infiniband/mthca0/device/infiniband:mthca0/ports/1/counters - but it depends on the installation.
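As a rough sketch (the sysfs path and counter names depend on the adapter and the OFED install; mthca0 is just the adapter name from the path above):
[bash]# Snapshot the port traffic counters before and after an MPI run
cat /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_data
cat /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_data
# or query the local port with the infiniband-diags tool
perfquery[/bash]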
Usually each process opens a socket connection with mpd - that is why you see TCP activity. It doesn't mean that RDMA uses these ports. MPI uses these connections to pass control information, signals, input/output, etc.
If you need to force the mpds to use IB, you need to use the IB names in the mpd.hosts file.
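For instance (hostnames are illustrative, following the "-ib" naming you mentioned):
[bash]# mpd.hosts listing the IPoIB hostnames so the mpd ring is built over IB
mania-5-ib
mania-6-ib[/bash]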
Please let me know if you need more information.
Regards!
Dmitry
