Hi,
I am running a heterogeneous cluster: half the nodes have Gbit Ethernet, the other half InfiniBand. For a year or so everything went well, but recently the Gbit nodes complain about the missing InfiniBand hardware (see below). The phenomenon is limited to Intel MPI jobs; GNU MPI still runs fine.
The problem appears unrelated to the queuing system: a direct launch fails in the same way as an SGE-submitted one.
Any help would be greatly appreciated.
...
compute-0-15.local:19848: open_hca: rdma_bind ERR No such device. Is eth0 configured?
compute-0-15.local:19847: open_hca: rdma_bind ERR No such device. Is eth0 configured?
compute-0-15.local:19845: open_hca: getaddr_netdev ERROR: No such device. Is ib1 configured?
compute-0-15.local:19845: open_hca: device mthca0 not found
compute-0-15.local:19845: open_hca: device mthca0 not found
compute-0-15.local:19845: open_hca: device mlx4_0 not found
compute-0-15.local:19845: open_hca: device mlx4_0 not found
compute-0-15.local:19845: open_hca: device ipath0 not found
compute-0-15.local:19845: open_hca: device ipath0 not found
compute-0-15.local:19845: open_hca: device ehca0 not found
compute-0-15.local:19845: open_hca: rdma_bind ERR No such device. Is eth0 configured?
[cli_0]: got unexpected response to put :cmd=unparseable_msg rc=-1
:
[cli_0]: aborting job:
Fatal error in MPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(283): Initialization failed
MPIDD_Init(98).......: channel initialization failed
MPIDI_CH3_Init(163)..: generic failure with errno = 336068751
(unknown)(): Other MPI error
rank 0 in job 1 xxxxl_51508 caused collective abort of all ranks
exit status of rank 0: return code 13
Hi Hennes,
The Intel MPI Library does not support heterogeneous environments. You need to add I_MPI_FABRICS=shm:tcp to the list of environment variables if you are using 4.x.
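For example, assuming a bash shell (./your_app is just a placeholder for the actual binary):

export I_MPI_FABRICS=shm:tcp
mpirun -n 4 ./your_app

or, equivalently, pass it per run with "-genv I_MPI_FABRICS shm:tcp".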
Regards!
Dmitry
Hi Dmitry,
Thanks for your reply. I should clarify that the two fabrics have their own queues; the error posted above shows up when a job is submitted to Gbit-only nodes. Strangely, everything worked well for one year. A recent reboot of the head node broke it, and it now seems impossible to figure out which particular update caused this. Is there any point in trying SoftiWARP on the Gbit nodes?
-env I_MPI_FABRICS shm:tcp does not seem to work with -r ssh. Is there a way to make it work?
Thanks,
Hennes
Hennes,
Have you changed anything on the cluster? Have you changed Intel MPI?
The Intel MPI Library should work with ssh, but you need a passwordless connection. So, from node1 you should be able to run 'mpiexec -n 1 -host node2 hostname'.
Is it reproducible on other nodes?
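For example, as a quick check (node1 and node2 stand for two of your compute nodes):

ssh node2 hostname                  # must return immediately, without a password prompt
mpiexec -n 1 -host node2 hostname   # should print node2's hostname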
I'm not sure that the issue is related to the library - it seems to me that you need to check eth0 settings.
Regards!
Dmitry
Dmitry,
The cluster remained unchanged (except for RHEL 5.5 updates on the head node). Intel MPI is 4.0.0.028, installed about 12 months ago and left untouched since then. Passwordless connection works from and to all nodes. The eth0 settings look valid to me, and GNU MPI runs fine on all nodes. ssh as such works; it only breaks with "-env I_MPI_FABRICS shm:tcp". I already noticed this last year when I did some unrelated tests.
Regards,
Hennes
Hi Hennes,
Could you submit a ticket at premier.intel.com and attach the output of a run with I_MPI_DEBUG=20?
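For example, a sketch of such a run (./your_app and the log file name are placeholders):

export I_MPI_DEBUG=20
mpirun -n 4 ./your_app > impi_debug.log 2>&1

and attach the resulting log file to the ticket.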
Regards!
Dmitry