Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Error when executing mpirun on 2 nodes

jabuin
Beginner

Hi!

I have 2 compute nodes and a head node.

OS: CentOS 5.5, with mpi-rt-4.0.0.028 installed.

The mpd ring starts normally:

[root@head ~]# mpdboot -d -v -r ssh -f /root/mpd.hosts -n 3
debug: starting
running mpdallexit on head.kazntu.local
LAUNCHED mpd on head.kazntu.local via
debug: launch cmd= env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/mpi-rt/4.0.0/bin64/mpd.py --ncpus=1 --myhost=head.kazntu.local -e -d -s 3
debug: mpd on head.kazntu.local on port 45035
RUNNING: mpd on head.kazntu.local
debug: info for running mpd: {'ip': '', 'ncpus': 1, 'list_port': 45035, 'entry_port': '', 'host': 'head.kazntu.local', 'entry_host': '', 'ifhn': ''}
LAUNCHED mpd on node01 via head.kazntu.local
debug: launch cmd= ssh -x -n -q node01 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME /opt/intel/mpi-rt/4.0.0/bin64/mpd.py -h head.kazntu.local -p 45035 --ifhn=192.168.192.21 --ncpus=1 --myhost=node01 --myip=192.168.192.21 -e -d -s 3
LAUNCHED mpd on node02 via head.kazntu.local
debug: launch cmd= ssh -x -n -q node02 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME /opt/intel/mpi-rt/4.0.0/bin64/mpd.py -h head.kazntu.local -p 45035 --ifhn=192.168.192.22 --ncpus=1 --myhost=node02 --myip=192.168.192.22 -e -d -s 3
debug: mpd on node01 on port 43150
RUNNING: mpd on node01
debug: info for running mpd: {'ip': '192.168.192.21', 'ncpus': 1, 'list_port': 43150, 'entry_port': 45035, 'host': 'node01', 'entry_host': 'head.kazntu.local', 'ifhn': '', 'pid': 6272}
debug: mpd on node02 on port 43164
RUNNING: mpd on node02
debug: info for running mpd: {'ip': '192.168.192.22', 'ncpus': 1, 'list_port': 43164, 'entry_port': 45035, 'host': 'node02', 'entry_host': 'head.kazntu.local', 'ifhn': '', 'pid': 6273}
[root@head ~]#

but mpirun crashes:

[root@head ~]# mpirun -n 8 -wdir /linpack/ -host node01 /linpack/xhpl_em64t : -host node02 /linpack/xhpl_em64t
[cli_0]: got unexpected response to put :cmd=unparseable_msg rc=-1
:
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283): Initialization failed
MPIDD_Init(98).......: channel initialization failed
MPIDI_CH3_Init(163)..: generic failure with errno = 336068751
(unknown)(): Other MPI error
rank 0 in job 1 head.kazntu.local_53559 caused collective abort of all ranks
exit status of rank 0: return code 13
[root@head ~]#

How can I get mpirun to run correctly?

I need help!

P.S. Sorry for my English.

jabuin
Beginner

When I run mpirun with the I_MPI_DEBUG option:

[root@head ~]# mpirun -n 16 -env I_MPI_DEBUG 2 -wdir /linpack/ /linpack/xhpl_em64t
[1] MPI startup(): cannot open dynamic library libdat.so
[2] MPI startup(): cannot open dynamic library libdat.so
[2] MPI startup(): cannot open dynamic library libdat2.so

[3] MPI startup(): cannot open dynamic library libdat.so
[3] MPI startup(): cannot open dynamic library libdat2.so
[1] MPI startup(): cannot open dynamic library libdat2.so
[0] MPI startup(): cannot open dynamic library libdat.so
[5] MPI startup(): cannot open dynamic library libdat.so
[5] MPI startup(): cannot open dynamic library libdat2.so
[0] MPI startup(): cannot open dynamic library libdat2.so
[8] MPI startup(): cannot open dynamic library libdat.so
[8] MPI startup(): cannot open dynamic library libdat2.so
[7] MPI startup(): cannot open dynamic library libdat.so
[6] MPI startup(): cannot open dynamic library libdat.so
[6] MPI startup(): cannot open dynamic library libdat2.so
[7] MPI startup(): cannot open dynamic library libdat2.so
[4] MPI startup(): cannot open dynamic library libdat.so
[cli_0]: got unexpected response to put :cmd=unparseable_msg rc=-1
:
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283): Initialization failed
MPIDD_Init(98).......: channel initialization failed
MPIDI_CH3_Init(163)..: generic failure with errno = 336068751
(unknown)(): Other MPI error
[4] MPI startup(): cannot open dynamic library libdat2.so
[12] MPI startup(): cannot open dynamic library libdat.so
[12] MPI startup(): cannot open dynamic library libdat2.so
[14] MPI startup(): cannot open dynamic library libdat.so
[14] MPI startup(): cannot open dynamic library libdat2.so
[10] MPI startup(): cannot open dynamic library libdat.so
[10] MPI startup(): cannot open dynamic library libdat2.so
[15] MPI startup(): cannot open dynamic library libdat.so
[15] MPI startup(): cannot open dynamic library libdat2.so
[9] MPI startup(): cannot open dynamic library libdat.so
[9] MPI startup(): cannot open dynamic library libdat2.so
[11] MPI startup(): cannot open dynamic library libdat.so
[11] MPI startup(): cannot open dynamic library libdat2.so
[13] MPI startup(): cannot open dynamic library libdat.so
[13] MPI startup(): cannot open dynamic library libdat2.so
rank 0 in job 1 head.kazntu.local_57440 caused collective abort of all ranks
exit status of rank 0: return code 13

jabuin
Beginner
Where can I get these libraries? What are they needed for?
Dmitry_K_Intel2
Employee
Hi Jabuin,

libdat2.so is part of the OFED stack. You can download it from http://www.openfabrics.org/downloads/OFED/ofed-1.5.2/
If you configure and install this package, everything should be fine.

The package may already be installed, just not in the default directory. Could you please check that this library is available either from the paths mentioned in /etc/ld.so.conf or from $LD_LIBRARY_PATH?
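For example, a quick check could look something like this (just a sketch, assuming a standard CentOS layout; adjust the directories to your system):

ldconfig -p | grep libdat
echo $LD_LIBRARY_PATH
ls /usr/lib64/libdat*.so* /usr/local/lib64/libdat*.so* 2>/dev/null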

If your DAPL library is not properly configured, you can try a socket connection:
'mpirun -n 16 -nolocal -env I_MPI_FABRICS shm:tcp /linpack/xhpl_em64t'
Please try this command line and let me know the result.


If you are using 'mpirun', you don't need to run 'mpdboot' first. If you are using 'mpdboot', please use 'mpiexec'.
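For example, the two workflows would look roughly like this (an illustrative sketch reusing the hosts file and options from your own command lines above; exact option handling may differ slightly for your version):

# mpirun boots the MPD ring itself, runs the job, and tears the ring down
mpirun -r ssh -f /root/mpd.hosts -n 16 -env I_MPI_FABRICS shm:tcp -wdir /linpack/ /linpack/xhpl_em64t

# with mpdboot, the ring is started once and mpiexec submits jobs to it
mpdboot -r ssh -f /root/mpd.hosts -n 3
mpiexec -n 16 -env I_MPI_FABRICS shm:tcp -wdir /linpack/ /linpack/xhpl_em64t
mpdallexit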

Regards!
Dmitry
jabuin
Beginner

Hi Dmitry,

Without the OFED stack:

[root@head ~]# mpirun -n 16 -env I_MPI_FABRICS shm:tcp /linpack/xhpl_em64t
[cli_0]: got unexpected response to put :cmd=unparseable_msg rc=-1
:
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283): Initialization failed
MPIDD_Init(98).......: channel initialization failed
MPIDI_CH3_Init(163)..: generic failure with errno = 336068751
(unknown)(): Other MPI error
rank 0 in job 1 head.kazntu.local_53450 caused collective abort of all ranks
exit status of rank 0: return code 13
[root@head ~]#

After installing the OFED stack:

[root@head ~]# mpirun -n 16 -wdir /linpack/ -env I_MPI_FABRICS shm:tcp -host node01 /linpack/xhpl_em64t : -host node02 /linpack/xhpl_em64t
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to get RDMA device list
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to get RDMA device list
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to get RDMA device list
librdmacm: couldn't read ABI version.
CMA: unable to get RDMA device list

Debugging:

[root@head ~]# mpirun -n 16 -env I_MPI_DEBUG 2 -wdir /linpack/ /linpack/xhpl_em64t
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to get RDMA device list
CMA: unable to get RDMA device list
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to get RDMA device list
CMA: unable to get RDMA device list
CMA: unable to get RDMA device list

Dmitry_K_Intel2
Employee
Your name is Eugeny, isn't it?
Is your e-mail address real? It might be better to communicate through e-mail.

Can I get access to your cluster?

Did you recompile xhpl with the Intel MPI library? Could you provide the 'ldd xhpl' output?
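For example, something along these lines (just a sketch; the grep simply filters the MPI-related entries):

cd /linpack && ldd ./xhpl_em64t | grep -i mpi

The output should show whether the binary picks up libmpi.so from the Intel MPI runtime (e.g. under /opt/intel/mpi-rt/4.0.0/) or from some other MPI installation.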

The error looks very strange - mpdman.py cannot parse a message. It means that the message is either malformed or contains unexpected symbols. We have never seen such a case before.

Regards!
Dmitry
jabuin
Beginner
Yes, it's my real e-mail. Can we speak Russian?
Dmitry_K_Intel2
Employee
Further communication goes via e-mail.
jabuin
Beginner

After correcting the /etc/dat.conf file and running /sbin/modinfo rdma_ucm on the compute nodes, I now get the following errors:

[root@head ~]# mpirun -n 16 -env I_MPI_DEBUG 2 -wdir /linpack/ /linpack/xhpl_em64t
node02.kazntu.local:9961: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node02.kazntu.local:9960: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node02.kazntu.local:9955: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node02.kazntu.local:9959: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node02.kazntu.local:9957: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node01.kazntu.local:10329: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node01.kazntu.local:10327: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node02.kazntu.local:9962: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node01.kazntu.local:10324: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node01.kazntu.local:10323: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node01.kazntu.local:10325: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node02.kazntu.local:9958: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node01.kazntu.local:10328: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node01.kazntu.local:10326: open_hca: rdma_bind ERR No such device. Is eth0 configured?
node02.kazntu.local:9956: open_hca: rdma_bind ERR No such device. Is eth0 configured?
[cli_0]: got unexpected response to put :cmd=unparseable_msg rc=-1
:
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283): Initialization failed
MPIDD_Init(98).......: channel initialization failed
MPIDI_CH3_Init(163)..: generic failure with errno = 336068751
(unknown)(): Other MPI error
node01.kazntu.local:10330: open_hca: rdma_bind ERR No such device. Is eth0 configured?
rank 0 in job 1 head.kazntu.local_38570 caused collective abort of all ranks
exit status of rank 0: return code 13
[root@head ~]#

/etc/dat.conf:

[root@node01 ~]# cat /etc/dat.conf
#OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
#OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" ""
#OpenIB-mthca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 1" ""
#OpenIB-mthca0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 2" ""
#OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
#OpenIB-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""
#OpenIB-ipath0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 1" ""
#OpenIB-ipath0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 2" ""
#OpenIB-ehca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ehca0 1" ""
#OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth0 0" ""
OpenIB-cma-roe-eth0 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth0 0" ""
#OpenIB-cma-roe-eth2 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
#OpenIB-cma-roe-eth3 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth3 0" ""
#OpenIB-scm-roe-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
#OpenIB-scm-roe-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""
#ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
#ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
#ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
#ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
#ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
#ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
#ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
#ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
#ofa-v2-ehca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
#ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
#ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
#ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
#ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""
#ofa-v2-mthca0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-cma-roe-eth0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth0 0" ""
#ofa-v2-cma-roe-eth2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
#ofa-v2-cma-roe-eth3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth3 0" ""
#ofa-v2-scm-roe-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
#ofa-v2-scm-roe-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
[root@node01 ~]#
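(Side note: the only uncommented entries above map DAPL onto eth0 as an RDMA-over-Ethernet (RoCE) device. One way to check whether the node actually exposes any RDMA device, assuming the libibverbs utilities from the OFED stack are installed, is:

ibv_devices

If nothing is listed, these DAPL providers cannot bind to a device, and the shm:tcp fabric is the safer choice.)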

ifconfig -a:

[root@node01 ~]# ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:23:8B:BD:5F:D2
inet addr:192.168.192.21 Bcast:192.168.195.255 Mask:255.255.252.0
inet6 addr: fe80::223:8bff:febd:5fd2/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:415397 errors:0 dropped:0 overruns:0 frame:0
TX packets:406833 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:120265102 (114.6 MiB) TX bytes:116642856 (111.2 MiB)
Memory:fa9e0000-faa00000

eth1 Link encap:Ethernet HWaddr 00:23:8B:BD:5F:D3
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Memory:fa960000-fa980000

eth2 Link encap:Ethernet HWaddr 00:23:8B:BD:5F:D4
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Memory:faae0000-fab00000

eth3 Link encap:Ethernet HWaddr 00:23:8B:BD:5F:D5
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Memory:faa60000-faa80000

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:62800 errors:0 dropped:0 overruns:0 frame:0
TX packets:62800 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:8157453 (7.7 MiB) TX bytes:8157453 (7.7 MiB)

sit0 Link encap:IPv6-in-IPv4
NOARP MTU:1480 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

[root@node01 ~]#

P.S. Progress is slow because the equipment is still being tested.

Please answer via e-mail :)
