Intel® oneAPI HPC Toolkit

Intel MPI does not work on diskless machines?

mityh
Beginner
I have built a diskless cluster whose nodes mount the root image over NFS through Ethernet interfaces,
and I want to use Intel MPI 4.0.2.003 on the cluster.

But it cannot be installed: when I type ./install.sh -s cfg.txt -t /scratch/work/tmp
it hangs permanently.

I also tried installing it on a diskful machine with the target directory set to
an NFS-mounted path, and then mounted that path on my diskless cluster. This time it
fails with the following messages:

[root@c07b03 work]# /apps/intel/impi/4.0.2.003/bin64/mpdtrace;/apps/intel/impi/4.0.2.003/bin64/mpiexec -machinefile ./nodes -n 24 ./xhpl
ibc07b03
ibc07b04
c07b03:3590: open_hca: rdma_create_id ERR Invalid argument
c07b03:3588: open_hca: rdma_create_id ERR Invalid argument
c07b03:3585: open_hca: rdma_create_id ERR Invalid argument
c07b03:3594: open_hca: rdma_create_id ERR Invalid argument
[cli_0]: got unexpected response to put :cmd=unparseable_msg rc=-1
:
c07b03:3591: open_hca: rdma_create_id ERR Invalid argument
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(283): Initialization failed
MPIDD_Init(98).......: channel initialization failed
MPIDI_CH3_Init(163)..: generic failure with errno = 336068751
(unknown)(): Other MPI error
c07b03:3584: open_hca: rdma_create_id ERR Invalid argument
c07b03:3583: open_hca: rdma_create_id ERR Invalid argument
c07b03:3589: open_hca: rdma_create_id ERR Invalid argument
c07b03:3592: open_hca: rdma_create_id ERR Invalid argument
c07b03:3593: open_hca: rdma_create_id ERR Invalid argument
c07b03:3587: open_hca: rdma_create_id ERR Invalid argument
c07b03:3586: open_hca: rdma_create_id ERR Invalid argument
c07b04:3626: open_hca: rdma_create_id ERR Invalid argument
c07b04:3621: open_hca: rdma_create_id ERR Invalid argument
c07b04:3624: open_hca: rdma_create_id ERR Invalid argument
c07b04:3622: open_hca: rdma_create_id ERR Invalid argument
rank 0 in job 1 ibc07b03_38813 caused collective abort of all ranks
exit status of rank 0: return code 13

The following is the output of the mount command:
[root@c07b03 work]# mount
rootfs on / type rootfs (rw)
none on /proc type proc (rw)
none on /sys type sysfs (rw)
none on /dev type tmpfs (rw)
none on /dev/pts type devpts (rw)
172.16.38.8:/share/apps/hgadmin/hpcgateway/plugins/clusteros/rootimage-rhel5u5-ib1531-lustre185_allinone_mds01 on / type nfs (rw,vers=3,rsize=32768,wsize=32768,soft,intr,nolock,proto=udp,timeo=20,retrans=3,sec=sys,addr=172.16.38.8)
/dev/ram on /ram type tmpfs (rw)
/proc on /proc type proc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/proc/bus/usb on /proc/bus/usb type usbfs (rw)
devpts on /dev/pts type devpts (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
none on /ipathfs type ipathfs (rw)
tmpfs on /dev/shm type tmpfs (rw)
sysfs on /sys type sysfs (rw)
/dev/sda5 on /scratch type ext3 (rw,data=ordered)
/etc/auto.misc on /misc type autofs (rw,fd=7,pgrp=3160,timeout=300,minproto=5,maxproto=5,indirect)
-hosts on /net type autofs (rw,fd=13,pgrp=3160,timeout=300,minproto=5,maxproto=5,indirect)
/etc/auto.home on /home type autofs (rw,fd=19,pgrp=3160,timeout=30,minproto=5,maxproto=5,indirect)
/etc/auto.job on /jobmgr type autofs (rw,fd=25,pgrp=3160,timeout=30,minproto=5,maxproto=5,indirect)
/etc/auto.app on /apps type autofs (rw,fd=31,pgrp=3160,timeout=30,minproto=5,maxproto=5,indirect)
appserver:/apps/intel on /apps/intel type nfs (ro,vers=3,rsize=32768,wsize=32768,soft,intr,proto=udp,timeo=11,retrans=2,sec=sys,addr=appserver)


When I mount appserver:/apps/intel on a diskful machine, the application runs correctly.
4 Replies
Dmitry_K_Intel2
Employee
Hello,

About installation: could you attach cfg.txt and
/scratch/work/tmp/intel.*.log file?

About mpiexec error in diskless configuration...
The default path to the DAPL configuration file is /etc/dat.conf, and the DAPL dynamic libraries need to be found in the standard search path. It's not clear what your configuration is on the diskless nodes.
You can add '-env I_MPI_DEBUG 100' to your mpiexec command line and attach the output - I'll take a look.
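
As a quick check of the DAPL setup on a diskless node, the provider entries in the configuration file can be listed together with the library each one loads. This is only a minimal sketch: it assumes the usual dat.conf layout, in which the first field of an entry is the provider name and the fifth is the shared library, and the function name list_dapl_providers is hypothetical.

```shell
#!/bin/sh
# list_dapl_providers [FILE]
# Print "provider library" for every non-comment entry in a dat.conf-style file.
# Assumption: field 1 is the provider name and field 5 is the shared library.
list_dapl_providers() {
    conf="${1:-/etc/dat.conf}"
    if [ ! -r "$conf" ]; then
        echo "cannot read $conf" >&2
        return 1
    fi
    # Skip blank lines, comments, and anything too short to be a full entry.
    awk 'NF >= 5 && $1 !~ /^#/ { print $1, $5 }' "$conf"
}
```

Running `list_dapl_providers /etc/dat.conf` on the diskless node and on a working diskful node, and comparing the two lists, may show whether a provider or its library is missing from the diskless image.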

Regards!
Dmitry

mityh
Beginner
Thanks very much for your response.
The installation problem seems to be solved. Some time ago I removed the /var/lib/rpm/__db* files
because I noticed that the rpmq daemon kept starting and consuming CPU time. After I restored
the /var/lib/rpm/__db* files, the installation proceeded successfully.
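
The check this implies can be sketched as a small shell function that reports whether any __db* files are present in the RPM database directory before the RPM-based installer is started. The function name has_rpm_db_env and its optional directory argument are hypothetical, and the function only tests that the files exist, not that they are valid.

```shell
#!/bin/sh
# has_rpm_db_env [DIR]
# Succeeds if DIR (default /var/lib/rpm) contains at least one __db* file.
has_rpm_db_env() {
    dir="${1:-/var/lib/rpm}"
    for f in "$dir"/__db*; do
        # If the glob matched nothing, $f is the literal pattern and -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}
```

For example, `has_rpm_db_env || echo "no __db* files in /var/lib/rpm"` could be run on the node before calling ./install.sh.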
The cfg.txt file is as follows:
PSET_LICENSE_FILE=/apps/tool/intel/intel.lic
ACTIVATION=license_file
AUTOMOUNTED_CLUSTER=yes
UPDATE_LDSOCONF=no
REGISTER_IN_SELECTOR=no
CONTINUE_WITH_INSTALLDIR_OVERWRITE=yes
CONTINUE_WITH_OPTIONAL_ERROR=yes
PSET_INSTALL_DIR=/opt/intel/impi/4.0.2.003
INSTALL_MODE=RPM
ACCEPT_EULA=accept
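
To confirm which values a silent install will pick up from a KEY=VALUE file like the one above, a single key can be read back with a short helper. This is a sketch only; the function name cfg_get is hypothetical and it assumes one KEY=VALUE pair per line with no quoting.

```shell
#!/bin/sh
# cfg_get KEY FILE
# Print the value of KEY from a KEY=VALUE file such as cfg.txt.
# If KEY appears more than once, the last occurrence wins.
# Returns non-zero if KEY is not present.
cfg_get() {
    awk -F= -v key="$1" '
        $1 == key { sub(/^[^=]*=/, ""); val = $0; found = 1 }
        END { if (found) print val; else exit 1 }
    ' "$2"
}
```

For example, `cfg_get PSET_INSTALL_DIR cfg.txt` prints the install directory, which can then be compared with the path actually used to launch mpiexec.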

------------------------------
mpiexec debug info and /etc/dat.conf are attached.
------------------------------
some system information:
[root@c07b03 zws_work]# uname -a
Linux c07b03 2.6.18-194.17.1.0.1.el5_lustre.1.8.5 #1 SMP Mon May 23 22:48:21 CST 2011 x86_64 x86_64 x86_64 GNU/Linux
Infiniband Driver OFED-1.5.3.1
[root@c07b03 lib64]# service openibd status

HCA driver loaded

Configured IPoIB devices:
ib0

Currently active IPoIB devices:

The following OFED modules are loaded:

rdma_ucm
ib_sdp
rdma_cm
ib_addr
ib_ipoib
mlx4_core
mlx4_ib
mlx4_en
ib_mthca
ib_uverbs
ib_umad
ib_sa
ib_cm
ib_mad
ib_core
iw_cxgb3
iw_nes
ib_qib

[root@c07b03 lib64]# ibstat
CA 'mlx4_0'
CA type: MT26428
Number of ports: 1
Firmware version: 2.7.200
Hardware version: a0
Node GUID: 0x002590ffff071b78
System image GUID: 0x002590ffff071b7b
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 311
LMC: 0
SM lid: 1
Capability mask: 0x02510868
Port GUID: 0x002590ffff071b79
Link layer: IB

[root@c07b03 l_mpi_p_4.0.2.003]# ldconfig -p| grep 'libdaplcma.so'
libdaplcma.so.1 (libc6,x86-64) => /usr/lib64/libdaplcma.so.1
libdaplcma.so (libc6,x86-64) => /usr/lib64/libdaplcma.so

[root@c07b03 lib64]# lsmod
Module Size Used by
autofs4 63240 6
ib_qib 521492 1
dm_mirror 54928 0
dm_log 45312 1 dm_mirror
dm_multipath 57112 0
scsi_dh 42368 1 dm_multipath
dm_mod 102096 3 dm_mirror,dm_log,dm_multipath
video 53260 0
backlight 40064 1 video
sbs 50112 0
power_meter 47244 0
hwmon 36744 1 power_meter
i2c_ec 38784 1 sbs
i2c_core 56832 1 i2c_ec
dell_wmi 37664 0
wmi 42176 1 dell_wmi
button 40736 0
battery 44040 0
asus_acpi 50980 0
acpi_memhotplug 40708 0
ac 38920 0
parport_pc 62504 0
lp 47312 0
parport 73356 2 parport_pc,lp
nfs 296652 1
nfs_acl 36864 1 nfs
fscache 52576 1 nfs
lockd 101744 1 nfs
sunrpc 200264 8 nfs,nfs_acl,lockd
iw_nes 213160 0
iw_cxgb3 111316 0
cxgb3 214896 1 iw_cxgb3
serio_raw 40708 0
pcspkr 36480 0
shpchp 71084 0
sg 70568 0
joydev 44032 0
ib_ipoib 115040 0
ipoib_helper 35728 2 ib_ipoib
ib_mthca 157092 0
mlx4_en 113164 0
mlx4_ib 110140 0
mlx4_core 150472 2 mlx4_en,mlx4_ib
ib_sdp 206588 0
rdma_ucm 49152 0
rdma_cm 73492 2 ib_sdp,rdma_ucm
iw_cm 43656 1 rdma_cm
ib_umad 50600 0
ib_uverbs 75696 1 rdma_ucm
ib_cm 71592 2 ib_ipoib,rdma_cm
ib_sa 76424 4 ib_ipoib,rdma_ucm,rdma_cm,ib_cm
ib_mad 72100 6 ib_qib,ib_mthca,mlx4_ib,ib_umad,ib_cm,ib_sa
ib_core 109440 15 ib_qib,iw_nes,iw_cxgb3,ib_ipoib,ib_mthca,mlx4_ib,ib_sdp,rdma_ucm,rdma_cm,iw_cm,ib_umad,ib_uverbs,ib_cm,ib_sa,ib_mad
ib_addr 43016 1 rdma_cm
ipv6 435680 74 ib_ipoib,ib_sdp,rdma_cm,ib_addr
xfrm_nalgo 43524 1 ipv6
crypto_api 43136 1 xfrm_nalgo
ata_piix 57220 0
ext3 169744 1
jbd 104048 1 ext3
uhci_hcd 57624 0
ehci_hcd 66444 0
ohci_hcd 56500 0
ahci 69896 1
libata 209936 2 ata_piix,ahci
sd_mod 61704 2
scsi_mod 198040 4 scsi_dh,sg,libata,sd_mod
igb 123416 0
dca 41412 2 ib_qib,igb
8021q 57616 1 igb

[root@c07b03 lib64]# chkconfig --list
NetworkManager 0:off 1:off 2:off 3:off 4:off 5:off 6:off
acpid 0:off 1:off 2:off 3:off 4:off 5:off 6:off
amd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
anacron 0:off 1:off 2:off 3:off 4:off 5:off 6:off
arptables_jf 0:off 1:off 2:off 3:off 4:off 5:off 6:off
arpwatch 0:off 1:off 2:off 3:off 4:off 5:off 6:off
atd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
auditd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
autofs 0:off 1:off 2:off 3:on 4:on 5:on 6:off
avahi-daemon 0:off 1:off 2:off 3:off 4:off 5:off 6:off
avahi-dnsconfd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
bgpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
bluetooth 0:off 1:off 2:off 3:off 4:off 5:off 6:off
bootparamd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
capi 0:off 1:off 2:off 3:off 4:off 5:off 6:off
conman 0:off 1:off 2:off 3:off 4:off 5:off 6:off
cpuspeed 0:off 1:off 2:off 3:on 4:off 5:on 6:off
crond 0:off 1:off 2:off 3:on 4:off 5:on 6:off
cups 0:off 1:off 2:off 3:off 4:off 5:off 6:off
cyrus-imapd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
dc_client 0:off 1:off 2:off 3:off 4:off 5:off 6:off
dc_server 0:off 1:off 2:off 3:off 4:off 5:off 6:off
dhcp6r 0:off 1:off 2:off 3:off 4:off 5:off 6:off
dhcp6s 0:off 1:off 2:off 3:off 4:off 5:off 6:off
dhcpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
dhcrelay 0:off 1:off 2:off 3:off 4:off 5:off 6:off
dnsmasq 0:off 1:off 2:off 3:off 4:off 5:off 6:off
dovecot 0:off 1:off 2:off 3:off 4:off 5:off 6:off
dund 0:off 1:off 2:off 3:off 4:off 5:off 6:off
edac 0:off 1:off 2:off 3:off 4:off 5:off 6:off
fcoe 0:off 1:off 2:off 3:off 4:off 5:off 6:off
firstboot 0:off 1:off 2:off 3:off 4:off 5:off 6:off
gpm 0:off 1:off 2:off 3:off 4:off 5:off 6:off
haldaemon 0:off 1:off 2:off 3:off 4:off 5:off 6:off
hidd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
hplip 0:off 1:off 2:off 3:off 4:off 5:off 6:off
httpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
innd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ip6tables 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ipmi 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ipmievd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ipsec 0:off 1:off 2:off 3:off 4:off 5:off 6:off
iptables 0:off 1:off 2:off 3:off 4:off 5:off 6:off
irda 0:off 1:off 2:off 3:off 4:off 5:off 6:off
irqbalance 0:off 1:off 2:off 3:off 4:off 5:off 6:off
iscsi 0:off 1:off 2:off 3:off 4:off 5:off 6:off
iscsid 0:off 1:off 2:off 3:off 4:off 5:off 6:off
isdn 0:off 1:off 2:off 3:off 4:off 5:off 6:off
kadmin 0:off 1:off 2:off 3:off 4:off 5:off 6:off
kdump 0:off 1:off 2:off 3:off 4:off 5:off 6:off
kprop 0:off 1:off 2:off 3:off 4:off 5:off 6:off
krb524 0:off 1:off 2:off 3:off 4:off 5:off 6:off
krb5kdc 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ktune 0:off 1:off 2:off 3:off 4:off 5:off 6:off
kudzu 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ldap 0:off 1:off 2:off 3:off 4:off 5:off 6:off
lisa 0:off 1:off 2:off 3:off 4:off 5:off 6:off
lm_sensors 0:off 1:off 2:off 3:off 4:off 5:off 6:off
lsf 0:off 1:off 2:off 3:off 4:off 5:off 6:off
lvm2-monitor 0:off 1:off 2:off 3:off 4:off 5:off 6:off
mailman 0:off 1:off 2:off 3:off 4:off 5:off 6:off
mcstrans 0:off 1:off 2:off 3:off 4:off 5:off 6:off
mdmonitor 0:off 1:off 2:off 3:off 4:off 5:off 6:off
mdmpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
messagebus 0:off 1:off 2:off 3:off 4:off 5:off 6:off
microcode_ctl 0:off 1:off 2:off 3:off 4:off 5:off 6:off
multipathd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
mysqld 0:off 1:off 2:off 3:off 4:off 5:off 6:off
named 0:off 1:off 2:off 3:off 4:off 5:off 6:off
netconsole 0:off 1:off 2:off 3:off 4:off 5:off 6:off
netfs 0:off 1:off 2:off 3:on 4:on 5:on 6:off
netplugd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
network 0:off 1:off 2:on 3:on 4:on 5:on 6:off
nfs 0:off 1:off 2:off 3:off 4:off 5:off 6:off
nfslock 0:off 1:off 2:off 3:on 4:on 5:on 6:off
nscd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ntpd 0:off 1:off 2:off 3:on 4:off 5:on 6:off
openibd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
opensmd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ospf6d 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ospfd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
pand 0:off 1:off 2:off 3:off 4:off 5:off 6:off
pcscd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
portmap 0:off 1:off 2:off 3:on 4:on 5:on 6:off
postgresql 0:off 1:off 2:off 3:off 4:off 5:off 6:off
privoxy 0:off 1:off 2:off 3:off 4:off 5:off 6:off
psacct 0:off 1:off 2:off 3:off 4:off 5:off 6:off
radiusd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
radvd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rarpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rawdevices 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rdisc 0:off 1:off 2:off 3:off 4:off 5:off 6:off
readahead_early 0:off 1:off 2:off 3:off 4:off 5:off 6:off
readahead_later 0:off 1:off 2:off 3:off 4:off 5:off 6:off
restorecond 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rhnsd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ripd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ripngd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rpcgssd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rpcidmapd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rpcsvcgssd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rstatd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rusersd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
rwhod 0:off 1:off 2:off 3:off 4:off 5:off 6:off
saslauthd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
sendmail 0:off 1:off 2:off 3:off 4:off 5:off 6:off
setroubleshoot 0:off 1:off 2:off 3:off 4:off 5:off 6:off
smartd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
smb 0:off 1:off 2:off 3:off 4:off 5:off 6:off
snmpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
snmptrapd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
spamassassin 0:off 1:off 2:off 3:off 4:off 5:off 6:off
squid 0:off 1:off 2:off 3:off 4:off 5:off 6:off
sshd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
syslog 0:off 1:off 2:on 3:on 4:on 5:on 6:off
sysstat 0:off 1:off 2:on 3:on 4:off 5:on 6:off
tcsd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
tog-pegasus 0:off 1:off 2:off 3:off 4:off 5:off 6:off
tomcat5 0:off 1:off 2:off 3:off 4:off 5:off 6:off
tux 0:off 1:off 2:off 3:off 4:off 5:off 6:off
uuidd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
vncserver 0:off 1:off 2:off 3:off 4:off 5:off 6:off
vsftpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
watchdog 0:off 1:off 2:off 3:off 4:off 5:off 6:off
wdaemon 0:off 1:off 2:off 3:off 4:off 5:off 6:off
winbind 0:off 1:off 2:off 3:off 4:off 5:off 6:off
wpa_supplicant 0:off 1:off 2:off 3:off 4:off 5:off 6:off
xfs 0:off 1:off 2:on 3:on 4:on 5:on 6:off
xinetd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
ypbind 0:off 1:off 2:off 3:off 4:off 5:off 6:off
yppasswdd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ypserv 0:off 1:off 2:off 3:off 4:off 5:off 6:off
ypxfrd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
yum-updatesd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
zebra 0:off 1:off 2:off 3:off 4:off 5:off 6:off

xinetd based services:
amanda: off
amandaidx: off
amidxtape: off
auth: off
chargen-dgram: off
chargen-stream: off
cvs: off
daytime-dgram: off
daytime-stream: off
discard-dgram: off
discard-stream: off
echo-dgram: off
echo-stream: off
eklogin: off
ekrb5-telnet: off
gssftp: off
klogin: off
krb5-telnet: off
kshell: off
ktalk: off
ntalk: off
rexec: off
rlogin: off
rmcp: off
rsh: off
rsync: off
talk: off
tcpmux-server: off
telnet: off
tftp: off
time-dgram: off
time-stream: off
uucp: off


Dmitry_K_Intel2
Employee
First of all, you are running xhpl compiled with Intel MPI 3.2.2:
[0] MPI startup(): Intel MPI Library, Version 3.2.2 Build 20090827
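
The version the binary actually loads at run time can be pulled out of the I_MPI_DEBUG output with a short filter. This is a sketch that assumes the startup line has the same format as the one quoted above; the function name impi_runtime_version is hypothetical.

```shell
#!/bin/sh
# impi_runtime_version
# Read I_MPI_DEBUG output on stdin and print the library version taken from
# the first "MPI startup(): Intel MPI Library, Version X" line.
impi_runtime_version() {
    sed -n 's/.*MPI startup(): Intel MPI Library, Version \([0-9.]*\).*/\1/p' |
        head -n 1
}
```

Piping the debug output of a run through `impi_runtime_version` and comparing the result with the installed release (4.0.2 here) shows whether the application is still picking up an older library.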

It's not clear why it hangs.

There is a hello_world sample in the test directory. Could you compile it and try to run it?
Your default provider will be OpenIB-mlx4_0-1 (you can set it directly with '-env I_MPI_DAPL_PROVIDER OpenIB-mlx4_0-1'). It might be better to use 'ofa-v2-mlx4_0-1', but you need to compare performance.

You may also try setting another fabric: -env I_MPI_FABRICS shm:ofa
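
The variants suggested above can be written out as full command lines for review before launching them. This is a sketch only: the mpiexec path, machine file, rank count, and binary are copied from the original post, the helper name impi_try_cmd is hypothetical, and the commands are printed rather than executed. The ofa-v2 provider is only usable if a matching entry exists in /etc/dat.conf.

```shell
#!/bin/sh
# Path taken from the original post; adjust for your cluster.
MPIEXEC=/apps/intel/impi/4.0.2.003/bin64/mpiexec

# impi_try_cmd NAME VALUE
# Print (do not run) an mpiexec command line that sets one environment
# variable with -env, using the machine file and rank count from the post.
impi_try_cmd() {
    echo "$MPIEXEC -machinefile ./nodes -n 24 -env $1 $2 ./xhpl"
}

impi_try_cmd I_MPI_DAPL_PROVIDER OpenIB-mlx4_0-1
impi_try_cmd I_MPI_DAPL_PROVIDER ofa-v2-mlx4_0-1
impi_try_cmd I_MPI_FABRICS shm:ofa
```

Each printed line can be pasted into the shell one at a time, so a failure can be tied to a specific provider or fabric setting.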

Regards!
Dmitry

mityh
Beginner
Thanks for your comments.

It works fine after rebuilding the diskless image and reinstalling Intel MPI 4.0.2.

But I have no idea what was wrong with the old environment.