Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(264): Initialization failed

Kevin_S_1

Hello,

I am running the Intel mp_linpack benchmark (xhpl_em64t) with Intel MPI.

Steps:

1. I sourced /opt/intel/impi/bin64/mpivars.sh.

2. I ran "mpdboot -f hostfile".

$ cat hostfile
node 1
node 2

3. I ran "mpirun -f hostfile -ppn 1 -np 2 ./xhpl_em64t".

Step 3 failed. Below is the output with I_MPI_DEBUG=50:

[0] I_MPI_dlopen_dat(): trying to dlopen default -ldat: libdat.so
[0] my_dlopen(): trying to dlopen: libdat.so
[0] MPI startup(): cannot open dynamic library libdat.so
[0] my_dlopen(): Look for library libdat.so in /opt/intel/impi/4.0.1.007/intel64/lib,/apps/GNU/GCC/4.7.0/lib64,/apps/GNU/GCC/4.7.0/lib,/apps/GNU/MPC/1.0.1/lib,/apps/GNU/GMP/5.1.2/lib,/apps/GNU/MPFR/3.1.2/lib,include ld.so.conf.d/*.conf,,/lib,/usr/lib
[0] my_dlopen(): dlopen failed: libdat.so: cannot open shared object file: No such file or directory
[0] I_MPI_dlopen_dat(): could not open -ldat
[cli_0]: got unexpected response to put :cmd=unparseable_msg rc=-1
:
[0] MPI startup(): Intel(R) MPI Library, Version 3.1  Build 20080331
[0] MPI startup(): Copyright (C) 2003-2008 Intel Corporation.  All rights reserved.
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(264): Initialization failed
MPIDD_Init(98).......: channel initialization failed
MPIDI_CH3_Init(183)..: generic failure with errno = 336068751
(unknown)(): Other MPI error
[1] I_MPI_dlopen_dat(): trying to dlopen default -ldat: libdat.so
[1] my_dlopen(): trying to dlopen: libdat.so
[1] MPI startup(): cannot open dynamic library libdat.so
[1] my_dlopen(): Look for library libdat.so in /opt/intel/impi/4.0.1.007/intel64/lib,/apps/GNU/GCC/4.7.0/lib64,/apps/GNU/GCC/4.7.0/lib,/apps/GNU/MPC/1.0.1/lib,/apps/GNU/GMP/5.1.2/lib,/apps/GNU/MPFR/3.1.2/lib,include ld.so.conf.d/*.conf,,/lib,/usr/lib
[1] my_dlopen(): dlopen failed: libdat.so: cannot open shared object file: No such file or directory
[1] I_MPI_dlopen_dat(): could not open -ldat
rank 0 in job 1  fuji382_53442   caused collective abort of all ranks
  exit status of rank 0: return code 13 

Would anyone be able to help me? Thank you very much in advance!

Thank you,

Kevin

James_T_Intel
Moderator

The failure to open libdat.so suggests that your InfiniBand* drivers are not installed correctly. Try reinstalling the drivers and rerunning the job.

Also, you don't need to use mpdboot with mpirun.  By default, mpirun uses Hydra, not MPD.
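
A quick way to follow up on the driver suggestion is to check whether the dynamic loader can see any DAT library at all, since the debug output shows dlopen failing on libdat.so. The sketch below assumes a glibc-based Linux system where ldconfig is installed; the function name find_dat_libs is made up for illustration and is not part of Intel MPI or OFED.

```shell
#!/bin/sh
# List DAT (uDAPL) libraries registered in the dynamic loader cache.
# Assumption: glibc's ldconfig is installed; it often lives in /sbin.
find_dat_libs() {
    ( PATH="$PATH:/sbin:/usr/sbin"; ldconfig -p ) 2>/dev/null \
        | grep -E 'libdat2?\.so' \
        || echo "no libdat libraries found in the loader cache"
}

find_dat_libs
```

This only looks at the loader cache. Directories that are reachable solely through LD_LIBRARY_PATH would need a separate check.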

Kevin_S_1

Hi James,

Thank you for the reply. I have reinstalled OFED-3.12 from www.openfabrics.org. However, I still get the same problem.

Some further information on my system:

$ mpirun --version
Intel(R) MPI Library for Linux, 64-bit applications, Version 4.0 Update 1  Build 20100910
Copyright (C) 2003-2010 Intel Corporation.  All rights reserved.

$ cat /etc/dat.conf
# DAT v2.0, v1.2 configuration file
#  
# Each entry should have the following fields:
#
# <ia_name> <api_version> <threadsafety> <default> <lib_path> \
#           <provider_version> <ia_params> <platform_params>
#
# For uDAPL cma provder, <ia_params> is one of the following:
#       network address, network hostname, or netdev name and 0 for port
#
# For uDAPL scm provider, <ia_params> is device name and port
# For uDAPL ucm provider, <ia_params> is device name and port
# For uDAPL iWARP provider, <ia_params> is netdev device name and 0 
# For uDAPL iWARP provider, <ia_params> is netdev device name and 0 
# For uDAPL RoCE provider, <ia_params> is device name and 0 
# 
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
ofa-v2-ehca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-cma-roe-eth2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-cma-roe-eth3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth3 0" ""
ofa-v2-scm-roe-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-scm-roe-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mcm-1 u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mcm-2 u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-scif0 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "scif0 1" ""
ofa-v2-scif0-u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "scif0 1" ""
ofa-v2-mic0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "mic0:ib 1" ""
ofa-v2-mlx4_0-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mlx4_1-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_1 1" ""
ofa-v2-mlx4_1-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_1 2" ""
ofa-v2-mlx4_1-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_1 1" ""
ofa-v2-mlx4_1-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_1 2" ""
ofa-v2-mlx4_0-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mlx4_1-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_1 1" ""
ofa-v2-mlx4_1-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_1 2" ""
ofa-v2-mlx5_0-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_0 1" ""
ofa-v2-mlx5_0-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_0 2" ""
ofa-v2-mlx5_1-1s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_1 1" ""
ofa-v2-mlx5_1-2s u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx5_1 2" ""
ofa-v2-mlx5_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_0 1" ""
ofa-v2-mlx5_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_0 2" ""
ofa-v2-mlx5_1-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_1 1" ""
ofa-v2-mlx5_1-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_1 2" ""
ofa-v2-mlx5_0-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_0 1" ""
ofa-v2-mlx5_0-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_0 2" ""
ofa-v2-mlx5_1-1m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_1 1" ""
ofa-v2-mlx5_1-2m u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx5_1 2" ""

$ /etc/infiniband/info
prefix=/usr
Kernel=2.6.32-431.29.2.el6.x86_64

Configure options: --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod --with-mthca-mod --with-mlx4-mod --with-mlx4_en-mod --with-mlx5-mod --with-cxgb3-mod --with-cxgb4-mod --with-nes-mod --with-qib-mod --with-ocrdma-mod --with-ipoib-mod --with-srp-mod --with-nfsrdma-mod

$ lsmod | grep ib
ib_addr                 6285  2 rdma_ucm,rdma_cm
ib_ipoib               80316  0 
ib_cm                  36986  2 rdma_cm,ib_ipoib
ib_uverbs              36126  5 rdma_ucm
ib_umad                11564  4 
libcrc32c               1246  1 iw_nes
ipv6                  318183  88 ip6t_REJECT,nf_conntrack_ipv6,nf_defrag_ipv6,ib_addr,ib_ipoib,ocrdma,iw_cxgb4,cxgb4
ib_qib                389783  0 
mlx5_ib                92954  0 
mlx5_core              77814  1 mlx5_ib
mlx4_ib               128242  1 
ib_sa                  23806  5 rdma_ucm,rdma_cm,ib_ipoib,ib_cm,mlx4_ib
mlx4_core             213339  2 mlx4_en,mlx4_ib
ib_mthca              134119  0 
ib_mad                 38676  6 ib_cm,ib_umad,ib_qib,mlx4_ib,ib_sa,ib_mthca
ib_core                73994  17 rdma_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,ocrdma,iw_nes,iw_cxgb4,iw_cxgb3,ib_qib,mlx5_ib,mlx4_ib,ib_sa,ib_mthca,ib_mad
compat                 26078  25 rdma_ucm,rdma_cm,iw_cm,ib_addr,ib_ipoib,ib_cm,ib_uverbs,ib_umad,ocrdma,be2net,iw_nes,iw_cxgb4,cxgb4,iw_cxgb3,cxgb3,ib_qib,mlx5_ib,mlx5_core,mlx4_en,mlx4_ib,ib_sa,mlx4_core,ib_mthca,ib_mad,ib_core
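
Every entry in the dat.conf listing above declares api_version u2.0, while the debug log shows the application trying to open libdat.so, which appears to be the DAT 1.2 library name. A small sketch for tallying the entries per API version follows. It assumes the field layout described in the file's own header comment, and the function name dat_api_versions is hypothetical.

```shell
#!/bin/sh
# Count dat.conf entries per <api_version> (field 2), skipping comments
# and blank lines. Prints one "version count" pair per line, sorted.
dat_api_versions() {
    awk '!/^[ \t]*#/ && NF >= 2 { n[$2]++ }
         END { for (v in n) print v, n[v] }' "$1" | sort
}

# Only run against the real file when it exists and is readable.
if [ -r /etc/dat.conf ]; then dat_api_versions /etc/dat.conf; fi
```

For the listing above this should report a single u2.0 line and no u1.2 entries.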
Kevin_S_1

I found the problem! It turns out that my Intel MPI somehow does not work with DAPL v2.0. I added DAPL v1.2 compatibility by removing the existing dapl package with yum and then installing the following packages with yum:

dapl-2.0.34-1.el6.x86_64
compat-dapl-1.2.19-2.el6.x86_64


Now it works. Thank you for the help, James.

Regards,

Kevin
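
The package change described above can be sketched as a script. This is only a sketch under assumptions: the package to remove is taken to be named dapl, the versions are the ones listed in the post, and the run helper and DRY_RUN variable are made up here so that the commands are printed rather than executed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print each command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi
}

swap_dapl_packages() {
    # Assumption: the existing package is named "dapl".
    run yum remove dapl
    # Package versions as listed in the post (el6, x86_64 builds).
    run yum install dapl-2.0.34-1.el6.x86_64 compat-dapl-1.2.19-2.el6.x86_64
}

swap_dapl_packages
```

Running with DRY_RUN=0 needs root privileges, and yum will ask for confirmation because -y is not passed.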
