Intel® MPI Library

Problem Using Intel MPI with distributed CnC

ravi_mul

I am using Intel mpirun to run distributed programs, and I observe the following strange behavior: a StarPU distributed code runs totally fine on the cluster, but a CnC distributed code produces InfiniBand errors.

This command runs fine:

mpirun -n 8 -ppn 1 -hostfile ~/hosts -genv I_MPI_DEBUG 5  ./dist_floyd_starpu.exe -n 10 -b 1 -starpu_dist

This command produces a stream of DAPL/InfiniBand errors:

mpirun -n 8 -ppn 1 -hostfile ~/hosts -genv I_MPI_DEBUG 5  -genv DIST_CNC=MPI ./dist_floyd_cnc.exe -n 10 -b 1 -cnc_dist

[0] MPI startup(): Intel(R) MPI Library, Version 4.1.0 Build 20130116
[0] MPI startup(): Copyright (C) 2003-2013 Intel Corporation. All rights reserved.
[2] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[2] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[3] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[4] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[3] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[4] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[6] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[6] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[1] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[0] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[5] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[5] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[7] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[7] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
node3.local:7764: open_hca: device mlx4_0 not found
[2] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
[2] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
node3.local:7764: open_hca: device mlx4_0 not found
node7.local:7c3a: open_hca: device mlx4_0 not found
[6] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
[2] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
[2] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
[6] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
node7.local:7c3a: open_hca: device mlx4_0 not found
[6] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
[6] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
node6.local:5f98: open_hca: device mlx4_0 not found
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
[5] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
[5] I_MPI_dlopen_dat(): trying to load default dat library: libdat2.so.2
CMA: unable to open /dev/infiniband/rdma_cm
node4.local:75c: open_hca: device mlx4_0 not found
node5.local:1ce2: open_hca: device mlx4_0 not found
librdmacm: couldn't read ABI version.
node6.local:5f98: open_hca: device mlx4_0 not found
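To see whether every node or only some of them report the missing device, the log can be reduced to the distinct host names in front of the open_hca message. This is only a sketch: failing_hosts is a hypothetical helper name, and it assumes the "host:pid: open_hca: device ... not found" line format seen in the output above.

```shell
# Hypothetical helper: read mpirun output on stdin and print each host
# that reported "open_hca: device ... not found", once per host.
failing_hosts() {
    grep 'open_hca: device .* not found' | cut -d: -f1 | sort -u
}
```

It could be used by piping the failing run into it, for example by appending `2>&1 | failing_hosts` to the mpirun command for dist_floyd_cnc.exe.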

Both executables link against the same InfiniBand libraries (librdmacm, libibverbs, libibumad), as ldd shows:

ldd dist_floyd_cnc.exe 
libcnc.so => /opt/intel/cnc/0.7/lib/intel64/libcnc.so (0x00002b4292bc0000)
libtbb.so.2 => /home1/intel/composer_xe_2013.1.117/tbb/lib/intel64/libtbb.so.2 (0x00002b4292cff000)
libtbbmalloc.so.2 => /home1/intel/composer_xe_2013.1.117/tbb/lib/intel64/libtbbmalloc.so.2 (0x00002b4292e4b000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00000039e7800000)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00002b4292fbe000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x0000003550200000)
libibumad.so.2 => /usr/lib64/libibumad.so.2 (0x0000003550e00000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003550600000)
librt.so.1 => /lib64/librt.so.1 (0x0000003552200000)
libm.so.6 => /lib64/libm.so.6 (0x00002b42931c8000)
libiomp5.so => /home1/intel/composer_xe_2013.1.117/compiler/lib/intel64/libiomp5.so (0x00002b429344b000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003559800000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003555e00000)
libc.so.6 => /lib64/libc.so.6 (0x000000354fe00000)
/lib64/ld-linux-x86-64.so.2 (0x000000354fa00000)

ldd dist_floyd_starpu.exe 
libstarpu-1.0.so.1 => /usr/local/lib/libstarpu-1.0.so.1 (0x00002aaef6b6b000)
libhwloc.so.5 => /usr/local/lib/libhwloc.so.5 (0x00002aaef6dea000)
libstarpumpi-1.0.so.1 => /usr/local/lib/libstarpumpi-1.0.so.1 (0x00002aaef7014000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00000039e7800000)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00002aaef7252000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x0000003550200000)
libibumad.so.2 => /usr/lib64/libibumad.so.2 (0x0000003550e00000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003550600000)
librt.so.1 => /lib64/librt.so.1 (0x0000003552200000)
libm.so.6 => /lib64/libm.so.6 (0x00002aaef745c000)
libiomp5.so => /home1/intel/composer_xe_2013.1.117/compiler/lib/intel64/libiomp5.so (0x00002aaef76df000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003559800000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003555e00000)
libc.so.6 => /lib64/libc.so.6 (0x000000354fe00000)
libglpk.so.0 => /usr/lib64/libglpk.so.0 (0x00002aaef79e3000)
libelf.so.1 => /usr/lib64/libelf.so.1 (0x0000003551200000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00002aaef7c79000)
libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x000000355cc00000)
libz.so.1 => /usr/lib64/libz.so.1 (0x00002aaef7e7f000)
/lib64/ld-linux-x86-64.so.2 (0x000000354fa00000)
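Comparing the two ldd listings by eye is error-prone, so a small script can list the libraries that only one of the executables links against. This is a sketch, not a definitive tool: lib_names and compare_libs are hypothetical helper names, and the comparison is by library name only (the first column of ldd output), ignoring resolved paths and load addresses.

```shell
# Hypothetical helper: reduce `ldd` output on stdin to sorted library names.
lib_names() {
    awk '{print $1}' | sort -u
}

# Hypothetical helper: show which library names are unique to each executable.
compare_libs() {
    a=$(mktemp)
    b=$(mktemp)
    ldd "$1" | lib_names > "$a"
    ldd "$2" | lib_names > "$b"
    echo "only in $1:"
    comm -23 "$a" "$b"
    echo "only in $2:"
    comm -13 "$a" "$b"
    rm -f "$a" "$b"
}
```

For the two programs in this post the call would be `compare_libs ./dist_floyd_cnc.exe ./dist_floyd_starpu.exe`.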

Any help or inputs would be appreciated.

1 Reply
Frank_S_Intel
Employee