Hello,
When I use I_MPI_FABRICS=tcp (on 2 MICs), MPI applications work fine, but when I switch to dapl, the run fails with the following messages:
~ $ mpirun -env I_MPI_DEBUG=4 -env I_MPI_MIC=enable -env I_MPI_FABRICS=dapl -f mpi_hosts -perhost 1 -n 2 $MPI_DIR/bin/IMB-MPI1 PingPong
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
node001-mic0:43df:5cbb7700: 5188 us(5188 us): open_hca: device mlx4_0 not found
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
node001-mic0:43df:5cbb7700: 6713 us(1525 us): open_hca: device mlx4_0 not found
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
node001-mic1:3fbc:64958700: 6183 us(6183 us): open_hca: device mlx4_0 not found
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2
node001-mic1:3fbc:64958700: 7466 us(1283 us): open_hca: device mlx4_0 not found
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib0
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to open /dev/infiniband/rdma_cm
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to open /dev/infiniband/rdma_cm
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib1
CMA: unable to open /dev/infiniband/rdma_cm
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ib1
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mthca0-1
CMA: unable to open /dev/infiniband/rdma_cm
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mthca0-1
node001-mic0:43df:5cbb7700: 24490 us(17777 us): open_hca: device mthca0 not found
node001-mic1:3fbc:64958700: 19432 us(11966 us): open_hca: device mthca0 not found
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mthca0-2
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mthca0-2
node001-mic1:3fbc:64958700: 20396 us(964 us): open_hca: device mthca0 not found
node001-mic0:43df:5cbb7700: 25969 us(1479 us): open_hca: device mthca0 not found
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ipath0-1
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ipath0-1
node001-mic1:3fbc:64958700: 21365 us(969 us): open_hca: device ipath0 not found
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ipath0-2
node001-mic1:3fbc:64958700: 22305 us(940 us): open_hca: device ipath0 not found
node001-mic0:43df:5cbb7700: 27660 us(1691 us): open_hca: device ipath0 not found
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ehca0-2
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ipath0-2
node001-mic1:3fbc:64958700: 23250 us(945 us): open_hca: device ehca0 not found
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-iwarp
node001-mic0:43df:5cbb7700: 29157 us(1497 us): open_hca: device ipath0 not found
CMA: unable to open /dev/infiniband/rdma_cm
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1u
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-ehca0-2
node001-mic0:43df:5cbb7700: 30485 us(1328 us): open_hca: device ehca0 not found
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-iwarp
CMA: unable to open /dev/infiniband/rdma_cm
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1u
node001-mic1:3fbc:64958700: 148 us(148 us): open_hca: device mlx4_0 not found
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2u
node001-mic1:3fbc:64958700: 586 us(438 us): open_hca: device mlx4_0 not found
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mthca0-1u
node001-mic1:3fbc:64958700: 936 us(350 us): open_hca: device mthca0 not found
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mthca0-2u
node001-mic1:3fbc:64958700: 1246 us(310 us): open_hca: device mthca0 not found
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-cma-roe-eth2
CMA: unable to open /dev/infiniband/rdma_cm
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-cma-roe-eth3
CMA: unable to open /dev/infiniband/rdma_cm
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-scm-roe-mlx4_0-1
node001-mic0:43df:5cbb7700: 129 us(129 us): open_hca: device mlx4_0 not found
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-2u
node001-mic1:3fbc:64958700: 29111 us(5861 us): open_hca: device mlx4_0 not found
node001-mic0:43df:5cbb7700: 510 us(381 us): open_hca: device mlx4_0 not found
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-scm-roe-mlx4_0-2
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mthca0-1u
node001-mic1:3fbc:64958700: 30111 us(1000 us): open_hca: device mlx4_0 not found
node001-mic0:43df:5cbb7700: 1617 us(1107 us): open_hca: device mthca0 not found
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mcm-1
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mthca0-2u
node001-mic0:43df:5cbb7700: 2309 us(692 us): open_hca: device mthca0 not found
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-cma-roe-eth2
CMA: unable to open /dev/infiniband/rdma_cm
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-cma-roe-eth3
CMA: unable to open /dev/infiniband/rdma_cm
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-scm-roe-mlx4_0-1
node001-mic1:3fbc:64958700: 124 us(124 us): open_hca: device mlx4_0 not found
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mcm-2
node001-mic1:3fbc:64958700: 509 us(385 us): open_hca: device mlx4_0 not found
[1] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-scif0
node001-mic0:43df:5cbb7700: 39397 us(8912 us): open_hca: device mlx4_0 not found
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-scm-roe-mlx4_0-2
node001-mic0:43df:5cbb7700: 40925 us(1528 us): open_hca: device mlx4_0 not found
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mcm-1
node001-mic0:43df:5cbb7700: 105 us(105 us): open_hca: device mlx4_0 not found
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mcm-2
node001-mic0:43df:5cbb7700: 712 us(607 us): open_hca: device mlx4_0 not found
[0] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-scif0
[1] MPI startup(): DAPL provider ofa-v2-scif0
[1] MPI startup(): dapl data transfer mode
[0] MPI startup(): DAPL provider ofa-v2-scif0
[0] MPI startup(): dapl data transfer mode
node001-mic0:43df:5cbb7700: 60619 us(19694 us): DAPL ERR reg_mr Cannot allocate memory
[0:node001-mic0][../../dapl_conn_rc.c:1212] error(0x30000): ofa-v2-scif0: could not register memory for internal RDMA buffers: DAT_INSUFFICIENT_RESOURCES()
Assertion failed in file ../../dapl_conn_rc.c at line 1212: 0
node001-mic1:3fbc:64958700: 55664 us(25553 us): DAPL ERR reg_mr Cannot allocate memory
internal ABORT - process 0
[1:node001-mic1][../../dapl_conn_rc.c:1212] error(0x30000): ofa-v2-scif0: could not register memory for internal RDMA buffers: DAT_INSUFFICIENT_RESOURCES()
Assertion failed in file ../../dapl_conn_rc.c at line 1212: 0
internal ABORT - process 0
~ $
Note that there is no /dev/infiniband/rdma_cm, only /dev/infiniband/uverbs0. I am using MPSS Update 2.
Thanks
And some additional information (from one of the MICs):
~ $ cat /etc/dat.conf
# DAT v2.0, v1.2 configuration file
#
# Each entry should have the following fields:
#
# <ia_name> <api_version> <threadsafety> <default> <lib_path> \
# <provider_version> <ia_params> <platform_params>
#
# For uDAPL cma provder, <ia_params> is one of the following:
# network address, network hostname, or netdev name and 0 for port
#
# For uDAPL scm provider, <ia_params> is device name and port
# For uDAPL ucm provider, <ia_params> is device name and port
# For uDAPL iWARP provider, <ia_params> is netdev device name and 0
# For uDAPL iWARP provider, <ia_params> is netdev device name and 0
# For uDAPL RoCE provider, <ia_params> is device name and 0
#
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
ofa-v2-ehca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-cma-roe-eth2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-cma-roe-eth3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth3 0" ""
ofa-v2-scm-roe-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-scm-roe-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mcm-1 u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mcm-2 u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-scif0 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "scif0 1" ""
ofa-v2-scif0-u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "scif0 1" ""
~ $ ibv_devinfo
hca_id: scif0
transport: iWARP (1)
fw_ver: 0.0.1
node_guid: 0000:00ff:ff00:0100
sys_image_guid: 0000:00ff:ff00:0100
vendor_id: 0x8086
vendor_part_id: 0
hw_ver: 0x1
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 1001
port_lmc: 0x00
link_layer: IB
~ $ mpirun -V
Intel(R) MPI Library for Linux* OS, Version 4.1.0 Build 20130116
Copyright (C) 2003-2013, Intel Corporation. All rights reserved.
~ $ mpirun -genv I_MPI_DAPL_PROVIDER ofa-v2-scif0 -env I_MPI_DEBUG=5 -env I_MPI_MIC=enable -env I_MPI_FABRICS=shm:dapl -f mpi_hosts -perhost 1 -n 2 $MPI_DIR/bin/IMB-MPI1 PingPong
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[0] MPI startup(): DAPL provider ofa-v2-scif0
[0] MPI startup(): shm and dapl data transfer modes
[1] MPI startup(): DAPL provider ofa-v2-scif0
[1] MPI startup(): shm and dapl data transfer modes
node001-mic0:49c6:f3d8700: 59810 us(59810 us): DAPL ERR reg_mr Cannot allocate memory
[0:node001-mic0][../../dapl_conn_rc.c:1212] error(0x30000): ofa-v2-scif0: could not register memory for internal RDMA buffers: DAT_INSUFFICIENT_RESOURCES()
Assertion failed in file ../../dapl_conn_rc.c at line 1212: 0
internal ABORT - process 0
node001-mic1:44f7:d6a35700: 31838 us(31838 us): DAPL ERR reg_mr Cannot allocate memory
[1:node001-mic1][../../dapl_conn_rc.c:1212] error(0x30000): ofa-v2-scif0: could not register memory for internal RDMA buffers: DAT_INSUFFICIENT_RESOURCES()
Assertion failed in file ../../dapl_conn_rc.c at line 1212: 0
internal ABORT - process 0
~ $
~ $ free
             total       used       free     shared    buffers
Mem:       7881940    1555904    6326036          0          0
-/+ buffers:          1555904    6326036
Swap:            0          0          0
~ $
[root@node001 ~]# cat /sys/class/mic/mic0/cmdline
root=ramfs console=hvc0 reg_cache=1 huge_page=1 watchdog=1 watchdog_auto_reboot=0 crash_dump=0 p2p=1 p2p_proxy=1 pm_qos_cpu_dma_lat=75 micpm=cpufreq_on;corec6_off;pc3_on;pc6_on
[root@node001 ~]#
Update: this happens only when a non-root user starts mpirun. If root starts it, the run finishes successfully.
Hi all,
Was there ever a solution to this for non-root users?
As root it works; as a non-root user it fails with the same error.
Cheers,
Gerald
Gerald,
Are you getting the same error now? Funnily enough, earlier today I was reading through old forum posts and came across https://software.intel.com/en-us/forums/topic/498731, where the user was having difficulty getting dapl to work. In that case the problem turned out to be the memory limits set in /etc/security/limits.conf. The messages in Taras' post show that MPI startup selects the right dapl provider, ofa-v2-scif0, but then memory allocation fails. Rather than using "free" to show the absolute memory available, it would be worth running "ulimit" (for example "ulimit -l" for locked memory) as the affected user to see whether that user has a limit that is too small for the job.
Frances
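The check suggested above could be sketched roughly as follows. This is only an illustration, assuming the shell on the card supports `ulimit -l` (bash and busybox ash do); run it as the non-root user that starts mpirun and again as root, then compare the two results.

```shell
# Report the locked-memory (memlock) limits of the current shell.
# DAPL pins its RDMA buffers when registering them, so a small memlock
# limit can make registration fail even when "free" shows plenty of memory.
soft=$(ulimit -S -l)   # soft limit in KiB, or "unlimited"
hard=$(ulimit -H -l)   # hard limit in KiB, or "unlimited"
echo "user=$(id -un) memlock soft=$soft hard=$hard"
if [ "$soft" != "unlimited" ]; then
    echo "memlock is capped at $soft KiB for this user"
fi
```

If root reports "unlimited" while the ordinary user reports a small number, that would be consistent with the root-only success described earlier in the thread.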
Hi Frances,
Thanks for the prompt reply.
Yes, that's exactly what we thought: changing limits.conf might solve it, which is how the following post came about:
https://software.intel.com/en-us/forums/topic/517426
I made changes for the MICs by editing /var/mpss/common/etc/security/limits.conf, but they seem to be ignored.
I have added:
* soft memlock unlimited
* hard memlock unlimited
* soft core 0
* hard core 0
but the limits are still not applied. I thought the following form (where "-" sets both soft and hard) might also work, but it made no difference:
* - memlock unlimited
* - core 0
Is /var/mpss/common/etc/security/limits.conf the right place to make those changes? It's probably something obvious :-)
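One way to narrow this down is to confirm which memlock entries actually ended up in the file on the card, then compare them with what `ulimit -l` reports in a fresh login as the non-root user. limits.conf is applied by pam_limits at login, so a correct file does not by itself prove the limit is in effect. The helper below is a hypothetical sketch; the name `memlock_entries` and the path in the usage comment are assumptions, not part of MPSS.

```shell
# List the active (non-comment) memlock entries in a limits.conf-style file.
# Each entry has the fields: <domain> <type> <item> <value>.
memlock_entries() {
    awk '$1 !~ /^#/ && $3 == "memlock" { print $1, $2, $4 }' "$1"
}

# Example usage on the card (path assumed):
#   memlock_entries /etc/security/limits.conf
#   ulimit -l
```

If the entries are listed but `ulimit -l` still shows a small value, the file is present but is probably not being read for that login path.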
Two different methods for setting the limit value are given in https://software.intel.com/en-us/forums/topic/517426 and https://software.intel.com/forums/topic/404071. The problem with the approach here may have been the need to set UsePAM to yes in the sshd configuration, as shown in topic 517426.
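As a rough way to check the PAM setting, the sketch below prints the UsePAM value from an sshd_config-style file. The function name `usepam_setting` and the path in the usage comment are assumptions for illustration; the parsing is simplified and only handles the whitespace-separated "UsePAM yes" form.

```shell
# Print the UsePAM value from an sshd_config-style file, or "unset".
# sshd keywords are case-insensitive and the first occurrence wins.
usepam_setting() {
    awk 'tolower($1) == "usepam" { print tolower($2); found = 1; exit }
         END { if (!found) print "unset" }' "$1"
}

# Example usage on the card (path assumed):
#   usepam_setting /etc/ssh/sshd_config
```

A result of "no" or "unset" would mean that ssh logins bypass PAM, in which case pam_limits never reads limits.conf for those sessions.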
