Intel® HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Problems with IB and IntelMPI

ivcores
Beginner
Dear HPC Forum,
I'm working with Intel MPI on a Linux cluster with an InfiniBand network, and I'm having problems when I use more than 16 processes.

I run the following command

/home/apps/intel/impi/4.0.0.017/bin64/mpirun -r ssh -f mpd.hosts -n 36 -env I_MPI_DEBUG 5 bt.B.36

but I get the following error:


[34] MPI startup(): DAPL provider OpenIB-cma
[33] MPI startup(): DAPL provider OpenIB-cma
...
[11] MPI startup(): shm and dapl data transfer modes
[28] MPI startup(): shm and dapl data transfer modes
...
[0] MPI startup(): static connections storm algo
compute-0-3.local:25871: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.2.1.7,32304
compute-0-3.local:25871: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.2.1.7,32305
compute-0-3.local:25871: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.2.1.7,32303
compute-0-3.local:25871: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.2.1.7,32306
[0:compute-0-3] unexpected DAPL event 0x4008
Assertion failed in file ../../dapl_module_init.c at line 3896: 0
internal ABORT - process 0
...
compute-0-2.local:27022: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.2.1.9,26580
rank 31 in job 1 compute-0-0.local_47979 caused collective abort of all ranks
exit status of rank 31: return code 1


My /etc/dat.conf is:

OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" ""
OpenIB-mthca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 1" ""
OpenIB-mthca0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 2" ""
OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
OpenIB-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""
OpenIB-ipath0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 1" ""
OpenIB-ipath0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 2" ""
OpenIB-ehca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ehca0 1" ""
OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
ofa-v2-ehca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 2" ""
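
As far as I understand, Intel MPI picks the DAPL provider from this file, and a specific entry can be forced with I_MPI_DAPL_PROVIDER. For example, assuming the ofa-v2-ib0 entry matches the active interface, something like this should select it explicitly:

/home/apps/intel/impi/4.0.0.017/bin64/mpirun -r ssh -f mpd.hosts -n 36 -env I_MPI_DAPL_PROVIDER ofa-v2-ib0 -env I_MPI_DEBUG 5 bt.B.36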

I can avoid the DAPL error with these settings:

/home/apps/intel/impi/4.0.0.017/bin64/mpirun -r ssh -f mpd.hosts -n 36 -env I_MPI_FABRICS_LIST "ofa,dapl,tcp" -env I_MPI_DEBUG 5 bt.B.36

In this case the output is:

[0] MPI startup(): shm and ofa data transfer modes
...
[35] MPI startup(): shm and ofa data transfer modes
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS_LIST=ofa,dapl,tcp
[1] MPI Startup(): set domain to {3,11} on node compute-0-5.local
...
[31] MPI Startup(): set domain to {6,14} on node compute-0-7.local
[0] Rank Pid Node name Pin cpu
[0] 0 32347 compute-0-5.local {1,9}
...
[0] 35 29342 compute-0-1.local {4,6,12,14}
rank 15 in job 1 compute-0-0.local_37285 caused collective abort of all ranks
exit status of rank 15: killed by signal 9


I also tried running with the flag -env I_MPI_USE_DYNAMIC_CONNECTIONS 0, but the result is the same.

Finally, the output of ibv_devinfo is:

hca_id: qib0
transport: InfiniBand (0)
fw_ver: 0.0.0
node_guid: 0011:7500:00ff:8829
sys_image_guid: 0011:7500:00ff:8829
vendor_id: 0x1175
vendor_part_id: 29216
hw_ver: 0x2
board_id: InfiniPath_QLE7240
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 2048 (4)
sm_lid: 9
port_lid: 8
port_lmc: 0x00


With Open MPI it works correctly. Do you have any idea what the problem is?

Thanks!
Andres_M_Intel4
Employee
I found the following information while searching the web for your error message; hope it helps.
-- Andres
Workaround in Case of IntelMPI/uDAPL Error "unexpected DAPL event 4008"
The following error may occur on rare occasions with IntelMPI/uDAPL: "unexpected DAPL event 4008 from ..."
To work around this, add the following to your mpirun command:
-genv I_MPI_USE_DYNAMIC_CONNECTIONS 0
This problem is caused by a limitation in Intel MPI/uDAPL's dynamic connection mechanism when MPI
processes are not sufficiently attentive to incoming interconnect traffic.
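
Applied to the command from your first post, that would look something like this (just a sketch; adjust paths and process count as needed):

/home/apps/intel/impi/4.0.0.017/bin64/mpirun -r ssh -f mpd.hosts -genv I_MPI_USE_DYNAMIC_CONNECTIONS 0 -n 36 -env I_MPI_DEBUG 5 bt.B.36
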
ivcores
Beginner

Hi Andres,
Thanks for your reply. Unfortunately, the problem persists. I ran the command

/home/apps/intel/impi/4.0.0.017/bin64/mpirun -r ssh -f mpd.hosts -genv I_MPI_USE_DYNAMIC_CONNECTIONS 0 -n 36 -env I_MPI_FABRICS_LIST "ofa,dapl,tcp" -env I_MPI_DEBUG 5 bt.B.36

and the error is the same:

...
[0] 34 31432 compute-0-1.local {0,2,8,10}
[0] 35 31433 compute-0-1.local {4,6,12,14}
rank 28 in job 1 compute-0-0.local_48003 caused collective abort of all ranks
exit status of rank 28: killed by signal 9


Gergana_S_Intel
Employee

Hi ivcores,

Unfortunately, that's a pretty generic error. All it means is that one of your MPI processes failed, which took down the entire application run.

This could be caused by the use of an older provider (similar to a different issue on this forum). Could I see the output of your ibstat tool? Also, what version of OFED do you have installed (running ofed_info should tell you)? We recommend the latest OFED 1.5.1, if possible.
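
For example, running these two commands on one of the compute nodes should give us both (the OFED version is usually printed near the top of the ofed_info output):

ibstat
ofed_info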

Regards,
~Gergana
