Intel® HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Problems with IB and IntelMPI

ivcores
Beginner
Dear HPC Forum,
I'm working with Intel MPI on a Linux cluster with an InfiniBand network, and I'm having problems when I use more than 16 processes.

I run the following command

/home/apps/intel/impi/4.0.0.017/bin64/mpirun -r ssh -f mpd.hosts -n 36 -env I_MPI_DEBUG 5 bt.B.36

but I get the following error:


[34] MPI startup(): DAPL provider OpenIB-cma
[33] MPI startup(): DAPL provider OpenIB-cma
...
[11] MPI startup(): shm and dapl data transfer modes
[28] MPI startup(): shm and dapl data transfer modes
...
[0] MPI startup(): static connections storm algo
compute-0-3.local:25871: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.2.1.7,32304
compute-0-3.local:25871: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.2.1.7,32305
compute-0-3.local:25871: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.2.1.7,32303
compute-0-3.local:25871: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.2.1.7,32306
[0:compute-0-3] unexpected DAPL event 0x4008
Assertion failed in file ../../dapl_module_init.c at line 3896: 0
internal ABORT - process 0
...
compute-0-2.local:27022: dapl_cma_active: PATH_RECORD_ERR, retries(15) exhausted, DST 10.2.1.9,26580
rank 31 in job 1 compute-0-0.local_47979 caused collective abort of all ranks
exit status of rank 31: return code 1


My /etc/dat.conf is:

OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" ""
OpenIB-mthca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 1" ""
OpenIB-mthca0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 2" ""
OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
OpenIB-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""
OpenIB-ipath0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 1" ""
OpenIB-ipath0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 2" ""
OpenIB-ehca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ehca0 1" ""
OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
ofa-v2-ehca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 2" ""
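
As far as I understand, Intel MPI picks the DAPL provider from this file, and a specific entry can be forced with I_MPI_DAPL_PROVIDER. For example, assuming the ofa-v2-ib0 entry matches the active interface, something like this should select it explicitly:

/home/apps/intel/impi/4.0.0.017/bin64/mpirun -r ssh -f mpd.hosts -n 36 -env I_MPI_DAPL_PROVIDER ofa-v2-ib0 -env I_MPI_DEBUG 5 bt.B.36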

I can avoid the DAPL error with these settings:

/home/apps/intel/impi/4.0.0.017/bin64/mpirun -r ssh -f mpd.hosts -n 36 -env I_MPI_FABRICS_LIST "ofa,dapl,tcp" -env I_MPI_DEBUG 5 bt.B.36

In this case the output is:

[0] MPI startup(): shm and ofa data transfer modes
...
[35] MPI startup(): shm and ofa data transfer modes
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS_LIST=ofa,dapl,tcp
[1] MPI Startup(): set domain to {3,11} on node compute-0-5.local
...
[31] MPI Startup(): set domain to {6,14} on node compute-0-7.local
[0] Rank Pid Node name Pin cpu
[0] 0 32347 compute-0-5.local {1,9}
...
[0] 35 29342 compute-0-1.local {4,6,12,14}
rank 15 in job 1 compute-0-0.local_37285 caused collective abort of all ranks
exit status of rank 15: killed by signal 9


I also tried running with the flag -env I_MPI_USE_DYNAMIC_CONNECTIONS 0, but the result is the same.

Finally, the output of ibv_devinfo is:

hca_id: qib0
transport: InfiniBand (0)
fw_ver: 0.0.0
node_guid: 0011:7500:00ff:8829
sys_image_guid: 0011:7500:00ff:8829
vendor_id: 0x1175
vendor_part_id: 29216
hw_ver: 0x2
board_id: InfiniPath_QLE7240
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 2048 (4)
sm_lid: 9
port_lid: 8
port_lmc: 0x00


With Open MPI it works correctly. Do you have any idea what the problem is?

Thanks!
Andres_M_Intel4
Employee
I found the following information while searching the web for your error message; hope it helps.
-- Andres
Workaround in Case of IntelMPI/uDAPL Error "unexpected DAPL event 4008"
The following error may occur on rare occasions with IntelMPI/uDAPL: "unexpected DAPL event 4008 from ..."
To work around this, add the following to your mpirun command:
-genv I_MPI_USE_DYNAMIC_CONNECTIONS 0
This problem is caused by a limitation in Intel MPI/uDAPL's dynamic connection mechanism when MPI
processes are not sufficiently attentive to incoming interconnect traffic.
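
Applied to the command from your first post, that would look something like this (just a sketch; adjust paths and process count as needed):

/home/apps/intel/impi/4.0.0.017/bin64/mpirun -r ssh -f mpd.hosts -genv I_MPI_USE_DYNAMIC_CONNECTIONS 0 -n 36 -env I_MPI_DEBUG 5 bt.B.36
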
ivcores
Beginner

Hi Andres,
Thanks for your reply. Unfortunately, the problem persists. I ran the command

/home/apps/intel/impi/4.0.0.017/bin64/mpirun -r ssh -f mpd.hosts -genv I_MPI_USE_DYNAMIC_CONNECTIONS 0 -n 36 -env I_MPI_FABRICS_LIST "ofa,dapl,tcp" -env I_MPI_DEBUG 5 bt.B.36

and the error is the same:

...
[0] 34 31432 compute-0-1.local {0,2,8,10}
[0] 35 31433 compute-0-1.local {4,6,12,14}
rank 28 in job 1 compute-0-0.local_48003 caused collective abort of all ranks
exit status of rank 28: killed by signal 9


Gergana_S_Intel
Employee

Hi ivcores,

Unfortunately, that's a pretty generic error. All it means is that one of your MPI processes failed, which took down the entire application run.

This could be caused by the use of an older provider (similar to a different issue on this forum). Could I see the output of your ibstat tool? Also, what version of OFED do you have installed (running ofed_info should tell you)? We recommend the latest OFED 1.5.1, if possible.
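
For example, running these two commands on one of the compute nodes should give us both (the OFED version is usually printed near the top of the ofed_info output):

ibstat
ofed_info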

Regards,
~Gergana
