Is there any way of diagnosing what might be causing the following error?
PANIC in ../../src/mpid/ch3/channels/nemesis/netmod/ofa/cm/dapl/common/dapl_evd_cq_async_error_callb.c:71:dapl_evd_cq_async_error_callback
NULL == context
This is Intel MPI 2018.4, run using the release_mt version of libmpi.so.
The application runs with MPI_THREAD_MULTIPLE on CentOS 7.2 with mlx5 hardware.
I have a case that fails with I_MPI_FABRICS=shm:dapl.
Here is the error output:
prod-0026:UCM:2bfae:c5ca9700: 18942905 us(18942905 us!!!): dapl async_event CQ (0x43d68f0) ERR 0
prod-0026:UCM:2bfae:c5ca9700: 18942927 us(22 us): -- dapl_evd_cq_async_error_callback (0x42ec630, 0x4329010, 0x7fa2c5ca8d30, 0x43d68f0)
prod-0026:UCM:2bfae:c5ca9700: 18942944 us(17 us): dapl async_event QP (0x42abda0) Event 1
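
To correlate these lines across ranks, I have been splitting them into fields with a small script. This is only a sketch: the field meanings (host, provider, hex process id, hex thread id, timestamp and delta in microseconds) are my assumption from the layout of the lines above, not something I have confirmed in the DAPL documentation, and `parse_dapl_line` is just a helper name I made up.

```python
import re

# Assumed layout (guessed from the log lines in this post):
#   host:provider:pid(hex):tid(hex): <t> us(<dt> us[!!!]): message
LINE_RE = re.compile(
    r"^(?P<host>[^:]+):(?P<provider>[^:]+):"
    r"(?P<pid>[0-9a-fA-F]+):(?P<tid>[0-9a-fA-F]+):\s+"
    r"(?P<t_us>\d+) us\((?P<dt_us>\d+) us!*\):\s*(?P<msg>.*)$"
)


def parse_dapl_line(line):
    """Split one DAPL debug line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    # pid and tid look like hex values without a 0x prefix (assumption).
    rec["pid"] = int(rec["pid"], 16)
    rec["tid"] = int(rec["tid"], 16)
    rec["t_us"] = int(rec["t_us"])
    rec["dt_us"] = int(rec["dt_us"])
    # Collect every 0x... value so the same handle can be tracked across lines.
    rec["handles"] = re.findall(r"0x[0-9a-fA-F]+", rec["msg"])
    return rec


if __name__ == "__main__":
    sample = (
        "prod-0026:UCM:2bfae:c5ca9700: 18942905 us(18942905 us!!!): "
        "dapl async_event CQ (0x43d68f0) ERR 0"
    )
    print(parse_dapl_line(sample))
```

With the lines above, the value 0x43d68f0 from the CQ async event also appears as the last argument of the dapl_evd_cq_async_error_callback line, so both lines seem to refer to the same object.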
Could this be caused by an OFED bug? The system is running Mellanox OFED 3.2.
Switching to Intel MPI 2019u2 and using the release version of libmpi seems to work, but the release_mt version of libmpi causes a deadlock.
I will submit a separate post with the stack trace, as it looks like a bug in Intel MPI.