Hi
Is there any way of diagnosing what might be causing the following error?
PANIC in ../../src/mpid/ch3/channels/nemesis/netmod/ofa/cm/dapl/common/dapl_evd_cq_async_error_callb.c:71:dapl_evd_cq_async_error_callback
NULL == context
Intel MPI 2018.4, run using the release_mt version of libmpi.so
I_MPI_FABRICS=shm:ofa
Running with MPI_THREAD_MULTIPLE on CentOS 7.2 with Mellanox mlx5 hardware
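For reference, here is a minimal sketch of the kind of setup involved (illustrative only, not the actual application; the launch line in the comment is an assumption): request MPI_THREAD_MULTIPLE and check what level the library actually grants.

/* repro.c - minimal sketch (illustrative): request MPI_THREAD_MULTIPLE
 * and verify the threading level actually granted by the library.
 * Assumed launch: I_MPI_FABRICS=shm:ofa mpirun -n 2 ./repro
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided = 0, rank = 0;

    /* Ask for full multithreading support; the library may grant less. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0 && provided < MPI_THREAD_MULTIPLE)
        printf("warning: requested MPI_THREAD_MULTIPLE, got level %d\n",
               provided);

    MPI_Finalize();
    return 0;
}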
Thanks
Jamil
Hi Jamil,
Could you please try I_MPI_FABRICS=shm:dapl with the same scenario?
BR,
Dmitry
Hi Dmitry
It works as expected with no errors when using the dapl fabric.
Jamil
Hi Dmitry
I have a case that fails with I_MPI_FABRICS=shm:dapl
Here is the error:
prod-0026:UCM:2bfae:c5ca9700: 18942905 us(18942905 us!!!): dapl async_event CQ (0x43d68f0) ERR 0
prod-0026:UCM:2bfae:c5ca9700: 18942927 us(22 us): -- dapl_evd_cq_async_error_callback (0x42ec630, 0x4329010, 0x7fa2c5ca8d30, 0x43d68f0)
prod-0026:UCM:2bfae:c5ca9700: 18942944 us(17 us): dapl async_event QP (0x42abda0) Event 1
Could this be caused by an OFED bug? The system is running Mellanox OFED 3.2.
Jamil
Hi Dmitry
Switching to Intel MPI 2019u2 and using the release version of libmpi seems to work. The release_mt version of libmpi causes a deadlock.
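As a side note, a quick sanity check for which MPI library actually gets picked up at run time is to print the library identification string (a sketch, assuming an MPI-3 implementation; the string in the comment is only an example):

/* version_check.c - sketch: print the MPI library identification string
 * to confirm which library version the binary is running against. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len = 0;

    MPI_Init(&argc, &argv);
    MPI_Get_library_version(version, &len);
    printf("%s\n", version);  /* e.g. "Intel(R) MPI Library 2019 Update 2 ..." */
    MPI_Finalize();
    return 0;
}

Telling the release and release_mt builds apart may still require checking which libmpi.so the binary resolves to (e.g. with ldd).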
I will submit a separate post with the stack trace, as it looks like a bug in Intel MPI.
Thanks
Jamil