dapl async_event QP

Hello

I am seeing the following errors with intel/2018.2 and intelmpi/2018.2 when I use mpiexec to launch my cluster simulations:

dapl async_event CQ (0x1750ff0) ERR 0
dapl_evd_cq_async_error_callback (0x169ada0, 0x16cf460, 0x2ab4fecb9d30, 0x1750ff0)
dapl async_event QP (0x1fdacc0) Event 1

After this point my runs terminate. Any assistance with resolving this error would be much appreciated.

Alexandra

Arlin_D_Intel
Employee

This error is the result of a CQ overrun (completion queue is not large enough for data queue processing).  

IBV_EVENT_CQ_ERR: CQ is in error (CQ overrun)
IBV_EVENT_QP_FATAL: Error occurred on a QP and it transitioned to error state
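
To make the mapping between the log lines and the verbs events explicit, here is a minimal Python sketch that decodes the `dapl async_event` messages. It assumes the number DAPL prints after `ERR` or `Event` is the raw `ibv_event_type` value from libibverbs, which is consistent with the two events named above; the helper name `decode_async_events` is hypothetical and only the first few enum values are listed.

```python
import re

# First entries of enum ibv_event_type in libibverbs (<infiniband/verbs.h>).
# Assumption: DAPL prints this raw enum value in its async_event messages.
IBV_EVENT_NAMES = {
    0: "IBV_EVENT_CQ_ERR",
    1: "IBV_EVENT_QP_FATAL",
    2: "IBV_EVENT_QP_REQ_ERR",
    3: "IBV_EVENT_QP_ACCESS_ERR",
}

# Matches lines such as "dapl async_event CQ (0x1750ff0) ERR 0".
ASYNC_EVENT_RE = re.compile(
    r"dapl async_event (CQ|QP) \((0x[0-9a-fA-F]+)\) (?:ERR|Event) (\d+)"
)


def decode_async_events(log_text):
    """Return (object kind, handle, code, event name) for each async event."""
    events = []
    for match in ASYNC_EVENT_RE.finditer(log_text):
        kind = match.group(1)
        handle = match.group(2)
        code = int(match.group(3))
        name = IBV_EVENT_NAMES.get(code, "unknown")
        events.append((kind, handle, code, name))
    return events


if __name__ == "__main__":
    sample = (
        "dapl async_event CQ (0x1750ff0) ERR 0\n"
        "dapl async_event QP (0x1fdacc0) Event 1\n"
    )
    for event in decode_async_events(sample):
        print(event)
```

Running it on the output of a failed job shows which event hit first; a CQ error followed by a fatal QP event is the pattern described here.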

The default CQ (EVD) size can be increased. What is the size of your job, and which MPI and/or DAPL tuning parameters are being set?
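
One way to collect the tuning parameters in effect is to dump the relevant environment variables from inside the job script. The sketch below is only an illustration: the function name `tuning_variables` is hypothetical, and filtering on the `I_MPI_` and `DAPL_` prefixes is an assumption about which variables matter here.

```python
import os

# Assumed prefixes: Intel MPI settings start with I_MPI_, DAPL settings with DAPL_.
PREFIXES = ("I_MPI_", "DAPL_")


def tuning_variables(environ=None):
    """Return the MPI/DAPL-related variables from an environment mapping."""
    env = os.environ if environ is None else environ
    return {
        name: value
        for name, value in sorted(env.items())
        if name.startswith(PREFIXES)
    }


if __name__ == "__main__":
    # Print one NAME=value pair per line so the output can be pasted into a reply.
    for name, value in tuning_variables().items():
        print(f"{name}={value}")
```

Note that values passed only through `-genv` on the mpiexec command line are set for the launched processes, so the script should run under the same launch command to see them.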

 

 


Thank you, Arlin.

The command I run is: mpiexec -mpi ./jobscript (no particular parameters).

I have experimented with some DAPL parameters such as the following, but they haven't solved the problem.

mpiexec -genv I_MPI_FABRICS shm:ofa -mpi

This problem indeed appears as soon as I increase the size of my jobs. Before that point, my jobs take 2 hours on 40 Skylake nodes, each with 40 cores and 202 GB of RAM.

Thank you,

Alexandra