I am seeing the following errors with intel/2018.2 and intelmpi/2018.2 when using mpiexec to submit my cluster simulations.
dapl async_event CQ (0x1750ff0) ERR 0
dapl_evd_cq_async_error_callback (0x169ada0, 0x16cf460, 0x2ab4fecb9d30, 0x1750ff0)
dapl async_event QP (0x1fdacc0) Event 1
After this point my runs terminate. Any assistance with resolving this error would be much appreciated.
This error is the result of a CQ overrun (completion queue is not large enough for data queue processing).
IBV_EVENT_CQ_ERR: CQ is in error (CQ overrun).
IBV_EVENT_QP_FATAL: an error occurred on a QP and it transitioned to the error state.
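As a rough illustration of what a CQ overrun means, here is a toy Python model. It is only a sketch and not the real verbs or DAPL API: the names CompletionQueue, run, and depth are made up for this example. It assumes completions arrive at a fixed rate per step and the consumer polls a fixed number per step.

```python
from collections import deque


class CompletionQueue:
    """Toy fixed-depth completion queue (hypothetical, not the verbs/DAPL API)."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = deque()
        self.in_error = False

    def post_completion(self, work_id):
        # A full queue cannot accept another entry: this is the overrun.
        if len(self.entries) >= self.depth:
            self.in_error = True
            return False
        self.entries.append(work_id)
        return True

    def poll(self, max_entries):
        # The consumer drains up to max_entries completions per call.
        drained = []
        while self.entries and len(drained) < max_entries:
            drained.append(self.entries.popleft())
        return drained


def run(depth, completions_per_step, polled_per_step, steps):
    """Return the step at which the queue overruns, or None if it never does."""
    cq = CompletionQueue(depth)
    next_id = 0
    for step in range(steps):
        for _ in range(completions_per_step):
            if not cq.post_completion(next_id):
                return step
            next_id += 1
        cq.poll(polled_per_step)
    return None


if __name__ == "__main__":
    print(run(depth=8, completions_per_step=4, polled_per_step=2, steps=100))
    print(run(depth=64, completions_per_step=4, polled_per_step=2, steps=100))
    print(run(depth=8, completions_per_step=2, polled_per_step=2, steps=100))
```

In this toy model a deeper queue survives longer bursts before overrunning, and a queue that is drained as fast as it fills never overruns. The real queue sizing and polling behavior in DAPL and Intel MPI is more involved than this.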
The default CQ (EVD) size can be increased. What is the size of your job, and which MPI and/or DAPL tuning parameters are being set?
Thank you, Arlin.
The command I run is: mpiexec -mpi ./jobscript (no particular parameters).
I have experimented with some DAPL parameters, such as the following, but they have not solved the problem.
mpiexec -genv I_MPI_FABRICS shm:ofa -mpi
This problem does indeed appear as soon as I increase the size of my jobs. Below that size, my jobs take 2 hours on 40 Skylake nodes, with 40 cores and 202 GB of RAM per node.