I am trying to run an application with Intel MPI and LSF on our cluster, but I am still having trouble with it. I have installed Intel Cluster Studio XE 2013 for Linux and Platform LSF 7.
The application is an extension of RAMS (High Resolution Forecast for Europe, Greece, Athens), compiled with HDF5, Intel Fortran, and Intel MPI. The application normally runs for 6 hours, but sometimes we get errors like the following:
[mpiexec@cn104] stdoe_cb (./ui/utils/uiu.c:385): assert (!closed) failed
[mpiexec@cn104] control_cb (./pm/pmiserv/pmiserv_cb.c:831): error in the UI defined callback
[mpiexec@cn104] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@cn104] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:430): error waiting for event
[mpiexec@cn104] main (./ui/mpich/mpiexec.c:847): process manager error waiting for completion
The error happens very often but is not reproducible: retrying the failed run with the same settings succeeds.
The bsub command:
$ bsub -x -n 144 -oo ini.log -eo error.log -K 'mpirun -np 144 ./iclams_opt -f ICLAMSIN'
Do you have any idea?
Thanks in advance,
Thank you for your reply. I did not specify I_MPI_FABRICS when using mpirun, but since we are using InfiniBand with Mellanox switches, I think the fabric is ofa.
I will try I_MPI_FABRICS=shm:tcp with mpirun. By the way, if I switch the fabric to tcp, will it reduce the performance of the software? We need the software to finish computing in 6-7 hours.
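A sketch of that test run, reusing the executable, input file, and log names from the original bsub command (passing the variable via mpirun's -genv option so it reaches all ranks regardless of how LSF propagates the environment):

```shell
# Force shared memory within a node and TCP between nodes, to test
# whether the intermittent failures are specific to the OFA/InfiniBand path.
bsub -x -n 144 -oo ini.log -eo error.log \
  -K 'mpirun -genv I_MPI_FABRICS shm:tcp -np 144 ./iclams_opt -f ICLAMSIN'
```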
Using TCP will almost certainly lower performance. However, for debugging purposes we are trying to isolate the cause of the problem, and whether or not it fails under TCP helps narrow that down.
I just found that this issue has not appeared for at least 5 days since I changed the number of cores from 144 to 160. Before that, I was hitting the issue almost every day. We have 16 cores per node, so do you think an odd number of nodes could cause that issue?
It's possible. If you think that's the concern, try running with different rank placement options: for example, run the same 144 ranks across 10 nodes instead of 9, decreasing the number of ranks per node with -ppn. Have you seen this error with other applications?
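One way to sketch that placement under LSF (the span[ptile] resource string is standard LSF syntax for capping slots per host; the exact slot counts here are illustrative, not from the original post):

```shell
# Request 150 slots at 15 per host, so LSF allocates 10 nodes (16 cores each),
# then launch only 144 ranks with at most 15 per node via Intel MPI's -ppn.
bsub -x -n 150 -R "span[ptile=15]" -oo ini.log -eo error.log \
  -K 'mpirun -ppn 15 -np 144 ./iclams_opt -f ICLAMSIN'
```

This leaves one core free on each node and changes the node count from 9 to 10, which should tell you whether the failures correlate with fully packed nodes.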