Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Intel MPI with LSF: stdoe_cb assert (!closed) failed

Tingyang_X_
Beginner

Dear all,

I am trying to run an application with Intel MPI and LSF on our cluster, but I am still running into trouble. I have installed Intel Cluster Studio XE 2013 for Linux and Platform LSF 7.

The application is an extension of RAMS (High Resolution Forecast Europe, Greece, Athens) compiled with HDF5, Intel Fortran, and Intel MPI. The application normally runs for 6 hours, but sometimes we get errors like the ones below:

[mpiexec@cn104] stdoe_cb (./ui/utils/uiu.c:385): assert (!closed) failed
[mpiexec@cn104] control_cb (./pm/pmiserv/pmiserv_cb.c:831): error in the UI defined callback
[mpiexec@cn104] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@cn104] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:430): error waiting for event
[mpiexec@cn104] main (./ui/mpich/mpiexec.c:847): process manager error waiting for completion


The error happens quite often but is not reproducible: rerunning the failed job with the same settings succeeds.

The bsub command:

$ bsub -x -n 144 -oo ini.log -eo error.log -K 'mpirun -np 144 ./iclams_opt -f ICLAMSIN'

Do you have any idea what might be causing this?

Thanks in advance,

Tingyang Xu

James_T_Intel
Moderator

Since this is an intermittent error, it will be more difficult to debug. What fabric are you using? Does this occur with I_MPI_FABRICS=shm:tcp as well?
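If you want to try that without changing your job script much, one option (just a sketch based on the bsub command you posted, using mpirun's -genv option to pass the environment variable to all ranks) is:

$ bsub -x -n 144 -oo ini.log -eo error.log -K 'mpirun -genv I_MPI_FABRICS shm:tcp -np 144 ./iclams_opt -f ICLAMSIN'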

Tingyang_X_
Beginner

Hello James,

Thank you for your reply. I did not specify I_MPI_FABRICS when running mpirun. But since we are using InfiniBand with Mellanox switches, I think the fabric being used is ofa.

I will try I_MPI_FABRICS=shm:tcp with mpirun. By the way, if I switch the fabric to TCP, will it lower the performance of the software? We hope the software can finish computing within 6-7 hours.

 

Thanks,

Tingyang Xu

James_T_Intel
Moderator

Using TCP will almost certainly lower the performance. However, for debugging purposes we are trying to isolate the cause of the problem, and knowing whether or not it fails under TCP helps to do that.
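It can also help to confirm which fabric is actually selected at startup. One way (a sketch; I_MPI_DEBUG at level 2 or higher makes Intel MPI print startup information, including the chosen fabric) is:

$ bsub -x -n 144 -oo ini.log -eo error.log -K 'mpirun -genv I_MPI_DEBUG 2 -np 144 ./iclams_opt -f ICLAMSIN'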

Tingyang_X_
Beginner

Thank you for your explanation. Let me try TCP first.

Tingyang_X_
Beginner

Hello James,

I just found that this issue has not appeared for at least 5 days since I changed the number of cores from 144 to 160. Before that, I was facing the issue almost every day. We have 16 cores on each node, so 144 ranks use 9 nodes while 160 use 10. Do you think an odd number of nodes could cause that issue?

 

Thanks,

Tingyang Xu

James_T_Intel
Moderator

It's possible. If you think that's the concern, try running with different rank placement options: for example, run the 144 ranks across 10 nodes instead of 9 by decreasing the number of ranks per node with -ppn. Have you seen this error with other applications?
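As a sketch of that idea (assuming 16 cores per node, so 144 ranks normally fill exactly 9 nodes): reserve 160 slots from LSF but cap the ranks per node at 15, which spreads the 144 ranks across 10 nodes instead of 9:

$ bsub -x -n 160 -oo ini.log -eo error.log -K 'mpirun -np 144 -ppn 15 ./iclams_opt -f ICLAMSIN'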

Tingyang_X_
Beginner

I see. I had never encountered that issue before because this was the first time I tried an odd number of nodes. Thank you for your help.
