Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

HYDT_bscu_wait_for_completion

Eh_C_
Beginner

I am getting the following message at seemingly arbitrary times when running a parallel job with an OpenFOAM application compiled with icc and Intel MPI. When I have a single job running, it is fine, but when multiple jobs run at the same time they all crash.

lsb_launch(): Failed while waiting for tasks to finish.
[mpiexec@ys0271] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[mpiexec@ys0271] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[mpiexec@ys0271] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:521): bootstrap server returned error waiting for completion
[mpiexec@ys0271] main (./ui/mpich/mpiexec.c:548): process manager error waiting for completion

James_T_Intel
Moderator

Hi,

Unfortunately, this error alone gives very little information.  Does the crash happen at the beginning, or during execution?  Is this on a single node, or multiple nodes?  When you say job, do you mean MPI ranks/processes, or do you mean separate sets of linked processes?

Also, please run with I_MPI_DEBUG=5 and send the output.
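
For example, a minimal way to do this in your job script could look like the following (the solver name, rank count, and log file name here are just placeholders, not taken from your setup):

[plain]
# enable Intel MPI debug output at level 5 for this run
export I_MPI_DEBUG=5
mpirun -n 32 ./interFoam -parallel > run_debug.log 2>&1
[/plain]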

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh_C_
Beginner

Thanks. I attached the output of the job. There are several MPI jobs running simultaneously under the same user, and I think there is a conflict between them. All of the jobs crashed and only one of them keeps executing.

James_T_Intel
Moderator

Hi,

It is certainly possible that there is a conflict between the jobs, if a resource held by the first job is needed by, but unavailable to, the other jobs.  The output you sent has no MPI debug output; it appears to be LSF output only.  Are the jobs being run in the same folder?  If so, and all are using the same launch script, the host file is being overwritten.  I would recommend letting mpirun detect the host file provided by LSF rather than building a separate one, if possible.
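
As a rough sketch, assuming a bsub script that currently writes its own host file (the core count and solver name below are placeholders), the idea is to drop the explicit host file and let mpirun pick up the hosts that LSF allocated:

[plain]
#BSUB -n 32
#BSUB -o job_%J.out
# instead of: mpirun -machinefile my_hosts -n 32 ./interFoam -parallel
# let mpirun detect the LSF-provided host list:
mpirun -n 32 ./interFoam -parallel
[/plain]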

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh_C_
Beginner

All the jobs are in the same folder, but I used different names for the host files. I wonder how it is possible to remove the conflict between them.

James_T_Intel
Moderator

Hi,

I would recommend putting them in different folders.  OpenFOAM* could be using a file in that folder that is common between the jobs.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh_C_
Beginner

I put them in different folders, and again the jobs crashed.

James_T_Intel
Moderator

Hi,

Are the jobs running on different hosts?  Please try adding

[plain]-genv I_MPI_DEBUG 5[/plain]

to the mpirun options and send the mpirun output.
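
For example (the process count and solver name are placeholders):

[plain]
mpirun -genv I_MPI_DEBUG 5 -n 32 ./interFoam -parallel
[/plain]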

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh_C_
Beginner

I checked; all the jobs are running on different hosts. I have attached the log file as well as the error dump and output.

James_T_Intel
Moderator

Hi,

I'm still not seeing anything in the output that really points to the problem.  Does the crash ever occur with only a single job running?  Are there common files used by OpenFOAM that could be locked by one job and thus inaccessible to a different job?  Does this crash happen with any other programs?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh_C_
Beginner

I have used OpenFOAM on many different clusters and it was fine. What is different now is that the platform is LSF and I compiled with icc and Intel MPI. All the jobs except one crash, and only that one keeps running. It may happen after 2 or 3 hours of running.

James_T_Intel
Moderator

Hi,

On this cluster, can you run outside of LSF?  Have you been able to run with ICC and IMPI on a different cluster?  Let's try to isolate one change at a time.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh_C_
Beginner

I do not think it is possible to change the platform, but I can compile OpenFOAM with gcc and mpi_cc.

I also strongly suspect this may be a memory problem, but I do not know how to configure for that.

James_T_Intel
Moderator

If the jobs are running on separate nodes, there should be no conflicts in memory usage.  Can you send me the case you are using with OpenFOAM, along with the OpenFOAM configuration you're using?  I'll see if I can replicate it here as well.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh_C_
Beginner

James, I am now sure that the problem is a consequence of running several jobs on the cluster simultaneously. There are two scenarios: 1. memory for the user is limited and it is a stack problem; 2. the jobs conflict with each other and it is a problem with the job scheduler. I have no idea how to isolate this problem, and I think it would take a long time to replicate it on your cluster.

James_T_Intel
Moderator

Hi,

Does this occur with any other programs?  One of the clusters I use has LSF, and is not showing this problem, but I don't use OpenFOAM.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh_C_
Beginner

Hi

Our setup does not support launching jobs using Hydra. That essentially means many of the ports Hydra needs are not open.

How is it possible to launch the parallel job without Hydra?

James_T_Intel
Moderator

Hi,

To use MPD instead, set I_MPI_PROCESS_MANAGER=mpd and use mpirun.  This will start an MPD ring on the hosts involved in the job, run the job, and stop the MPD ring.  Or, if you set up the MPD ring across your entire cluster ahead of time, you can use mpiexec instead.
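
Roughly, the two approaches look like this (the host file name, host count, and process count are placeholders for your setup):

[plain]
# per-job MPD ring: mpirun starts the ring, runs the job, then shuts the ring down
export I_MPI_PROCESS_MANAGER=mpd
mpirun -n 32 ./a.out

# or start a persistent MPD ring first, then launch with mpiexec
mpdboot -n 4 -f mpd.hosts    # one mpd per host listed in mpd.hosts
mpiexec -n 32 ./a.out
mpdallexit                   # shut the ring down when finished
[/plain]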

James.

Eh_C_
Beginner

Thanks, James. I ran mpdallexit and no MPD was running on the host, and I exported I_MPI_PROCESS_MANAGER=mpd. But mpirun was still running and there was no error, which suggests it was still using Hydra. How can I check whether it is running with an MPD ring?

James_T_Intel
Moderator

Hi,

If you see "mpiexec.hydra" in the output from ps, then you are using Hydra.  If you just see "mpiexec", then you are using MPD.  Also, as I previously said, you can launch the MPD ring ahead of time, and then use mpiexec to launch using the MPD ring.
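
For example, while the job is running, something like this on the launch node should show which launcher is active (and the mpd daemons, if the ring is up):

[plain]
ps -ef | grep -E 'mpiexec|mpd' | grep -v grep
[/plain]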

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh_C_
Beginner

Hi James,

I have compiled my code with icc and Intel MPI and can run it with mpirun. Using the LSF platform with bsub -a openmpi -n number_cpus mpirun.lsf a.out, it does not run and I get the error.
