Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

HYDT_bscu_wait_for_completion

Eh_C_
Beginner

I am getting the following message arbitrarily at times when running a parallel job with an OpenFOAM application compiled with icc and the Intel MPI Library. When I have only one job running it is fine, but when multiple jobs are running, all of them crash.

lsb_launch(): Failed while waiting for tasks to finish.
[mpiexec@ys0271] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[mpiexec@ys0271] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[mpiexec@ys0271] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:521): bootstrap server returned error waiting for completion
[mpiexec@ys0271] main (./ui/mpich/mpiexec.c:548): process manager error waiting for completion

James_T_Intel
Moderator

Hi,

What does

[plain]-a openmpi[/plain]

do? The Intel® MPI Library is not compatible with OpenMPI. When I use LSF*, I run with a job script similar to the attached file (renamed to .txt for attaching), using

[plain]bsub -W <time> < run.sh[/plain]

Try something similar to this and see if it works.
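
For reference (the attachment itself is not reproduced here), a minimal LSF* job script for the Intel® MPI Library usually looks something like the sketch below; the resource requests, source path, and executable name are placeholders, not taken from the attachment:

[plain]#!/bin/bash
#BSUB -n 32                 # total MPI ranks requested from LSF
#BSUB -o job.%J.out         # stdout file
#BSUB -e job.%J.err         # stderr file

# Set up the Intel MPI environment (path is an assumption; adjust to your installation).
source /opt/intel/impi/latest/intel64/bin/mpivars.sh

# When launched inside an LSF job, mpirun picks up the allocated hosts from LSF.
mpirun ./my_solver[/plain]

Submit it with bsub -W <time> < run.sh as shown above.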

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh_C_
Beginner

Hi James, Thanks.

I used -a intelmpi. The problem is that I frequently get the error below:

ys1466:5738:ee08e700: 22629011 us(22629011 us!!!):  CONN_REQUEST: SOCKOPT ERR Connection timed out -> 10.16.25.59 27407 - RETRYING... 5
ys1466:5738:ee08e700: 43628981 us(20999970 us!!!):  CONN_REQUEST: SOCKOPT ERR Connection timed out -> 10.16.25.59 27407 - RETRYING... 5
ys1466:5738:ee08e700: 64629008 us(21000027 us!!!):  CONN_REQUEST: SOCKOPT ERR Connection timed out -> 10.16.25.59 27407 - RETRYING... 5
ys1466:5738:ee08e700: 85628981 us(20999973 us!!!):  CONN_REQUEST: SOCKOPT ERR Connection timed out -> 10.16.25.59 27407 - RETRYING... 5
ys1466:5738:ee08e700: 106628981 us(21000000 us!!!):  CONN_REQUEST: SOCKOPT ERR Connection timed out -> 10.16.25.59 27407 - RETRYING... 5
When I use your script, it asks for the number of processors (i.e., mpirun -n <number of processors>); I cannot run it with just mpirun.

James_T_Intel
Moderator

Hi,

That appears to be a problem with InfiniBand*.  Please check your IB connections.  You can use ibdiagnet to do this.
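
If it helps, a basic fabric sweep can be run from any node with the IB diagnostic tools installed; exact options vary by OFED/ibutils version, so treat this as a sketch:

[plain]# Fabric-wide check; detailed logs go to /var/tmp/ibdiagnet2/ by default.
ibdiagnet

# Local HCA port state (should show State: Active, Physical state: LinkUp).
ibstat[/plain]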

Add a -n <numranks> to my script if needed.
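
For example, something like this inside the job script (using LSF's LSB_DJOB_NUMPROC count of allocated slots here is an assumption about your setup):

[plain]# Launch one MPI rank per slot that LSF allocated to the job.
mpirun -n $LSB_DJOB_NUMPROC ./my_solver[/plain]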

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh_C_
Beginner

I used ibdiagnet and got the error below. I wonder whether it is the reason I cannot launch jobs using mpirun.lsf.
Plugin Name                                   Result     Comment
libibdiagnet_cable_diag_plugin                Succeeded  Plugin loaded
libibdiagnet_cable_diag_plugin-2.1.1          Failed     Plugin options issue - Option "get_cable_info" from requester "Cable Diagnostic (Plugin)" already exists in requester "Cable Diagnostic (Plugin)"

---------------------------------------------
Discovery
-E- Failed to initialize
---------------------------------------------
Summary
-I- Stage                     Warnings   Errors     Comment   
-I- Discovery                                       NA
-I- Lids Check                                      NA
-I- Links Check                                     NA
-I- Subnet Manager                                  NA
-I- Port Counters                                   NA
-I- Nodes Information                               NA
-I- Speed / Width checks                            NA
-I- Partition Keys                                  NA
-I- Alias GUIDs                                     NA

-I- You can find detailed errors/warnings in: /var/tmp/ibdiagnet2/ibdiagnet2.log




-E- A fatal error occurred, exiting...

James_T_Intel
Moderator

Hi,

Yes, that is very likely part of the problem.  Try running with I_MPI_FABRICS=shm:tcp to use sockets instead of InfiniBand*, and that will help determine if there is another problem.  Once you have the InfiniBand* working correctly, try again without setting the fabric.
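
Roughly, the fabric can be forced either through the environment or on the mpirun line, for example (the executable name is a placeholder):

[plain]# Shared memory within a node, TCP sockets between nodes (bypasses InfiniBand/DAPL).
export I_MPI_FABRICS=shm:tcp
mpirun -n <numranks> ./my_solver

# Equivalently, per run:
mpirun -genv I_MPI_FABRICS shm:tcp -n <numranks> ./my_solver[/plain]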

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Eh_C_
Beginner

Hi James,

I compiled my code with IBM PE and it launched with mpirun.lsf, but I am still wondering why it cannot launch with Intel MPI.

Thanks

James_T_Intel
Moderator

Is mpirun.lsf using InfiniBand*?

James.

Eh_C_
Beginner

Both mpirun.lsf and mpirun are using InfiniBand*. I wonder whether Intel MPI supports IBM PE.

Thanks

James_T_Intel
Moderator

Are you using the IBM* MPI implementation to compile, and the Intel® MPI Library to run?  That is not supported.  You will need to compile and run with the same implementation.
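
A quick way to confirm which implementation a binary was built against is to look at the MPI libraries it links and compare that with the launcher in your PATH; a rough check, assuming a dynamically linked executable (name is a placeholder):

[plain]# Which MPI shared libraries does the application link against?
ldd ./my_solver | grep -i mpi

# Which launcher is being picked up, and from which MPI installation?
which mpirun
mpirun -V[/plain]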

James.

Eh_C_
Beginner

I compiled with Intel MPI and tried to launch it using mpirun.lsf.

James_T_Intel
Moderator

Hi,

Please try running with I_MPI_FABRICS=shm:tcp and let me know if this works.  Also, please attach your /etc/dat.conf file.
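
For context, /etc/dat.conf lists the DAPL providers that the Intel® MPI Library can open over InfiniBand*; checking it shows which provider names are available on the node (the provider name below is only an example entry, not taken from your system):

[plain]# List the DAPL providers configured on this node.
cat /etc/dat.conf

# Optionally pin Intel MPI to one provider from that list.
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u[/plain]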

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
