I am getting the following message intermittently when running a parallel job with an OpenFOAM application compiled with icc and Intel MPI. With a single running job it is fine, but when multiple jobs are running they all crash.
lsb_launch(): Failed while waiting for tasks to finish.
[mpiexec@ys0271] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[mpiexec@ys0271] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[mpiexec@ys0271] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:521): bootstrap server returned error waiting for completion
[mpiexec@ys0271] main (./ui/mpich/mpiexec.c:548): process manager error waiting for completion
Hi,
What does
[plain]-a openmpi[/plain]
do? The Intel® MPI Library is not compatible with OpenMPI. When I use LSF*, I run with a job script similar to the attached file (renamed to .txt for attaching), using
[plain]bsub -W <time> < run.sh[/plain]
Try something similar to this and see if it works.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
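For reference, a minimal LSF job script along these lines might look as follows (the rank count, file names, and the Intel MPI environment-script path are placeholders, not taken from the attached file):

```shell
#!/bin/bash
#BSUB -n 16              # total MPI ranks (placeholder value)
#BSUB -o output.%J       # stdout file (%J expands to the job ID)
#BSUB -e error.%J        # stderr file

# load the Intel MPI environment (path is a placeholder; it varies by install)
source /opt/intel/impi/latest/env/vars.sh

# with LSF integration, mpirun picks up the allocated hosts from the job
mpirun ./my_openfoam_app
```

It would then be submitted with `bsub -W <time> < run.sh`, as described above.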
Hi James, Thanks.
I used -a intelmpi. The problem is that I frequently get the error below:
ys1466:5738:ee08e700: 22629011 us(22629011 us!!!): CONN_REQUEST: SOCKOPT ERR Connection timed out -> 10.16.25.59 27407 - RETRYING... 5
ys1466:5738:ee08e700: 43628981 us(20999970 us!!!): CONN_REQUEST: SOCKOPT ERR Connection timed out -> 10.16.25.59 27407 - RETRYING... 5
ys1466:5738:ee08e700: 64629008 us(21000027 us!!!): CONN_REQUEST: SOCKOPT ERR Connection timed out -> 10.16.25.59 27407 - RETRYING... 5
ys1466:5738:ee08e700: 85628981 us(20999973 us!!!): CONN_REQUEST: SOCKOPT ERR Connection timed out -> 10.16.25.59 27407 - RETRYING... 5
ys1466:5738:ee08e700: 106628981 us(21000000 us!!!): CONN_REQUEST: SOCKOPT ERR Connection timed out -> 10.16.25.59 27407 - RETRYING... 5
When I use your script, it asks for the number of processors, i.e. mpirun -n <number of processors>; I cannot run it with just mpirun.
Hi,
That appears to be a problem with InfiniBand*. Please check your IB connections. You can use ibdiagnet to do this.
Add a -n <numranks> to my script if needed.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
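If the `-n <numranks>` value should follow the LSF allocation rather than be hard-coded, it can be derived from the LSB_HOSTS variable that LSF sets in the job environment (one hostname per allocated slot). A minimal sketch, with an illustrative hostname list standing in for the real allocation:

```shell
# LSB_HOSTS lists one hostname per allocated slot; this value is illustrative
LSB_HOSTS="ys0271 ys0271 ys1466 ys1466"

# one MPI rank per slot
NP=$(echo "$LSB_HOSTS" | wc -w)
echo "$NP"
```

The job script could then launch with `mpirun -n $NP ./my_openfoam_app`.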
I used ibdiagnet and got the error below. I wonder whether it is the reason I cannot launch jobs using mpirun.lsf.
Plugin Name Result Comment
libibdiagnet_cable_diag_plugin Succeeded Plugin loaded
libibdiagnet_cable_diag_plugin-2.1.1 Failed Plugin options issue - Option "get_cable_info" from requester "Cable Diagnostic (Plugin)" already exists in requester "Cable Diagnostic (Plugin)"
---------------------------------------------
Discovery
-E- Failed to initialize
---------------------------------------------
Summary
-I- Stage Warnings Errors Comment
-I- Discovery NA
-I- Lids Check NA
-I- Links Check NA
-I- Subnet Manager NA
-I- Port Counters NA
-I- Nodes Information NA
-I- Speed / Width checks NA
-I- Partition Keys NA
-I- Alias GUIDs NA
-I- You can find detailed errors/warnings in: /var/tmp/ibdiagnet2/ibdiagnet2.log
-E- A fatal error occurred, exiting...
Hi,
Yes, that is very likely part of the problem. Try running with I_MPI_FABRICS=shm:tcp to use sockets instead of InfiniBand*, and that will help determine if there is another problem. Once you have the InfiniBand* working correctly, try again without setting the fabric.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
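As a sketch, the fabric override suggested above would go before the mpirun line in the job script; shm:tcp selects shared memory within a node and TCP sockets between nodes, bypassing InfiniBand:

```shell
# force Intel MPI onto shared memory + TCP instead of InfiniBand/DAPL
export I_MPI_FABRICS=shm:tcp
echo "$I_MPI_FABRICS"
```

If the job runs cleanly with this setting, that points the finger at the InfiniBand layer rather than the application or the launcher.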
Hi James,
I compiled my code with IBM PE and it launched with mpirun.lsf, but I am still wondering why it cannot launch with Intel MPI.
Thanks
Is mpirun.lsf using InfiniBand*?
James.
Both mpirun.lsf and mpirun are using InfiniBand*. I wonder whether Intel MPI supports IBM PE.
Thanks
Are you using the IBM* MPI implementation to compile, and the Intel® MPI Library to run? That is not supported. You will need to compile and run with the same implementation.
James.
I compiled with Intel MPI and tried to launch it using mpirun.lsf.
Hi,
Please try running with I_MPI_FABRICS=shm:tcp and let me know if this works. Also, please attach your /etc/dat.conf file.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools