Hi,
I have an odd issue.
I'm using LSF 9.1 as the job manager and Intel Parallel Studio 2015 Update 1.
When I submit a simple program (hello world) using 2032 cores (127 nodes), it works well, but when I use more cores, all the processes are
created on all nodes but then hang, and the program never finishes (it doesn't even start).
I've tried launching the process outside LSF (mpirun -hostfile ... ) and it works fine with 2048 cores.
Any suggestions?
Hi Jose. Can you provide the bsub command line and the output of bhist -l <jobid>? Do you have access to the log files, especially those on the job's head node?
Michael,
The bsub command line is:
% bsub -q q_1080p_1h -n 2048 -oo salida mpirun -genv I_MPI_FABRICS shm:ofa ./a.out
I ran it several times. Most runs produced no output, but once I got this:
Sender: LSF System <lsfadmin@mn269>
Subject: Job 523670: <mpirun -genv I_MPI_FABRICS shm:ofa ./a.out> in cluster <cluster1> Exited
Job <mpirun -genv I_MPI_FABRICS shm:ofa ./a.out> was submitted from host <mn328> by user <jlgr> in cluster <cluster1>.
Job was executed on host(s) <16*mn269>, in queue <q_1800p_1h>, as user <jlgr> in cluster <cluster1>.
<16*mn217>
<16*mn218>
<16*mn153>
<16*mn154>
<16*mn157>
<16*mn158>
<16*mn291>
<16*mn292>
<16*mn293>
<16*mn295>
<16*mn300>
<16*mn296>
<16*mn302>
<16*mn298>
<16*mn299>
<16*mn240>
<16*mn241>
<16*mn242>
<16*mn245>
<16*mn247>
<16*mn248>
<16*mn180>
<16*mn182>
<16*mn183>
<16*mn184>
<16*mn185>
<16*mn131>
<16*mn132>
<16*mn134>
<16*mn139>
<16*mn272>
<16*mn273>
<16*mn222>
<16*mn223>
<16*mn226>
<16*mn227>
<16*mn160>
<16*mn161>
<16*mn162>
<16*mn163>
<16*mn166>
<16*mn168>
<16*mn312>
<16*mn250>
<16*mn252>
<16*mn253>
<16*mn259>
<16*mn190>
<16*mn193>
<16*mn195>
<16*mn196>
<16*mn201>
<16*mn197>
<16*mn199>
<16*mn204>
<16*mn205>
<16*mn206>
<16*mn144>
<16*mn146>
<16*mn281>
<16*mn284>
<16*mn287>
<16*mn288>
<16*mn289>
<16*mn230>
<16*mn231>
<16*mn232>
<16*mn238>
<16*mn172>
<16*mn173>
<16*mn174>
<16*mn178>
<16*mn179>
<16*mn277>
<16*mn212>
<16*mn313>
<16*mn306>
<16*mn260>
<16*mn310>
<16*mn149>
<16*mn188>
<16*mn257>
<16*mn304>
<16*mn255>
<16*mn305>
<16*mn211>
<16*mn189>
<16*mn309>
<16*mn275>
<16*mn225>
<16*mn129>
<16*mn219>
<16*mn262>
<16*mn159>
<16*mn164>
<16*mn207>
<16*mn239>
<16*mn258>
<16*mn152>
<16*mn208>
<16*mn156>
<16*mn187>
<16*mn276>
<16*mn151>
<16*mn145>
<16*mn307>
<16*mn176>
<16*mn203>
<16*mn141>
<16*mn249>
<16*mn214>
<16*mn290>
<16*mn221>
<16*mn202>
<16*mn171>
<16*mn140>
<16*mn210>
<16*mn301>
<16*mn286>
<16*mn303>
<16*mn236>
<16*mn213>
<16*mn137>
<16*mn274>
<16*mn147>
<16*mn285>
<16*mn228>
</home/dgsca/jlgr> was used as the home directory.
</tmpu/dgsca/jlgr> was used as the working directory.
Started at Mon Jun 8 12:26:55 2015
Results reported on Mon Jun 8 12:31:57 2015
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
mpirun -genv I_MPI_FABRICS shm:ofa ./a.out
------------------------------------------------------------
Exited with exit code 255.
Resource usage summary:
CPU time : 1.84 sec.
Max Memory : 4597.02 MB
Average Memory : 3947.10 MB
Total Requested Memory : -
Delta Memory : -
Max Swap : 58004 MB
The output (if any) follows:
[proxy:0:32@mn273] HYDU_sock_write (../../utils/sock/sock.c:417): write error (Bad file descriptor)
[proxy:0:32@mn273] main (../../pm/pmiserv/pmip.c:406): unable to send control code to the server
[proxy:0:33@mn222] HYDU_sock_write (../../utils/sock/sock.c:417): write error (Bad file descriptor)
[proxy:0:33@mn222] main (../../pm/pmiserv/pmip.c:406): unable to send control code to the server
[proxy:0:34@mn223] HYDU_sock_write (../../utils/sock/sock.c:417): write error (Bad file descriptor)
[proxy:0:34@mn223] main (../../pm/pmiserv/pmip.c:406): unable to send control code to the server
Jun 8 12:31:51 2015 21018 3 9.1.3 lsb_launch(): Failed while executing tasks.
[proxy:0:9@mn293] HYDT_bscu_wait_for_completion (../../tools/bootstrap/utils/bscu_wait.c:113): one of the processes terminated badly; aborting
[proxy:0:9@mn293] HYDT_bsci_wait_for_completion (../../tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[proxy:0:9@mn293] HYD_pmci_wait_for_childs_completion (../../pm/pmiserv/pmip_utils.c:1718): bootstrap server returned error waiting for completion
[proxy:0:9@mn293] main (../../pm/pmiserv/pmip.c:454): error waiting for event children completion
[mpiexec@mn269] control_cb (../../pm/pmiserv/pmiserv_cb.c:823): connection to proxy 9 at host mn293 failed
[mpiexec@mn269] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@mn269] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:495): error waiting for event
[mpiexec@mn269] main (../../ui/mpich/mpiexec.c:1011): process manager error waiting for completion
I have access to the log files on all nodes, but so far I have not found anything relevant.
I've also run the program outside LSF using
% mpirun -hostfile lh ./a.out
and it works fine (lh just lists one node per line).
I've also tried
% bsub -q q_1080p_1h -n 2048 -oo salida -m <list of nodes> mpirun -hostfile lh -genv I_MPI_FABRICS shm:ofa ./a.out
(where <list of nodes> contains the same nodes as the lh file), and that also works!
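For reference, the lh file does not have to be written by hand. The following is only a sketch of one way to build it from the job's own allocation. It assumes LSF exports LSB_DJOB_HOSTFILE as the path of a file with one host name per allocated slot; the helper name unique_hosts is made up for illustration. The mpirun line is left as a comment and simply repeats the command above.

```shell
#!/bin/sh
# Sketch: build a one-host-per-line file (like "lh") from the LSF allocation.
# Assumption: LSB_DJOB_HOSTFILE points to a file with one host name per slot.

# Hypothetical helper: print each host once, in first-seen order,
# skipping blank lines.
unique_hosts() {
    awk 'NF && !seen[$1]++ { print $1 }' "$1"
}

# Only act when running inside an LSF job; otherwise do nothing.
if [ -n "${LSB_DJOB_HOSTFILE:-}" ] && [ -r "$LSB_DJOB_HOSTFILE" ]; then
    unique_hosts "$LSB_DJOB_HOSTFILE" > lh
    # Same launch line as above, with the explicit host file:
    # mpirun -hostfile lh -genv I_MPI_FABRICS shm:ofa ./a.out
fi
```

Saved as a job script and submitted with bsub, this would make the host file always match the nodes LSF actually assigned, so the -m <list of nodes> option would no longer be needed to keep the two in sync.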
We have exactly the same problem with LSF 9.1.3 and Intel MPI on more than 127 nodes.
Is there any solution?