I am running a job on a 4,000-node cluster with InfiniBand. At small scale (8 to 64 nodes), mpirun works well. At medium scale (256 to 512 nodes), mpiexec.hydra has to be used. When I go up to 1024 nodes, I get errors (shown below). My job script looks like this:
module load intel-compilers/12.1.0
module load intelmpi/4.0.3.008
#mpirun -np 64 -perhost 1 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt
mpiexec.hydra -np 1000 -perhost 1 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt
The errors from 1024 nodes are:
[proxy:0:24@r7i6n3] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[proxy:0:24@r7i6n3] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[proxy:0:24@r7i6n3] HYD_pmci_wait_for_childs_completion (./pm/pmiserv/pmip_utils.c:959): bootstrap server returned error waiting for completion
[proxy:0:24@r7i6n3] main (./pm/pmiserv/pmip.c:378): error waiting for event children completion
[proxy:0:15@r3i6n0] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[proxy:0:15@r3i6n0] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[proxy:0:15@r3i6n0] HYD_pmci_wait_for_childs_completion (./pm/pmiserv/pmip_utils.c:959): bootstrap server returned error waiting for completion
[proxy:0:15@r3i6n0] main (./pm/pmiserv/pmip.c:378): error waiting for event children completion
[mpiexec@r3i3n10] connection to proxy terminated unexpectedly
[proxy:0:27@r7i6n11] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[proxy:0:27@r7i6n11] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[proxy:0:27@r7i6n11] HYD_pmci_wait_for_childs_completion (./pm/pmiserv/pmip_utils.c:959): bootstrap server returned error waiting for completion
[proxy:0:27@r7i6n11] main (./pm/pmiserv/pmip.c:378): error waiting for event children completion
[mpiexec@r3i3n10] connection to proxy terminated unexpectedly
[proxy:0:26@r7i6n10] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[proxy:0:26@r7i6n10] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[proxy:0:26@r7i6n10] HYD_pmci_wait_for_childs_completion (./pm/pmiserv/pmip_utils.c:959): bootstrap server returned error waiting for completion
[proxy:0:26@r7i6n10] main (./pm/pmiserv/pmip.c:378): error waiting for event children completion
[proxy:0:2@r3i4n0] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[proxy:0:2@r3i4n0] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[proxy:0:2@r3i4n0] HYD_pmci_wait_for_childs_completion (./pm/pmiserv/pmip_utils.c:959): bootstrap server returned error waiting for completion
[proxy:0:2@r3i4n0] main (./pm/pmiserv/pmip.c:378): error waiting for event children completion
[proxy:0:17@r4i1n5] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[proxy:0:17@r4i1n5] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[proxy:0:17@r4i1n5] HYD_pmci_wait_for_childs_completion (./pm/pmiserv/pmip_utils.c:959): bootstrap server returned error waiting for completion
[proxy:0:17@r4i1n5] main (./pm/pmiserv/pmip.c:378): error waiting for event children completion
[proxy:0:23@r7i6n1] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[proxy:0:23@r7i6n1] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[proxy:0:23@r7i6n1] HYD_pmci_wait_for_childs_completion (./pm/pmiserv/pmip_utils.c:959): bootstrap server returned error waiting for completion
[proxy:0:23@r7i6n1] main (./pm/pmiserv/pmip.c:378): error waiting for event children completion
[proxy:0:9@r3i4n17] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[proxy:0:9@r3i4n17] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[proxy:0:9@r3i4n17] HYD_pmci_wait_for_childs_completion (./pm/pmiserv/pmip_utils.c:959): bootstrap server returned error waiting for completion
[proxy:0:9@r3i4n17] main (./pm/pmiserv/pmip.c:378): error waiting for event children completion
[proxy:0:8@r3i4n16] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[proxy:0:8@r3i4n16] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[proxy:0:8@r3i4n16] HYD_pmci_wait_for_childs_completion (./pm/pmiserv/pmip_utils.c:959): bootstrap server returned error waiting for completion
[proxy:0:8@r3i4n16] main (./pm/pmiserv/pmip.c:378): error waiting for event children completion
[proxy:0:20@r7i0n12] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[proxy:0:20@r7i0n12] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[proxy:0:20@r7i0n12] HYD_pmci_wait_for_childs_completion (./pm/pmiserv/pmip_utils.c:959): bootstrap server returned error waiting for completion
[proxy:0:20@r7i0n12] main (./pm/pmiserv/pmip.c:378): error waiting for event children completion
[mpiexec@r3i3n10] connection to proxy terminated unexpectedly
[proxy:0:30@r7i7n6] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[proxy:0:30@r7i7n6] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[proxy:0:30@r7i7n6] HYD_pmci_wait_for_childs_completion (./pm/pmiserv/pmip_utils.c:959): bootstrap server returned error waiting for completion
[proxy:0:30@r7i7n6] main (./pm/pmiserv/pmip.c:378): error waiting for event children completion
[mpiexec@r3i3n10] connection to proxy terminated unexpectedly
[mpiexec@r3i3n10] connection to proxy terminated unexpectedly
[proxy:0:10@r3i5n0] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[proxy:0:10@r3i5n0] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[proxy:0:10@r3i5n0] HYD_pmci_wait_for_childs_completion (./pm/pmiserv/pmip_utils.c:959): bootstrap server returned error waiting for completion
[proxy:0:10@r3i5n0] main (./pm/pmiserv/pmip.c:378): error waiting for event children completion
[mpiexec@r3i3n10] connection to proxy terminated unexpectedly
[mpiexec@r3i3n10] connection to proxy terminated unexpectedly
[mpiexec@r3i3n10] connection to proxy terminated unexpectedly
[mpiexec@r3i3n10] connection to proxy terminated unexpectedly
[mpiexec@r3i3n10] connection to proxy terminated unexpectedly
[mpiexec@r3i3n10] connection to proxy terminated unexpectedly
[mpiexec@r3i3n10] connection to proxy terminated unexpectedly
[proxy:0:0@r3i3n10] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[proxy:0:0@r3i3n10] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[proxy:0:0@r3i3n10] HYD_pmci_wait_for_childs_completion (./pm/pmiserv/pmip_utils.c:959): bootstrap server returned error waiting for completion
[proxy:0:0@r3i3n10] main (./pm/pmiserv/pmip.c:378): error waiting for event children completion
Ctrl-C caught... cleaning up processes
[press Ctrl-C again to force abort]
[mpiexec@r3i3n10] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:101): one of the processes terminated badly; aborting
[mpiexec@r3i3n10] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): bootstrap device returned error waiting for completion
[mpiexec@r3i3n10] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:521): bootstrap server returned error waiting for completion
[mpiexec@r3i3n10] main (./ui/mpich/mpiexec.c:548): process manager error waiting for completion
Start Epilogue v2.5 Fri Mar 7 09:36:40 EST 2014
Statistics cpupercent=0,cput=00:00:11,mem=130828kb,ncpus=16000,vmem=3842836kb,walltime=04:14:17
End Epilogue v2.5 Fri Mar 7 09:36:51 EST 2014
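As a reading aid for a log like the one above, the hosts named in the `[proxy:...@host]` prefixes can be listed once each with a short shell sketch. The function name `failed_proxy_hosts` and the log file name in the usage comment are hypothetical; the pattern only assumes the prefix format visible in the pasted output.

```shell
# List each host that appears in a "[proxy:X:Y@host]" prefix, once, sorted.
# Lines from the launcher itself ("[mpiexec@host]") are deliberately ignored.
failed_proxy_hosts() {
    grep -o '^\[proxy:[0-9]*:[0-9]*@[^]]*]' "$1" |
        sed 's/.*@//; s/]$//' |
        sort -u
}

# Hypothetical usage, assuming the job's stderr was saved to job.err:
#   failed_proxy_hosts job.err
```

This only shows which proxies reported a failure; it does not say which process failed first or why.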
Can you please help?
Thanks,
Beichuan Yan
Hi Beichuan,
I would recommend trying with:
I_MPI_DAPL_UD=1
I_MPI_DAPL_UD_PROVIDER=ofa-v2-mlx4_0-1u
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
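One way to apply these settings is to export them in the job script before the launch line. The sketch below assumes a bash-style PBS job script and reuses the launch line from the original post. The `command -v` guard is an illustrative addition, not part of the original script. The provider name is the one given in the reply; whether it matches an entry in this cluster's DAPL configuration needs to be checked locally.

```shell
# Settings suggested in the reply: enable the DAPL UD (unreliable datagram) path.
export I_MPI_DAPL_UD=1
export I_MPI_DAPL_UD_PROVIDER=ofa-v2-mlx4_0-1u

# Launch line copied from the original job script. The guard is an
# illustrative addition so the fragment is harmless where Intel MPI is absent.
if command -v mpiexec.hydra >/dev/null 2>&1; then
    mpiexec.hydra -np 1000 -perhost 1 -hostfile "$PBS_NODEFILE" ./paraEllip3d input.txt
else
    echo "mpiexec.hydra not found; variables are set but nothing was launched" >&2
fi
```

If exporting in the script is not convenient, the same variables can probably be passed on the command line with Hydra's `-genv` option, for example `-genv I_MPI_DAPL_UD 1`.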