-------------
rank 63 in job 1 blade4_34649 caused collective abort of all ranks
exit status of rank 63: killed by signal 9
---
---------------
I'm not sure whether this is an Intel MPI-related error or an error in the application.
-------------------
[kunal@GPUBlade exp]$ which mpirun
/opt/intel/impi/4.0.1.007/intel64/bin/mpirun
[kunal@GPUBlade exp]$ mpirun --version
Intel MPI Library for Linux, 64-bit applications, Version 4.0 Update 1 Build 20100910
Copyright (C) 2003-2010 Intel Corporation. All rights reserved.
[kunal@GPUBlade exp]$ mpdtrace -l
GPUBlade_37085 (GPUBlade)
blade4_57372 (192.168.1.102)
-------------------
Any suggestions on how I should go about debugging this error?
Kunal
With only information about the MPI library, it is hardly possible to say anything about this issue.
It could be incorrect buffer allocation, lack of memory, an unstable connection... anything.
As a first step, could you run your application with the "-check_mpi" option? Just run: "mpirun -check_mpi ...."
Do you see the same issue using fewer cores? Is the issue fully reproducible?
BTW: when using "mpirun" you don't need an existing mpd ring - "mpirun" creates a new mpd ring, starts the application, and then stops the ring it created.
Also, if you compile your application with '-g' and run it with I_MPI_DEBUG=5 (or higher), you'll get additional information which may help you understand the issue.
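A minimal sketch of the runs suggested above, in case it helps to see them side by side. The names APP, SRC and NP, and the rank count of 8 for the smaller run, are placeholders assumed for illustration and are not taken from the original job. The script only prints the commands unless DRY_RUN=0 is set. As far as I know, "-check_mpi" relies on the Intel Trace Collector checking library being installed, so treat that step as conditional on your setup.

```shell
#!/bin/sh
# Sketch of the suggested debugging runs. APP, SRC and NP are hypothetical
# placeholders: replace them with your own binary, source file and rank count.
APP=./my_app
SRC=my_app.f90
NP=64

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# 1. Rebuild with debug symbols (mpiifort is the Intel MPI Fortran wrapper;
#    use mpiicc or another wrapper if the code is not Fortran).
run mpiifort -g -o "$APP" "$SRC"

# 2. Run with the MPI correctness checker enabled.
run mpirun -check_mpi -n "$NP" "$APP"

# 3. Run with verbose Intel MPI debug output.
run mpirun -genv I_MPI_DEBUG 5 -n "$NP" "$APP"

# 4. Repeat with fewer ranks to see whether the failure depends on scale.
run mpirun -genv I_MPI_DEBUG 5 -n 8 "$APP"
```

Comparing the output of steps 3 and 4 should show whether the abort only appears at the full rank count, which would point more towards memory or connection limits than towards a logic error in the application.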
Regards!
---Dmitry
It looks like the first argument of the MPI_COMM_DUP call is incorrect.
As an example: MPI_COMM_DUP(MPI_COMM_WORLD, new_comm, ierr)
The argument should be an INTEGER.
Regards!
Dmitry
Hi,
I have compiled espresso with Intel MPI and the MKL library, but I get a "Failure during collective" error, whereas it works fine with Open MPI.
Is there a problem with Intel MPI?
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2112)........: MPI_Bcast(buf=0x516f460, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed
MPIR_Bcast_impl(1670)...:
I_MPIR_Bcast_intra(1887): Failure during collective
MPIR_Bcast_intra(1524)..: Failure during collective
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2112)........: MPI_Bcast(buf=0x5300310, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed
MPIR_Bcast_impl(1670)...:
I_MPIR_Bcast_intra(1887): Failure during collective
MPIR_Bcast_intra(1524)..: Failure during collective
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2112)........: MPI_Bcast(buf=0x6b295c0, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed
MPIR_Bcast_impl(1670)...:
I_MPIR_Bcast_intra(1887): Failure during collective
MPIR_Bcast_intra(1524)..: Failure during collective
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2112)........: MPI_Bcast(buf=0x67183d0, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed
MPIR_Bcast_impl(1670)...:
I_MPIR_Bcast_intra(1887): Failure during collective
MPIR_Bcast_intra(1524)..: Failure during collective
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2112)........: MPI_Bcast(buf=0x4f794c0, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed
MPIR_Bcast_impl(1670)...:
I_MPIR_Bcast_intra(1887): Failure during collective
MPIR_Bcast_intra(1524)..: Failure during collective
[0:n125] unexpected disconnect completion event from [22:n122]
Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
internal ABORT - process 0
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2112)........: MPI_Bcast(buf=0x56bfe30, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed
MPIR_Bcast_impl(1670)...:
I_MPIR_Bcast_intra(1887): Failure during collective
MPIR_Bcast_intra(1524)..: Failure during collective
/var/spool/PBS/mom_priv/epilogue: line 30: kill: (5089) - No such process
Kindly help us resolve this.
Thanks
sanjiv