Kunal_Rao
Novice
255 Views

Application crashes when run on 2 nodes (caused collective abort of all ranks, killed by signal 9)

Hi,
We have a large HPC application that is compiled with the Intel compiler and uses the Intel MPI Library. It works fine when run on a single node (with multiple processes) but crashes when run on 2 nodes (with multiple processes) with the following message:

-------------
rank 63 in job 1 blade4_34649 caused collective abort of all ranks
exit status of rank 63: killed by signal 9
-------------

I'm not sure whether this is an Intel MPI error or an error in the application.
Here is some info about the Intel MPI version we are using and the mpd ring consisting of the 2 nodes.

-------------------
[kunal@GPUBlade exp]$ which mpirun
/opt/intel/impi/4.0.1.007/intel64/bin/mpirun

[kunal@GPUBlade exp]$ mpirun --version
Intel MPI Library for Linux, 64-bit applications, Version 4.0 Update 1 Build 20100910
Copyright (C) 2003-2010 Intel Corporation. All rights reserved.

[kunal@GPUBlade exp]$ mpdtrace -l
GPUBlade_37085 (GPUBlade)
blade4_57372 (192.168.1.102)

-------------------

Any suggestions on how I should go about debugging this error?
Thanks & Regards,
Kunal
4 Replies
Dmitry_K_Intel2
Employee

Hi Kunal,

With only information about the MPI library, it's hardly possible to say anything about this issue.
It could be an incorrect buffer allocation, lack of memory, an unstable connection... anything.
As a first step, could you run your application with the "-check_mpi" option? Just run: "mpirun -check_mpi ...."
Do you see the same issue using fewer cores? Is the issue reliably reproducible?
BTW: when using "mpirun" you don't need an existing mpd ring; "mpirun" creates a new mpd ring, starts the application, and then stops the mpd ring it created.
Also, by compiling your application with '-g' and running with I_MPI_DEBUG=5 (or higher), you'll get additional information that may help you understand the issue.
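
A sketch of that debug workflow as command lines (the executable name "app", the process count, and the use of the mpiifort wrapper are assumptions, not from your setup):

```
# 1. Recompile with debug symbols so stack traces carry source info:
mpiifort -g -o app app.f90

# 2. Run under the Intel MPI correctness checker:
mpirun -check_mpi -n 64 ./app

# 3. Re-run with verbose MPI debug output:
I_MPI_DEBUG=5 mpirun -n 64 ./app
```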

Regards!
---Dmitry
Kunal_Rao
Novice

Thanks, Dmitry, for your reply. Your suggestions were helpful. I was able to run with those extra debugging flags and got some more insight into the problem.
The application crashes with the following message in the mpi_comm_dup MPI call in the application:
----------
[0] ERROR: LOCAL:MPI:CALL_FAILED: error
[0] ERROR: Invalid communicator.
[0] ERROR: Error occurred at:
[0] ERROR: mpi_comm_dup_(comm=0xffffffffc4000000 <>, *newcomm=0x3d930e0, *ierr=0x7fffd2afbddc)
---------
I'll look into it further. Let me know if you have any other suggestions.
Thanks & Regards,
Kunal
Dmitry_K_Intel2
Employee

Kunal,

It looks like the first argument of the MPI_COMM_DUP call is incorrect.
As an example: CALL MPI_COMM_DUP(MPI_COMM_WORLD, new_comm, ierr)
In the Fortran binding, the communicator argument should be an INTEGER.
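
A minimal sketch of correct usage, assuming the application uses the Fortran bindings (the trailing underscore in mpi_comm_dup_ suggests so); the program and variable names here are hypothetical:

```fortran
! Hypothetical minimal example: duplicating MPI_COMM_WORLD correctly.
! In the Fortran bindings, communicators are plain INTEGER handles;
! passing a variable of another type, or an uninitialized value,
! can produce an "Invalid communicator" error like the one above.
program dup_example
    use mpi
    implicit none
    integer :: new_comm, ierr
    call MPI_INIT(ierr)
    call MPI_COMM_DUP(MPI_COMM_WORLD, new_comm, ierr)
    ! ... use new_comm for the application's communication ...
    call MPI_COMM_FREE(new_comm, ierr)
    call MPI_FINALIZE(ierr)
end program dup_example
```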

Regards!
Dmitry
Sanjiv_T_
Beginner

Hi,

I have compiled espresso with Intel MPI and the MKL library, but I am getting a "Failure during collective" error, whereas the same build works fine with OpenMPI.

Is there a problem with Intel MPI?


Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2112)........: MPI_Bcast(buf=0x516f460, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed
MPIR_Bcast_impl(1670)...:
I_MPIR_Bcast_intra(1887): Failure during collective
MPIR_Bcast_intra(1524)..: Failure during collective
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2112)........: MPI_Bcast(buf=0x5300310, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed
MPIR_Bcast_impl(1670)...:
I_MPIR_Bcast_intra(1887): Failure during collective
MPIR_Bcast_intra(1524)..: Failure during collective
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2112)........: MPI_Bcast(buf=0x6b295c0, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed
MPIR_Bcast_impl(1670)...:
I_MPIR_Bcast_intra(1887): Failure during collective
MPIR_Bcast_intra(1524)..: Failure during collective
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2112)........: MPI_Bcast(buf=0x67183d0, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed
MPIR_Bcast_impl(1670)...:
I_MPIR_Bcast_intra(1887): Failure during collective
MPIR_Bcast_intra(1524)..: Failure during collective
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2112)........: MPI_Bcast(buf=0x4f794c0, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed
MPIR_Bcast_impl(1670)...:
I_MPIR_Bcast_intra(1887): Failure during collective
MPIR_Bcast_intra(1524)..: Failure during collective
[0:n125] unexpected disconnect completion event from [22:n122]
Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
internal ABORT - process 0
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2112)........: MPI_Bcast(buf=0x56bfe30, count=96, MPI_DOUBLE_PRECISION, root=4, comm=0x84000004) failed
MPIR_Bcast_impl(1670)...:
I_MPIR_Bcast_intra(1887): Failure during collective
MPIR_Bcast_intra(1524)..: Failure during collective
/var/spool/PBS/mom_priv/epilogue: line 30: kill: (5089) - No such process


Kindly help us resolve this.


Thanks
sanjiv
