Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Recover from crash

johnappleyard377
Beginner
Hi

I'm developing an MPI application on a single-CPU shared-memory machine, and sometimes after a crash I can't start my program again. The message I get is shown below. I've tried terminating all the MPI processes and restarting the service, but the only way I've found to get going again is to reboot the machine. Is there another way to recover without rebooting?

c:\Users\John\Documents\xyz\DP>mpiexec -n 3 -l -mapall ..\2009\xyz_dbg_64 paralleldp
op_read error on left context: generic socket failure, error stack:
MPIDU_Sock_wait(2815): The specified network name is no longer available. (errno 64)
unable to read the cmd header on the left context, generic socket failure, error stack:
MPIDU_Sock_wait(2815): The specified network name is no longer available. (errno 64).
mpiexec aborting job...
(I then have to press ^C several times to get the DOS prompt back.)
TimP
Black Belt

The standard way to clean up with Intel MPI or MPICH2 is mpdallexit, after which mpdboot or mpirun should work.
xuy3
Beginner
Hello,

I have the same problem on Windows XP. I think something is wrong with the -mapall and -map options for mpiexec on Windows. Since mpdallexit exists only on Linux, it is no help here at all.

If someone can offer any useful information, that would be great.


Dmitry_K_Intel2
Employee
Could you please try "mpdkilljob -a"? These commands (mpdallexit and mpdkilljob) don't always work. Sometimes it's impossible to get information about the MPD ring.
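
If it helps, here is a minimal Python sketch that tries the two cleanup commands mentioned in this thread, one after the other, and reports what happened to each. It assumes an MPD-based installation with the commands on PATH. The arguments are copied as written above and are not verified against any particular Intel MPI version, and the helper name try_cleanup is purely illustrative.

```python
import shutil
import subprocess

# Cleanup commands named in this thread, tried in order.
# Assumption: an MPD-based MPI installation; arguments are copied
# from the posts above and not checked against any specific release.
CLEANUP_COMMANDS = [
    ["mpdkilljob", "-a"],
    ["mpdallexit"],
]


def try_cleanup(commands=CLEANUP_COMMANDS, timeout=30):
    """Run each command and return a list of (program, status) pairs.

    A command that is missing, fails, or hangs does not stop the
    remaining commands from being tried.
    """
    results = []
    for cmd in commands:
        program = cmd[0]
        # Skip commands that are not installed on this machine.
        if shutil.which(program) is None:
            results.append((program, "not found"))
            continue
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
            if proc.returncode == 0:
                status = "ok"
            else:
                status = f"failed ({proc.returncode})"
        except subprocess.TimeoutExpired:
            status = "timed out"
        except OSError as exc:
            status = f"error ({exc})"
        results.append((program, status))
    return results


if __name__ == "__main__":
    for program, status in try_cleanup():
        print(f"{program}: {status}")
```

Since these commands can fail when the MPD ring cannot be queried, the sketch only reports each outcome and moves on. It does not guarantee that the next mpiexec run will start.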