johnappleyard377
Beginner

Recover from crash

Hi

I'm developing an MPI application on a single-CPU shared-memory machine, and sometimes after a crash I can't start my program again. The message I get is shown below. I've tried terminating all the MPI processes and restarting the service, but the only way I've found to get going again is to reboot the machine. Is there another way to recover without rebooting?

c:\Users\John\Documents\xyz\DP>mpiexec -n 3 -l -mapall ..\2009\xyz_dbg_64 paralleldp
op_read error on left context: generic socket failure, error stack:
MPIDU_Sock_wait(2815): The specified network name is no longer available. (errno 64)
unable to read the cmd header on the left context, generic socket failure, error stack:
MPIDU_Sock_wait(2815): The specified network name is no longer available. (errno 64).
mpiexec aborting job...
(It then takes several ^C presses to get the DOS prompt back.)
3 Replies
TimP
Black Belt


The standard way to clean up with Intel MPI or MPICH2 is mpdallexit, after which mpdboot or mpirun should work.
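If you want to script that sequence, here is a minimal Python sketch. It assumes mpdallexit is on your PATH (i.e. an MPD-based install); the launch command, the `./app` name and the function name `cleanup_then_launch` are placeholders of mine, not anything from Intel MPI or MPICH2.

```python
import subprocess


def cleanup_then_launch(launch_cmd, cleanup_cmd=("mpdallexit",), run=subprocess.run):
    """Shut down the MPD ring, then try to start the job again.

    launch_cmd  -- your usual launch command as a list of arguments (assumed)
    cleanup_cmd -- defaults to mpdallexit, as suggested above
    run         -- injectable runner, defaults to subprocess.run
    """
    try:
        # The exit status is ignored on purpose: the ring may already be gone.
        run(list(cleanup_cmd), check=False)
    except FileNotFoundError:
        # Cleanup tool is not installed or not on PATH; try the launch anyway.
        pass
    # A non-zero return code means the launch still failed.
    return run(list(launch_cmd), check=False).returncode
```

You would call it with something like `cleanup_then_launch(["mpirun", "-n", "3", "./app"])`; if the return code is still non-zero, the cleanup did not help.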
xuy3
Beginner

Hello,

I have run into the same problem on Windows XP. I think something is wrong with the -mapall and -map options for mpiexec on Windows. Since mpdallexit exists only on Linux, the suggestion above does not help here.

If anyone has useful information, it would be much appreciated.


Dmitry_K_Intel2
Employee

Could you please try "mpdkilljob -a"? Note that these commands (mpdallexit and mpdkilljob) don't always work: sometimes it's impossible to get information about the MPD ring.
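Since neither command is guaranteed to work, one option is to try them in turn with a timeout. The sketch below is only an illustration: `try_cleanup` is a made-up helper name, the 30-second timeout is arbitrary, and an exit code of 0 is assumed to mean the command succeeded, which may not prove the ring is actually clean.

```python
import subprocess


def try_cleanup(commands, timeout=30, run=subprocess.run):
    """Try each cleanup command in order.

    Returns the first command that exits with status 0, or None if
    every command is missing, hangs, or fails.
    """
    for cmd in commands:
        try:
            result = run(list(cmd), timeout=timeout)
        except (FileNotFoundError, subprocess.TimeoutExpired):
            # Tool not present on this platform, or it hung; move on.
            continue
        if result.returncode == 0:
            return list(cmd)
    return None
```

For example, `try_cleanup([["mpdkilljob", "-a"], ["mpdallexit"]])`, where the "-a" form is copied as written above and may need a further argument on your installation. A result of None means nothing worked and you are probably back to restarting the service or rebooting.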