- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I tried to run one of my workload model for training on a CentOs cluster for MPI analysis. Please find below the command used and the error is displayed below. Request your help in resolving the issue.
Commands used
mpiexec –ppn 1 -- ./scripts/run_intelcaffe.sh --hostfile ~/mpd.hosts --solver models/intel_optimized_models/multinode/resnet50_8nodes_2s/solver.prototxt --network tcp --netmask enp175s0 --benchmark mpi
mpirun –ppn 1 –l amplxe-cl -collect hotspots -k sampling-mode=hw -result-dir results -- ./scripts/run_intelcaffe.sh --hostfile ~/mpd.hosts --solver models/intel_optimized_models/multinode/resnet50_8nodes_2s/solver.prototxt --network tcp --netmask enp175s0 --benchmark mpi
I keep getting the following error.
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 26 PID 72362 RUNNING AT node001
= EXIT STATUS: 255
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 27 PID 72363 RUNNING AT node001
= KILLED BY SIGNAL: 9 (Killed)
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi, Thallam. Do I get it right you get a crash without ampxle too?
Let's check if issue is in environment/mpirun area, by running a simpler test, like:
mpiexec.hydra -n 2 IMB-MPI1 Barrier
Try supplying -n <process_count> argument, in case your scheduler provides node list to mpiexec too.
A run with -v option and I_MPI_DEBUG=10 will give you a longer log, which you can post here.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page