Naveen_T_Intel
Employee

Intel mpirun error - AI workload

Hi,

I tried to run a training workload for one of my models on a CentOS cluster for MPI analysis. The commands I used and the resulting error are shown below. I would appreciate your help in resolving the issue.

Commands used 

mpiexec -ppn 1 -- ./scripts/run_intelcaffe.sh --hostfile ~/mpd.hosts --solver models/intel_optimized_models/multinode/resnet50_8nodes_2s/solver.prototxt --network tcp --netmask enp175s0 --benchmark mpi

mpirun -ppn 1 -l amplxe-cl -collect hotspots -k sampling-mode=hw -result-dir results -- ./scripts/run_intelcaffe.sh --hostfile ~/mpd.hosts --solver models/intel_optimized_models/multinode/resnet50_8nodes_2s/solver.prototxt --network tcp --netmask enp175s0 --benchmark mpi

I keep getting the following error. 

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 26 PID 72362 RUNNING AT node001
=   EXIT STATUS: 255
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 27 PID 72363 RUNNING AT node001
=   KILLED BY SIGNAL: 9 (Killed)
 

1 Reply
Maksim_B_Intel
Employee

Hi, Thallam. Do I understand correctly that you get the crash without amplxe-cl too?

Let's check whether the issue is in the environment or in mpirun itself by running a simpler test, such as:

mpiexec.hydra -n 2 IMB-MPI1 Barrier

Try supplying the -n <process_count> argument as well, in case your scheduler also provides a node list to mpiexec.

A run with the -v option and I_MPI_DEBUG=10 will give you a longer log, which you can post here.
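
For example, the simple test above could be rerun with verbose output and saved to a file for posting. This is only a sketch: it assumes mpiexec.hydra and IMB-MPI1 from Intel MPI are on your PATH, and the variable names NP, LOG and CMD and the log file name are placeholders I made up for illustration.

```shell
# Assumption: Intel MPI's mpiexec.hydra and the IMB-MPI1 benchmark are on PATH.
# NP, LOG and CMD are illustrative names, not part of any Intel tool.
NP=2
LOG=mpi_debug.log

# -v asks Hydra for verbose launcher output;
# -genv I_MPI_DEBUG 10 sets the debug level for every rank.
CMD="mpiexec.hydra -v -genv I_MPI_DEBUG 10 -n $NP IMB-MPI1 Barrier"

if command -v mpiexec.hydra >/dev/null 2>&1; then
    # Capture stdout and stderr together so the whole log can be posted.
    $CMD 2>&1 | tee "$LOG"
else
    echo "mpiexec.hydra not found on PATH" >&2
fi
```

If this small test passes, the same -v and I_MPI_DEBUG settings can be added to your original command line to get a comparable log from the failing run.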
