I am using Intel Parallel Studio 2015 on an Intel(R) Xeon(R) CPU E5-2680 v3 (RHEL 6.5) and am currently facing issues with an MPI-based application (NAS Parallel Benchmark BT). Though the issue seems application-specific, I would like your opinions on a methodology for debugging and fixing issues like these.
I was able to verify the MPI setup as follows:
[puneets@host01 bin]$ cat hosts.txt
host02
host03
[puneets@host01 bin]$ mpirun -np 4 -ppn 2 -hostfile hosts.txt ./hello
host02
host02
host03
host03
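For what it's worth, the same sanity check also works with an ordinary command instead of a custom hello binary, since mpirun can launch any executable:

mpirun -np 4 -ppn 2 -hostfile hosts.txt hostname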
But when I try to run the actual application, I end up with:
[puneets@host01 bin]$ mpirun -np 4 -ppn 2 -hostfile hosts.txt ./bt.E.4.mpi_io_full
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 25799 RUNNING AT host03
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
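Exit code 9 corresponds to SIGKILL, i.e. the process was killed externally rather than crashing on its own, so one check I intend to run is to look for OOM-killer messages in the kernel log on the failing node (this assumes SSH access to host03 and a readable dmesg):

ssh host03 'dmesg | grep -i -E "out of memory|oom|killed process" | tail'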
I have attached a verbose log of the error (VERBOSE.txt), generated with:
[puneets@host01 bin]$ mpirun -genv I_MPI_HYDRA_DEBUG=1 -hostfile hosts.txt -genv I_MPI_DEBUG=5 -np 4 -ppn 2 ./bt.E.4.mpi_io_full
On a single node, however, I am able to run the application:
[puneets@host01 bin]$ mpirun -np 4 ./bt.E.4.mpi_io_full

 NAS Parallel Benchmarks 3.3 -- BT Benchmark

 No input file inputbt.data. Using compiled defaults
 Size: 1020x1020x1020
 Iterations: 250    dt: 0.0000040
 Number of active processes: 4

 BTIO -- FULL MPI-IO write interval: 5
I am attaching the make.def and compilation log for your reference.
Any help or hints would be very useful. Eagerly awaiting your replies.
I also tried running this benchmark on compute nodes via PBS, and I again end up with a similar error.
Here is my job submission script:
#!/bin/bash
#PBS -N NPB_N4_TPP24
#PBS -l select=2:ncpus=24:mpiprocs=2
#PBS -q test
#PBS -o output1.txt
#PBS -e error1.txt
#PBS -P cc

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=12
module load suite/intel/parallelStudio
mpirun -np 4 -hostfile $PBS_NODEFILE -genv I_MPI_HYDRA_DEBUG=1 -genv OMP_NUM_THREADS=12 -genv I_MPI_DEBUG=5 -ppn 2 ./bt.E.4.mpi_io_full
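To rule out resource exhaustion before the MPI launch, I am considering adding a per-node memory report to the job script (a sketch; it assumes passwordless SSH to the allocated nodes, which mpirun already relies on):

# Print free memory (in GB) on every node allocated to this job.
for node in $(sort -u $PBS_NODEFILE); do
    echo "== $node =="
    ssh "$node" free -g
done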
This seems to be an issue specific to NPB's class E problem size.
I recompiled NPB for class D, and I was able to run the benchmark across multiple nodes.
Do let me know if you are able to identify the problem with class E (each compute node in my setup has 64 GB of RAM).
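A back-of-the-envelope estimate (my own arithmetic, not an official NPB figure) suggests class E may simply not fit in memory on two 64 GB nodes:

# One 5-component double-precision field on the class E 1020^3 grid:
echo "$(( 1020**3 * 5 * 8 / 1024**3 )) GiB per field array"   # prints 39
# BT allocates several arrays of this size (solution, RHS, forcing, plus
# work space), so the aggregate working set plausibly exceeds the 128 GB
# available across my two nodes, which would explain the SIGKILL.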