Intel® MPI Library

The code cannot run in parallel on multiple nodes, reporting "KILLED BY SIGNAL: 9 (Killed)"

Xingguang-Zhou
Beginner

Dear developers, 

    This is Xingguang Zhou. I used Intel oneAPI 2023 to compile a computational fluid dynamics program, Incompact3d. The compilation completes, but the program fails to run, with an error consistent with the one described in this post: Intel MPI "helloworld" program not work on the machine - Intel Community

    Following the administrator's advice in that post, we successfully ran the program in parallel on a single node.

    We now want to run the program in parallel across multiple nodes. However, when we try to run it on 2 nodes, it fails with the following error:

[0] MPI startup(): Intel(R) MPI Library, Version 2021.8  Build 20221129 (id: 339ec755a1)
[0] MPI startup(): Copyright (C) 2003-2022 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (149 MB per rank) * (56 local ranks) = 8362 MB total
[56] MPI startup(): shm segment size (149 MB per rank) * (56 local ranks) = 8362 MB total
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): File "" not found
[0] MPI startup(): Load tuning file: "/share/apps/inteloneapi2023/mpi/2021.8.0/etc/tuning_icx_shm.dat"
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 30 (TAG_UB value: 1073741823) 
[0] MPI startup(): source bits available: 0 (Maximal number of rank: 0) 

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 57 PID 1442 RUNNING AT node6348002
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 58 PID 1443 RUNNING AT node6348002
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

    Our configuration is as follows.

Operating system:

[3120103311@login02 TGV-Taylor-Green-vortex]$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Node: 

   Each node contains 56 CPUs and 256 GB memory.

Submit script:

#!/bin/bash
#SBATCH -J Incompact3d
#SBATCH -p node6348
#SBATCH -N 2
#SBATCH -n 112

module load openmpi4
source /share/apps/inteloneapi2023/setvars.sh

ulimit -s unlimited 
ulimit -l unlimited 

I_MPI_DEBUG=10 mpirun -genv I_MPI_FABRICS=shm -genv FI_PROVIDER=shm -np 112 ./xcompact3d > logfile
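
For comparison, below is a minimal sketch of the same batch script without the shared-memory-only settings. I_MPI_FABRICS=shm and FI_PROVIDER=shm restrict Intel MPI to intra-node shared memory, so they cannot serve a 2-node run; the provider named in the sketch is only a placeholder that depends on the cluster interconnect, and this is an illustration rather than a confirmed fix for the SIGKILL:

#!/bin/bash
#SBATCH -J Incompact3d
#SBATCH -p node6348
#SBATCH -N 2
#SBATCH -n 112

# Use only the Intel MPI environment; loading an Open MPI module as well
# can put a different mpirun or MPI runtime first in the path.
source /share/apps/inteloneapi2023/setvars.sh

ulimit -s unlimited
ulimit -l unlimited

# shm:ofi (the Intel MPI default) uses shared memory within a node and a
# libfabric (OFI) provider between nodes.
export I_MPI_FABRICS=shm:ofi
# export FI_PROVIDER=verbs   # placeholder: pick the provider matching your fabric

I_MPI_DEBUG=10 mpirun -np 112 ./xcompact3d > logfile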

    Any suggestions are appreciated.

Yours,

Xingguang Zhou

Xi'an, China.

TobiasK
Moderator

Dear @Xingguang-Zhou

On the forum I can only help with problems you encounter using the latest release, Intel MPI 2021.12, which is contained in oneAPI 2024.1.

"Killed by signal 9" can have many causes; you will need to dig deeper to find out what is triggering it.

Note: we do not support CentOS 7 anymore.

Since you are using Slurm, you may refer to
https://www.intel.com/content/www/us/en/docs/mpi-library/developer-guide-linux/2021-12/job-schedulers-support.html

I would start with a simple precompiled MPI benchmark, run it without Slurm, and see whether that works:
mpirun -n 2 IMB-MPI1
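
If that single-node run is clean, a sketch of the same check across two nodes without Slurm could look like the lines below; node1 and node2 are hypothetical hostnames for the two compute nodes, and the example assumes passwordless ssh between them:

# hosts.txt lists the compute nodes, one per line (hypothetical names)
cat > hosts.txt <<EOF
node1
node2
EOF

# 2 ranks per node, with Intel MPI debug output enabled
I_MPI_DEBUG=10 mpirun -f hosts.txt -ppn 2 -n 4 IMB-MPI1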


