Intel® MPI Library

Job Core Dumps when using Intel MPI over two nodes

chris-wustl

Hello

 

I am having issues running a Slurm job with Intel oneAPI MPI 2021.9 across two nodes using sbatch. Each node runs the job successfully on its own, but when I use Intel MPI parallelization across two nodes the job core dumps. Slurm does not throw an error, and the tasks start on both nodes. I believe I am missing something, but I don't know what. I made sure I compiled the executables with oneAPI. The script and error log are below. Any suggestions would be appreciated. If you need more info, please let me know.

Thanks

Chris

 

Run script:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --mem-per-cpu=2G
#SBATCH --error=error-%j.err
#SBATCH --partition=dragon
#SBATCH --time=1:00:00
#SBATCH --account=wexler
#SBATCH --propagate=STACK

# Set MPI environment variables
export I_MPI_FABRICS=sockets
export I_MPI_FALLBACK=0

srun /software/lammps/build/lmp -in in.lj
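
One thing that may be worth checking in the script above: as far as I know, Intel MPI 2019 and later accept only shm, ofi, or shm:ofi for I_MPI_FABRICS, and the network provider is selected through libfabric's FI_PROVIDER variable instead. A minimal sketch of that variant follows. The tcp provider and the debug level are assumptions for illustration, not settings confirmed in this thread, and the function name set_mpi_env is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): set the Intel MPI
# fabric variables in the form used by Intel MPI 2019 and later.
set_mpi_env() {
    # Shared memory within a node, libfabric (OFI) between nodes.
    export I_MPI_FABRICS=shm:ofi
    # Assumption: plain TCP provider. A cluster with InfiniBand or a
    # similar interconnect would likely pick a different provider.
    export FI_PROVIDER=tcp
    # Ask Intel MPI for extra startup diagnostics, which normally
    # include the libfabric provider it selected.
    export I_MPI_DEBUG=5
}

set_mpi_env
echo "I_MPI_FABRICS=$I_MPI_FABRICS FI_PROVIDER=$FI_PROVIDER I_MPI_DEBUG=$I_MPI_DEBUG"

# In the batch script this would be followed by the original launch line:
# srun /software/lammps/build/lmp -in in.lj
```

When launching with srun rather than mpirun, Intel's documentation also describes pointing I_MPI_PMI_LIBRARY at Slurm's PMI library. That path is site-specific, so it is left out of the sketch.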

 

Error log:

 

[dragon1:140740:0:140740] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 140740) ====
0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x1530cd8e1fc4]
1 /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x1530cd8e5fec]
2 /lib/x86_64-linux-gnu/libucs.so.0(+0x251aa) [0x1530cd8e61aa]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x1532b74e9520]
4 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e886) [0x1530cc22e886]
5 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e8a9) [0x1530cc22e8a9]
6 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0xe9e5) [0x1530cc20e9e5]
7 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x13c17) [0x1530cc213c17]
8 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x14389) [0x1530cc214389]
9 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x150b0) [0x1530cc2150b0]
10 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x15a9a) [0x1530cc215a9a]
11 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16b23) [0x1530cc216b23]
12 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16ce9) [0x1530cc216ce9]
13 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x31a6d) [0x1530cc231a6d]
14 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x319f7) [0x1530cc2319f7]
15 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x666b8e) [0x1532b8066b8e]
16 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x22b919) [0x1532b7c2b919]
17 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x7c658d) [0x1532b81c658d]
18 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x3b71c0) [0x1532b7db71c0]
19 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x722785) [0x1532b8122785]
20 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x2a0153) [0x1532b7ca0153]
21 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(MPI_Scan+0x56e) [0x1532b7b8a40e]
22 /software/lammps/build/lmp(+0x203042) [0x55cc32ced042]
23 /software/lammps/build/lmp(+0x2c0313) [0x55cc32daa313]
24 /software/lammps/build/lmp(+0xcf204) [0x55cc32bb9204]
25 /software/lammps/build/lmp(+0xcf616) [0x55cc32bb9616]
26 /software/lammps/build/lmp(+0xaddbc) [0x55cc32b97dbc]
27 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x1532b74d0d90]
28 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x1532b74d0e40]
29 /software/lammps/build/lmp(+0xaee25) [0x55cc32b98e25]
=================================
==== backtrace (tid: 140741) ====
0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x14b9bfca1fc4]
1 /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x14b9bfca5fec]
2 /lib/x86_64-linux-gnu/libucs.so.0(+0x251aa) [0x14b9bfca61aa]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x14bba9509520]
4 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e886) [0x14b9be22e886]
5 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e8a9) [0x14b9be22e8a9]
6 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0xe9e5) [0x14b9be20e9e5]
7 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x13c17) [0x14b9be213c17]
8 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x14389) [0x14b9be214389]
9 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x150b0) [0x14b9be2150b0]
10 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x15a9a) [0x14b9be215a9a]
11 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16b23) [0x14b9be216b23]
12 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16ce9) [0x14b9be216ce9]
13 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x31a6d) [0x14b9be231a6d]
14 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x319f7) [0x14b9be2319f7]
15 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x666b8e) [0x14bbaa066b8e]
16 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x22b919) [0x14bba9c2b919]
17 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x7c658d) [0x14bbaa1c658d]
18 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x3b71c0) [0x14bba9db71c0]
19 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x722785) [0x14bbaa122785]
20 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x2a0153) [0x14bba9ca0153]
21 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(MPI_Scan+0x56e) [0x14bba9b8a40e]
22 /software/lammps/build/lmp(+0x203042) [0x55d5788d4042]
23 /software/lammps/build/lmp(+0x2c0313) [0x55d578991313]
24 /software/lammps/build/lmp(+0xcf204) [0x55d5787a0204]
25 /software/lammps/build/lmp(+0xcf616) [0x55d5787a0616]
26 /software/lammps/build/lmp(+0xaddbc) [0x55d57877edbc]
27 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x14bba94f0d90]
28 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x14bba94f0e40]
29 /software/lammps/build/lmp(+0xaee25) [0x55d57877fe25]
=================================
[dragon1:140743:0:140743] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 140743) ====
0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x148f6eed5fc4]
1 /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x148f6eed9fec]
2 /lib/x86_64-linux-gnu/libucs.so.0(+0x251aa) [0x148f6eeda1aa]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x1491589f0520]
4 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e886) [0x148f6d82e886]
5 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e8a9) [0x148f6d82e8a9]
6 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0xe9e5) [0x148f6d80e9e5]
7 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x13c17) [0x148f6d813c17]
8 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x14389) [0x148f6d814389]
9 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x150b0) [0x148f6d8150b0]
10 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x15a9a) [0x148f6d815a9a]
11 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16b23) [0x148f6d816b23]
12 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16ce9) [0x148f6d816ce9]
13 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x31a6d) [0x148f6d831a6d]
14 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x319f7) [0x148f6d8319f7]
15 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x666b8e) [0x149159466b8e]
16 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x22b919) [0x14915902b919]
17 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x7c658d) [0x1491595c658d]
18 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x3b71c0) [0x1491591b71c0]
19 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x722785) [0x149159522785]
20 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x2a0153) [0x1491590a0153]
21 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(MPI_Scan+0x56e) [0x149158f8a40e]
22 /software/lammps/build/lmp(+0x203042) [0x5565946e0042]
23 /software/lammps/build/lmp(+0x2c0313) [0x55659479d313]
24 /software/lammps/build/lmp(+0xcf204) [0x5565945ac204]
25 /software/lammps/build/lmp(+0xcf616) [0x5565945ac616]
26 /software/lammps/build/lmp(+0xaddbc) [0x55659458adbc]
27 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x1491589d7d90]
28 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x1491589d7e40]
29 /software/lammps/build/lmp(+0xaee25) [0x55659458be25]
=================================
==== backtrace (tid: 140692) ====
0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x148c42ed5fc4]
1 /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x148c42ed9fec]
2 /lib/x86_64-linux-gnu/libucs.so.0(+0x251aa) [0x148c42eda1aa]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x148e2c9f0520]
4 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e886) [0x148c4182e886]
5 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e8a9) [0x148c4182e8a9]
6 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0xe9e5) [0x148c4180e9e5]
7 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x13c17) [0x148c41813c17]
8 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x14389) [0x148c41814389]
9 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x150b0) [0x148c418150b0]
10 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x15a9a) [0x148c41815a9a]
11 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16b23) [0x148c41816b23]
12 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16ce9) [0x148c41816ce9]
13 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x31a6d) [0x148c41831a6d]
14 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x319f7) [0x148c418319f7]
15 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x666b8e) [0x148e2d466b8e]
16 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x22b919) [0x148e2d02b919]
17 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x7c658d) [0x148e2d5c658d]
18 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x3b71c0) [0x148e2d1b71c0]
19 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x722785) [0x148e2d522785]
20 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x2a0153) [0x148e2d0a0153]
21 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(MPI_Scan+0x56e) [0x148e2cf8a40e]
22 /software/lammps/build/lmp(+0x203042) [0x556c1aa11042]
23 /software/lammps/build/lmp(+0x2c0313) [0x556c1aace313]
24 /software/lammps/build/lmp(+0xcf204) [0x556c1a8dd204]
25 /software/lammps/build/lmp(+0xcf616) [0x556c1a8dd616]
26 /software/lammps/build/lmp(+0xaddbc) [0x556c1a8bbdbc]
27 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x148e2c9d7d90]
28 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x148e2c9d7e40]
29 /software/lammps/build/lmp(+0xaee25) [0x556c1a8bce25]
=================================
srun: error: dragon1: tasks 4-6,8,11,16,20-21,23,25,36-38,40,43,45,51-53,56: Segmentation fault
srun: error: dragon1: task 60: Segmentation fault (core dumped)
srun: error: dragon1: task 46: Segmentation fault (core dumped)
srun: error: dragon1: tasks 14,22,28,30,44,54,62: Segmentation fault (core dumped)
srun: error: dragon1: task 12: Segmentation fault (core dumped)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: got SIGCONT
slurmstepd-dragon1: error: *** JOB 423 ON dragon1 CANCELLED AT 2023-07-11T13:48:01 ***
slurmstepd-dragon1: error: *** STEP 423.0 ON dragon1 CANCELLED AT 2023-07-11T13:48:01 ***
srun: forcing job termination
srun: error: dragon1: tasks 1-3,7,9-10,13,15,17-19,24,26-27,29,31-35,39,41-42,47-50,55,57-59,61,63: Terminated
srun: error: dragon2: tasks 64-127: Terminated
srun: error: dragon1: task 0: Terminated
root@bear:/data1/wexler/lammps-test#
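
Frames 22 to 26 of each backtrace are bare offsets inside the lmp binary, so the log alone does not show which LAMMPS routine made the MPI_Scan call. Below is a small sketch for collecting those offsets so they can be passed to addr2line. The function name list_offsets is hypothetical, the log file name in the usage comment is inferred from the --error pattern and job ID 423, and the lookup only returns useful names if the binary was built with debug symbols.

```shell
#!/bin/bash
# Hypothetical helper: print the unique "+0x..." offsets that a UCX-style
# backtrace reports for one binary.
#   $1 = log file containing the backtrace
#   $2 = path of the binary exactly as it appears in the backtrace
list_offsets() {
    local log="$1" bin="$2"
    # Keep only frames of the form "<bin>(+0x...)", then strip everything
    # except the offset itself and drop duplicates.
    grep -F "${bin}(+0x" "$log" \
        | sed -E 's/.*\(\+(0x[0-9a-f]+)\).*/\1/' \
        | LC_ALL=C sort -u
}

# Possible usage (file name assumed, not run here):
# list_offsets error-423.err /software/lammps/build/lmp \
#     | xargs addr2line -f -C -e /software/lammps/build/lmp
```

The same offsets repeat across every crashed rank in the log, so deduplicating them first keeps the addr2line output short.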

 

 

chris-wustl

Hi Aishwarya

 

Sorry, I was on vacation. It turns out that the network card we were using is not certified for Ubuntu. Dell is sending new cards, and I will not know the result until they arrive. I would like to keep the case open until then, in case this isn't the answer, but I totally understand if you need to close it. I really appreciate all the effort. If you can keep it open, I will report back.

 

Thanks

 

Chris

 

 

AishwaryaCV_Intel

Hi,


We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.


Thanks and regards,

Aishwarya

