Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Job Core Dumps when using Intel MPI over two nodes

chris-wustl
Beginner

Hello,

I am having issues running a Slurm job with Intel oneAPI MPI 2021.9 across two nodes using sbatch. Jobs run successfully on any single node, but when I use Intel MPI parallelization across two nodes, the job core dumps. Slurm does not report an error of its own, and the tasks start on both nodes. I believe I am missing something, but I don't know what. I made sure I compiled the executables with oneAPI. The run script and error log are below. Any suggestions would be appreciated. If you need more information, please let me know.

Thanks,

Chris

Run script:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --mem-per-cpu=2G
#SBATCH --error=error-%j.err
#SBATCH --partition=dragon
#SBATCH --time=1:00:00
#SBATCH --account=wexler
#SBATCH --propagate=STACK

# Set MPI environment variables
export I_MPI_FABRICS=sockets
export I_MPI_FALLBACK=0

srun /software/lammps/build/lmp -in in.lj

 

Error log:

[dragon1:140740:0:140740] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 140740) ====
0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x1530cd8e1fc4]
1 /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x1530cd8e5fec]
2 /lib/x86_64-linux-gnu/libucs.so.0(+0x251aa) [0x1530cd8e61aa]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x1532b74e9520]
4 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e886) [0x1530cc22e886]
5 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e8a9) [0x1530cc22e8a9]
6 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0xe9e5) [0x1530cc20e9e5]
7 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x13c17) [0x1530cc213c17]
8 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x14389) [0x1530cc214389]
9 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x150b0) [0x1530cc2150b0]
10 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x15a9a) [0x1530cc215a9a]
11 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16b23) [0x1530cc216b23]
12 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16ce9) [0x1530cc216ce9]
13 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x31a6d) [0x1530cc231a6d]
14 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x319f7) [0x1530cc2319f7]
15 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x666b8e) [0x1532b8066b8e]
16 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x22b919) [0x1532b7c2b919]
17 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x7c658d) [0x1532b81c658d]
18 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x3b71c0) [0x1532b7db71c0]
19 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x722785) [0x1532b8122785]
20 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x2a0153) [0x1532b7ca0153]
21 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(MPI_Scan+0x56e) [0x1532b7b8a40e]
22 /software/lammps/build/lmp(+0x203042) [0x55cc32ced042]
23 /software/lammps/build/lmp(+0x2c0313) [0x55cc32daa313]
24 /software/lammps/build/lmp(+0xcf204) [0x55cc32bb9204]
25 /software/lammps/build/lmp(+0xcf616) [0x55cc32bb9616]
26 /software/lammps/build/lmp(+0xaddbc) [0x55cc32b97dbc]
27 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x1532b74d0d90]
28 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x1532b74d0e40]
29 /software/lammps/build/lmp(+0xaee25) [0x55cc32b98e25]
=================================
==== backtrace (tid: 140741) ====
0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x14b9bfca1fc4]
1 /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x14b9bfca5fec]
2 /lib/x86_64-linux-gnu/libucs.so.0(+0x251aa) [0x14b9bfca61aa]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x14bba9509520]
4 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e886) [0x14b9be22e886]
5 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e8a9) [0x14b9be22e8a9]
6 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0xe9e5) [0x14b9be20e9e5]
7 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x13c17) [0x14b9be213c17]
8 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x14389) [0x14b9be214389]
9 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x150b0) [0x14b9be2150b0]
10 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x15a9a) [0x14b9be215a9a]
11 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16b23) [0x14b9be216b23]
12 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16ce9) [0x14b9be216ce9]
13 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x31a6d) [0x14b9be231a6d]
14 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x319f7) [0x14b9be2319f7]
15 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x666b8e) [0x14bbaa066b8e]
16 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x22b919) [0x14bba9c2b919]
17 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x7c658d) [0x14bbaa1c658d]
18 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x3b71c0) [0x14bba9db71c0]
19 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x722785) [0x14bbaa122785]
20 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x2a0153) [0x14bba9ca0153]
21 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(MPI_Scan+0x56e) [0x14bba9b8a40e]
22 /software/lammps/build/lmp(+0x203042) [0x55d5788d4042]
23 /software/lammps/build/lmp(+0x2c0313) [0x55d578991313]
24 /software/lammps/build/lmp(+0xcf204) [0x55d5787a0204]
25 /software/lammps/build/lmp(+0xcf616) [0x55d5787a0616]
26 /software/lammps/build/lmp(+0xaddbc) [0x55d57877edbc]
27 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x14bba94f0d90]
28 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x14bba94f0e40]
29 /software/lammps/build/lmp(+0xaee25) [0x55d57877fe25]
=================================
[dragon1:140743:0:140743] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 140743) ====
0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x148f6eed5fc4]
1 /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x148f6eed9fec]
2 /lib/x86_64-linux-gnu/libucs.so.0(+0x251aa) [0x148f6eeda1aa]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x1491589f0520]
4 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e886) [0x148f6d82e886]
5 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e8a9) [0x148f6d82e8a9]
6 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0xe9e5) [0x148f6d80e9e5]
7 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x13c17) [0x148f6d813c17]
8 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x14389) [0x148f6d814389]
9 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x150b0) [0x148f6d8150b0]
10 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x15a9a) [0x148f6d815a9a]
11 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16b23) [0x148f6d816b23]
12 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16ce9) [0x148f6d816ce9]
13 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x31a6d) [0x148f6d831a6d]
14 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x319f7) [0x148f6d8319f7]
15 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x666b8e) [0x149159466b8e]
16 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x22b919) [0x14915902b919]
17 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x7c658d) [0x1491595c658d]
18 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x3b71c0) [0x1491591b71c0]
19 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x722785) [0x149159522785]
20 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x2a0153) [0x1491590a0153]
21 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(MPI_Scan+0x56e) [0x149158f8a40e]
22 /software/lammps/build/lmp(+0x203042) [0x5565946e0042]
23 /software/lammps/build/lmp(+0x2c0313) [0x55659479d313]
24 /software/lammps/build/lmp(+0xcf204) [0x5565945ac204]
25 /software/lammps/build/lmp(+0xcf616) [0x5565945ac616]
26 /software/lammps/build/lmp(+0xaddbc) [0x55659458adbc]
27 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x1491589d7d90]
28 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x1491589d7e40]
29 /software/lammps/build/lmp(+0xaee25) [0x55659458be25]
=================================
==== backtrace (tid: 140692) ====
0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x148c42ed5fc4]
1 /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x148c42ed9fec]
2 /lib/x86_64-linux-gnu/libucs.so.0(+0x251aa) [0x148c42eda1aa]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x148e2c9f0520]
4 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e886) [0x148c4182e886]
5 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e8a9) [0x148c4182e8a9]
6 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0xe9e5) [0x148c4180e9e5]
7 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x13c17) [0x148c41813c17]
8 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x14389) [0x148c41814389]
9 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x150b0) [0x148c418150b0]
10 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x15a9a) [0x148c41815a9a]
11 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16b23) [0x148c41816b23]
12 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16ce9) [0x148c41816ce9]
13 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x31a6d) [0x148c41831a6d]
14 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x319f7) [0x148c418319f7]
15 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x666b8e) [0x148e2d466b8e]
16 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x22b919) [0x148e2d02b919]
17 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x7c658d) [0x148e2d5c658d]
18 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x3b71c0) [0x148e2d1b71c0]
19 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x722785) [0x148e2d522785]
20 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x2a0153) [0x148e2d0a0153]
21 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(MPI_Scan+0x56e) [0x148e2cf8a40e]
22 /software/lammps/build/lmp(+0x203042) [0x556c1aa11042]
23 /software/lammps/build/lmp(+0x2c0313) [0x556c1aace313]
24 /software/lammps/build/lmp(+0xcf204) [0x556c1a8dd204]
25 /software/lammps/build/lmp(+0xcf616) [0x556c1a8dd616]
26 /software/lammps/build/lmp(+0xaddbc) [0x556c1a8bbdbc]
27 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x148e2c9d7d90]
28 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x148e2c9d7e40]
29 /software/lammps/build/lmp(+0xaee25) [0x556c1a8bce25]
=================================
srun: error: dragon1: tasks 4-6,8,11,16,20-21,23,25,36-38,40,43,45,51-53,56: Segmentation fault
srun: error: dragon1: task 60: Segmentation fault (core dumped)
srun: error: dragon1: task 46: Segmentation fault (core dumped)
srun: error: dragon1: tasks 14,22,28,30,44,54,62: Segmentation fault (core dumped)
srun: error: dragon1: task 12: Segmentation fault (core dumped)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: got SIGCONT
slurmstepd-dragon1: error: *** JOB 423 ON dragon1 CANCELLED AT 2023-07-11T13:48:01 ***
slurmstepd-dragon1: error: *** STEP 423.0 ON dragon1 CANCELLED AT 2023-07-11T13:48:01 ***
srun: forcing job termination
srun: error: dragon1: tasks 1-3,7,9-10,13,15,17-19,24,26-27,29,31-35,39,41-42,47-50,55,57-59,61,63: Terminated
srun: error: dragon2: tasks 64-127: Terminated
srun: error: dragon1: task 0: Terminated
root@bear:/data1/wexler/lammps-test#
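
One reading of the backtraces above: every crashing rank is inside librxm-fi.so, the libfabric RxM provider bundled with Intel MPI 2021.9.0, and gets there from MPI_Scan. That points toward the inter-node transport layer rather than LAMMPS itself, though it does not prove it. Also, as far as I know, Intel MPI 2019 and later document only shm, ofi, and shm:ofi as values for I_MPI_FABRICS, so the sockets setting in the run script may not do what was intended. Below is a minimal diagnostic sketch, not a confirmed fix. It assumes plain Ethernet between the nodes and that the bundled libfabric includes the tcp provider. The function name set_mpi_diag_env is hypothetical, introduced only for this example.

```shell
#!/bin/bash
# Diagnostic sketch (assumption: Ethernet between nodes, tcp provider available).
# Exports settings that make Intel MPI and libfabric report what they select.
set_mpi_diag_env() {
    # Intel MPI 2019 and later document shm, ofi, and shm:ofi for this variable.
    export I_MPI_FABRICS=shm:ofi
    # Ask libfabric to use its TCP provider for inter-node traffic.
    export FI_PROVIDER=tcp
    # Print startup details, including the libfabric provider that was chosen.
    export I_MPI_DEBUG=10
    # Verbose libfabric logging; lower to "warn" once the provider is confirmed.
    export FI_LOG_LEVEL=debug
}

set_mpi_diag_env
# In the batch script, the original launch line would follow unchanged:
#   srun /software/lammps/build/lmp -in in.lj
```

If the job still crashes in librxm-fi.so with these settings, the startup lines printed by I_MPI_DEBUG should at least show which provider and network interface each rank picked, which is useful to include in a follow-up post.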

 

 

22 Replies
chris-wustl
Beginner

Hi Aishwarya,

Sorry, I was on vacation. It turns out that the network card we were using is not certified for Ubuntu. Dell is sending new cards, and I will not know the result until they arrive. I would like to keep the case open until this issue is resolved, in case this isn't the answer, but I totally understand if you need to close it. I really appreciate all the effort. If you can keep it open, I will respond back.

Thanks,

Chris

 

 

AishwaryaCV_Intel
Moderator

Hi,


We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.


Thanks and regards,

Aishwarya

