Intel® MPI Library

Job Core Dumps when using Intel MPI over two nodes

chris-wustl
Beginner

Hello

 

I am having issues running a Slurm job with Intel oneAPI MPI 2021.9 across two nodes using sbatch. Jobs run successfully on a single node, but when I use Intel MPI parallelization across two nodes, the job core dumps. Slurm does not report an error, and the tasks start on both nodes. I believe I am missing something, but I don't know what. I made sure I compiled the executables with oneAPI. The script and error log are below. Any suggestions would be appreciated. If you need more info, please let me know.

Thanks

Chris

 

runscript - 

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --mem-per-cpu=2G
#SBATCH --error=error-%j.err
#SBATCH --partition=dragon
#SBATCH --time=1:00:00
#SBATCH --account=wexler
#SBATCH --propagate=STACK

# Set MPI environment variables
export I_MPI_FABRICS=sockets
export I_MPI_FALLBACK=0

srun /software/lammps/build/lmp -in in.lj

 

Error Log - 

 

[dragon1:140740:0:140740] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 140740) ====
0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x1530cd8e1fc4]
1 /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x1530cd8e5fec]
2 /lib/x86_64-linux-gnu/libucs.so.0(+0x251aa) [0x1530cd8e61aa]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x1532b74e9520]
4 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e886) [0x1530cc22e886]
5 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e8a9) [0x1530cc22e8a9]
6 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0xe9e5) [0x1530cc20e9e5]
7 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x13c17) [0x1530cc213c17]
8 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x14389) [0x1530cc214389]
9 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x150b0) [0x1530cc2150b0]
10 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x15a9a) [0x1530cc215a9a]
11 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16b23) [0x1530cc216b23]
12 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16ce9) [0x1530cc216ce9]
13 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x31a6d) [0x1530cc231a6d]
14 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x319f7) [0x1530cc2319f7]
15 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x666b8e) [0x1532b8066b8e]
16 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x22b919) [0x1532b7c2b919]
17 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x7c658d) [0x1532b81c658d]
18 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x3b71c0) [0x1532b7db71c0]
19 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x722785) [0x1532b8122785]
20 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x2a0153) [0x1532b7ca0153]
21 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(MPI_Scan+0x56e) [0x1532b7b8a40e]
22 /software/lammps/build/lmp(+0x203042) [0x55cc32ced042]
23 /software/lammps/build/lmp(+0x2c0313) [0x55cc32daa313]
24 /software/lammps/build/lmp(+0xcf204) [0x55cc32bb9204]
25 /software/lammps/build/lmp(+0xcf616) [0x55cc32bb9616]
26 /software/lammps/build/lmp(+0xaddbc) [0x55cc32b97dbc]
27 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x1532b74d0d90]
28 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x1532b74d0e40]
29 /software/lammps/build/lmp(+0xaee25) [0x55cc32b98e25]
=================================
==== backtrace (tid: 140741) ====
0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x14b9bfca1fc4]
1 /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x14b9bfca5fec]
2 /lib/x86_64-linux-gnu/libucs.so.0(+0x251aa) [0x14b9bfca61aa]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x14bba9509520]
4 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e886) [0x14b9be22e886]
5 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e8a9) [0x14b9be22e8a9]
6 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0xe9e5) [0x14b9be20e9e5]
7 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x13c17) [0x14b9be213c17]
8 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x14389) [0x14b9be214389]
9 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x150b0) [0x14b9be2150b0]
10 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x15a9a) [0x14b9be215a9a]
11 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16b23) [0x14b9be216b23]
12 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16ce9) [0x14b9be216ce9]
13 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x31a6d) [0x14b9be231a6d]
14 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x319f7) [0x14b9be2319f7]
15 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x666b8e) [0x14bbaa066b8e]
16 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x22b919) [0x14bba9c2b919]
17 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x7c658d) [0x14bbaa1c658d]
18 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x3b71c0) [0x14bba9db71c0]
19 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x722785) [0x14bbaa122785]
20 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x2a0153) [0x14bba9ca0153]
21 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(MPI_Scan+0x56e) [0x14bba9b8a40e]
22 /software/lammps/build/lmp(+0x203042) [0x55d5788d4042]
23 /software/lammps/build/lmp(+0x2c0313) [0x55d578991313]
24 /software/lammps/build/lmp(+0xcf204) [0x55d5787a0204]
25 /software/lammps/build/lmp(+0xcf616) [0x55d5787a0616]
26 /software/lammps/build/lmp(+0xaddbc) [0x55d57877edbc]
27 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x14bba94f0d90]
28 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x14bba94f0e40]
29 /software/lammps/build/lmp(+0xaee25) [0x55d57877fe25]
=================================
[dragon1:140743:0:140743] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 140743) ====
0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x148f6eed5fc4]
1 /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x148f6eed9fec]
2 /lib/x86_64-linux-gnu/libucs.so.0(+0x251aa) [0x148f6eeda1aa]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x1491589f0520]
4 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e886) [0x148f6d82e886]
5 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e8a9) [0x148f6d82e8a9]
6 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0xe9e5) [0x148f6d80e9e5]
7 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x13c17) [0x148f6d813c17]
8 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x14389) [0x148f6d814389]
9 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x150b0) [0x148f6d8150b0]
10 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x15a9a) [0x148f6d815a9a]
11 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16b23) [0x148f6d816b23]
12 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16ce9) [0x148f6d816ce9]
13 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x31a6d) [0x148f6d831a6d]
14 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x319f7) [0x148f6d8319f7]
15 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x666b8e) [0x149159466b8e]
16 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x22b919) [0x14915902b919]
17 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x7c658d) [0x1491595c658d]
18 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x3b71c0) [0x1491591b71c0]
19 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x722785) [0x149159522785]
20 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x2a0153) [0x1491590a0153]
21 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(MPI_Scan+0x56e) [0x149158f8a40e]
22 /software/lammps/build/lmp(+0x203042) [0x5565946e0042]
23 /software/lammps/build/lmp(+0x2c0313) [0x55659479d313]
24 /software/lammps/build/lmp(+0xcf204) [0x5565945ac204]
25 /software/lammps/build/lmp(+0xcf616) [0x5565945ac616]
26 /software/lammps/build/lmp(+0xaddbc) [0x55659458adbc]
27 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x1491589d7d90]
28 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x1491589d7e40]
29 /software/lammps/build/lmp(+0xaee25) [0x55659458be25]
=================================
==== backtrace (tid: 140692) ====
0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x148c42ed5fc4]
1 /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x148c42ed9fec]
2 /lib/x86_64-linux-gnu/libucs.so.0(+0x251aa) [0x148c42eda1aa]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x148e2c9f0520]
4 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e886) [0x148c4182e886]
5 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x2e8a9) [0x148c4182e8a9]
6 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0xe9e5) [0x148c4180e9e5]
7 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x13c17) [0x148c41813c17]
8 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x14389) [0x148c41814389]
9 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x150b0) [0x148c418150b0]
10 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x15a9a) [0x148c41815a9a]
11 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16b23) [0x148c41816b23]
12 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x16ce9) [0x148c41816ce9]
13 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x31a6d) [0x148c41831a6d]
14 /software/intel/oneapi/mpi/2021.9.0//libfabric/lib/prov/librxm-fi.so(+0x319f7) [0x148c418319f7]
15 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x666b8e) [0x148e2d466b8e]
16 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x22b919) [0x148e2d02b919]
17 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x7c658d) [0x148e2d5c658d]
18 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x3b71c0) [0x148e2d1b71c0]
19 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x722785) [0x148e2d522785]
20 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(+0x2a0153) [0x148e2d0a0153]
21 /software/intel/oneapi/mpi/2021.9.0//lib/release/libmpi.so.12(MPI_Scan+0x56e) [0x148e2cf8a40e]
22 /software/lammps/build/lmp(+0x203042) [0x556c1aa11042]
23 /software/lammps/build/lmp(+0x2c0313) [0x556c1aace313]
24 /software/lammps/build/lmp(+0xcf204) [0x556c1a8dd204]
25 /software/lammps/build/lmp(+0xcf616) [0x556c1a8dd616]
26 /software/lammps/build/lmp(+0xaddbc) [0x556c1a8bbdbc]
27 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x148e2c9d7d90]
28 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x148e2c9d7e40]
29 /software/lammps/build/lmp(+0xaee25) [0x556c1a8bce25]
=================================
srun: error: dragon1: tasks 4-6,8,11,16,20-21,23,25,36-38,40,43,45,51-53,56: Segmentation fault
srun: error: dragon1: task 60: Segmentation fault (core dumped)
srun: error: dragon1: task 46: Segmentation fault (core dumped)
srun: error: dragon1: tasks 14,22,28,30,44,54,62: Segmentation fault (core dumped)
srun: error: dragon1: task 12: Segmentation fault (core dumped)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: got SIGCONT
slurmstepd-dragon1: error: *** JOB 423 ON dragon1 CANCELLED AT 2023-07-11T13:48:01 ***
slurmstepd-dragon1: error: *** STEP 423.0 ON dragon1 CANCELLED AT 2023-07-11T13:48:01 ***
srun: forcing job termination
srun: error: dragon1: tasks 1-3,7,9-10,13,15,17-19,24,26-27,29,31-35,39,41-42,47-50,55,57-59,61,63: Terminated
srun: error: dragon2: tasks 64-127: Terminated
srun: error: dragon1: task 0: Terminated
root@bear:/data1/wexler/lammps-test#

 

 

AishwaryaCV_Intel
Moderator

Hi,


Thank you for posting in the Intel community.


Could you please provide the following details:

1. Your OS and the output of the lscpu command

2. A sample reproducer, along with the steps to reproduce the issue on our end


Could you please try running with the latest Intel MPI version (2021.9), exporting I_MPI_FABRICS=ofi, and let us know if you still face the issue?


Thanks And Regards,

Aishwarya



chris-wustl
Beginner

Hi Aishwarya

 

Really appreciate the help. Let me know if you need anything else. 

 

1. I changed I_MPI_FABRICS to ofi and received the same error output.

 

2. To reproduce, simply go to:

 

https://www.lammps.org/download.html

Download the source. I compiled the code with the latest Intel oneAPI compiler. I will attach the input file to run lmp.

 

3. I installed the Intel oneAPI Base Toolkit and HPC Toolkit 2023.1, the latest version on the Intel website. Is there an updated version? I believe I am running the latest version of MPI.

wexler@bear:/data1/wexler/lammps-test$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2021.9 Build 20230307 (id: d82b3071db)
Copyright 2003-2023, Intel Corporation.

 

4. My OS is Ubuntu 22.04 with latest patch set.  Same on all nodes. Slurm version is the same on all nodes. 

 

5. lscpu output - 

 

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 2
Stepping: 6
CPU max MHz: 3200.0000
CPU min MHz: 800.0000
BogoMIPS: 4000.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 3 MiB (64 instances)
L1i: 2 MiB (64 instances)
L2: 80 MiB (64 instances)
L3: 96 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103,105,107,109,111,113,115,117,119,121,123,125,127
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Not affected

 

chris-wustl
Beginner

Hi Aishwarya

 

 I was going to send you the LAMMPS source and the executable, but the archive exceeds 71 MB. Is there another mechanism to get this tar of the source to you?

 

Chris

 

AishwaryaCV_Intel
Moderator

Hi,

 

We were able to build LAMMPS, downloaded from https://www.lammps.org/download.html, and run it successfully on two nodes.

Please find attached the script file (run1.zip) and the output log file (slurm-515403.zip).

 

We ran the script file with the following command line:

sbatch --partition workq run1.sh

 

Could you please let us know which OFI provider you are running?
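One way to answer the provider question is to read it from Intel MPI's startup diagnostics. With I_MPI_DEBUG=5 (the level used later in this thread), the library prints a "libfabric provider" line. The sketch below filters that line out of captured job output; the function name extract_provider is made up for illustration, and the sample line simply mirrors the format seen in this thread.

```shell
#!/bin/bash
# Print the libfabric provider(s) reported in Intel MPI debug output.
# Reads log text on stdin and expects lines such as:
#   [0] MPI startup(): libfabric provider: tcp;ofi_rxm
# extract_provider is a hypothetical helper name, not an Intel MPI tool.
extract_provider() {
    sed -n 's/^.*MPI startup(): libfabric provider: *//p' | sort -u
}

# Example with a single captured line; in practice, pipe the job's
# output or error file into the function instead.
printf '%s\n' '[0] MPI startup(): libfabric provider: tcp;ofi_rxm' | extract_provider
```

If the job output went to a file, the same function can be fed with `extract_provider < error-<jobid>.err`, where the file name follows whatever `--error` or `--output` pattern the batch script uses.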

 

>>>>I was going to send you the source for lammps and the executable but it exceeds 71 MB. Is there another mechanism to get this tar of the source to you?

Is this source file the same as the one provided in the link, or is it a different one?

 

Thanks And Regards,

Aishwarya

 

chris-wustl
Beginner

Hi Aishwarya

 

 I ran the script as above and received the same segmentation fault. I am running the same version of LAMMPS you compiled. My slurm.conf file has MpiDefault set to pmi2. I thought oneAPI had OFI built in when the HPC Toolkit is installed. I tried specifying different fabrics, with the same result. Is there something I am missing? What else can I do to troubleshoot this issue?
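When jobs are launched with srun rather than mpirun, Intel MPI generally needs to be pointed at Slurm's PMI library through the I_MPI_PMI_LIBRARY environment variable. The following is a minimal sketch, assuming Slurm's libpmi2.so is installed in one of a few common directories; the helper find_pmi2 and the candidate paths are hypothetical and would need adjusting to the local install.

```shell
#!/bin/bash
# Return the first existing libpmi2.so from a list of candidate directories.
# find_pmi2 is a hypothetical helper; the real library location depends on
# how Slurm was installed on the cluster.
find_pmi2() {
    local dir
    for dir in "$@"; do
        if [ -f "$dir/libpmi2.so" ]; then
            printf '%s\n' "$dir/libpmi2.so"
            return 0
        fi
    done
    return 1
}

# Assumed candidate locations, not taken from the thread.
if pmi_lib=$(find_pmi2 /usr/lib/x86_64-linux-gnu /usr/lib64 /usr/local/lib); then
    export I_MPI_PMI_LIBRARY="$pmi_lib"
    echo "I_MPI_PMI_LIBRARY=$I_MPI_PMI_LIBRARY"
else
    echo "libpmi2.so not found in the candidate directories" >&2
fi

# Then, inside the batch script, request the matching PMI type explicitly:
#   srun --mpi=pmi2 /software/lammps/build/lmp -in in.lj
```

Running `srun --mpi=list` on the cluster shows which PMI types the local Slurm build supports, which is a quick way to confirm that pmi2 is actually available.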

 

Thanks

 

Chris

chris-wustl
Beginner

I tried multiple ways to get this to work, and I think I may have found an issue. I set I_MPI_DEBUG to 5. One note: we are using oneAPI from a shared drive on all nodes. A run using the head node plus one of the worker nodes does work and produces this:

 

 [0] MPI startup(): Intel(R) MPI Library, Version 2021.9 Build 20230307 (id: d82b3071db)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
[0] MPI startup(): File "/software/intel/oneapi/mpi/2021.9.0/etc/tuning_skx_shm-ofi_tcp-ofi-rxm_10.dat" not found
[0] MPI startup(): Load tuning file: "/software/intel/oneapi/mpi/2021.9.0/etc/tuning_skx_shm-ofi_tcp-ofi-rxm.dat"

lammps works

 

The output for two worker nodes is:

[0] MPI startup(): Intel(R) MPI Library, Version 2021.9 Build 20230307 (id: d82b3071db)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm

Segmentation fault - no other output

 

I have tried different fabrics, but they produce the same errors.

 

Here's the sbatch script:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --error=error-%j.err
#SBATCH --partition=general
#SBATCH --time=1:00:00
#SBATCH --account=wexler
#SBATCH --propagate=STACK

# Set MPI environment variables
export I_MPI_FABRICS=shm
export I_MPI_FALLBACK=0
export I_MPI_DEBUG=5

srun /software/lammps-chris/build/lmp -in in.lj

 

Any ideas?

Thanks

Chris

 

 

 

AishwaryaCV_Intel
Moderator

Hi,

 

Could you please run the following IMB benchmark command on your two nodes and let us know the output?

mpirun -n 2 IMB-MPI1
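Depending on how it is launched, a plain two-rank run may place both ranks on the same node, which would not exercise the inter-node path. The sketch below builds a command line that pins one rank to each host using the Intel MPI launcher options -n, -ppn and -hosts. The helper pingpong_cmd is hypothetical, and the host names are the ones that appear in the error log at the top of the thread.

```shell
#!/bin/bash
# Build an mpirun command line that places one rank on each listed host.
# pingpong_cmd is a hypothetical helper that only prints the command.
pingpong_cmd() {
    local hosts="$1"   # comma-separated list, e.g. "dragon1,dragon2"
    local nhosts
    # Count the non-empty entries in the comma-separated list.
    nhosts=$(printf '%s' "$hosts" | tr ',' '\n' | grep -c .)
    printf 'mpirun -n %s -ppn 1 -hosts %s IMB-MPI1 pingpong\n' "$nhosts" "$hosts"
}

# Print the command for the two worker nodes named in the error log.
pingpong_cmd dragon1,dragon2
```

Printing the command first makes it easy to review; to execute it, the printed line can be pasted into a shell on a node where the Intel MPI environment is loaded.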

 

Please refer to the following link for the IMB benchmarks: https://www.intel.com/content/www/us/en/docs/mpi-library/user-guide-benchmarks/2021-2/running-intel-r-mpi-benchmarks.html

 

Thanks And Regards,

Aishwarya

 

chris-wustl
Beginner

 

Output attached.

 

 

 

 

AishwaryaCV_Intel
Moderator

Hi,

 

Could you please run the IMB benchmark with the flags and the I_MPI_DEBUG setting used in the Slurm script below, and provide us with the full output?

#!/bin/bash

#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --partition=general
#SBATCH --time=1:00:00
#SBATCH --account=wexler
#SBATCH --propagate=STACK

#Set MPI environment variables
export I_MPI_FABRICS=shm:ofi
export I_MPI_DEBUG=120

srun IMB-MPI1

 

Thanks And Regards,

Aishwarya

 

chris-wustl
Beginner

Output attached.

 

Thanks

 

Chris

AishwaryaCV_Intel
Moderator

Hi,

 

Could you please run the Slurm script below and provide us with the full output?

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --partition=general
#SBATCH --time=1:00:00
#SBATCH --account=wexler
#SBATCH --propagate=STACK

#Set MPI environment variables
export I_MPI_FABRICS=shm:ofi
export I_MPI_DEBUG=120

mpirun -n 64 -ppn 32 IMB-MPI1 pingpong

Thanks And Regards,

Aishwarya

 

chris-wustl
Beginner

Output Attached

 

Thanks

 

Chris

AishwaryaCV_Intel
Moderator

Hi,

 

Could you please let us know if you have set up passwordless SSH on your machine and confirm that it is functioning correctly? Could you also please provide information about the interconnect and the current drivers you are using?
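A quick way to confirm that key-based SSH works without prompts is to attempt a no-op command in batch mode on each node. This is a minimal sketch; check_ssh is a hypothetical helper name, and the host names in the comments are the ones that appear in the error log at the top of the thread.

```shell
#!/bin/bash
# Check whether non-interactive (key-based) SSH works to each host given
# as an argument. BatchMode=yes makes ssh fail instead of prompting for a
# password, so a failure here means passwordless login is not in place.
# check_ssh is a hypothetical helper name.
check_ssh() {
    local host rc=0
    for host in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "$host: ok"
        else
            echo "$host: passwordless SSH not working"
            rc=1
        fi
    done
    return $rc
}

# Typical one-time setup, run as the user who submits the jobs:
#   ssh-keygen -t ed25519
#   ssh-copy-id dragon1
#   ssh-copy-id dragon2
# Afterwards, verify with:
#   check_ssh dragon1 dragon2
```

The function returns a non-zero status if any host fails, so it can also be used as a guard at the top of a test script.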

 

It seems that you are using hyperthreading. Please disable it for your tasks. One way to do this is with Slurm's -c (--cpus-per-task) option when submitting your job, for example:

 

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --partition=general
#SBATCH --time=1:00:00
#SBATCH --account=wexler
#SBATCH --propagate=STACK
#SBATCH --cpus-per-task=1

#Set MPI environment variables
export I_MPI_FABRICS=shm:ofi
export I_MPI_DEBUG=120

mpirun -n 64 -ppn 32 IMB-MPI1 pingpong 

 

 

Thanks And Regards,

Aishwarya

 

chris-wustl
Beginner

Attached is the requested output. I am confused by the results. I have all ports open for all servers; see below:

 

ufw status

 

Anywhere ALLOW 172.20.93.218
Anywhere ALLOW 172.20.93.219
Anywhere ALLOW 172.20.93.220
Anywhere ALLOW 172.20.93.221

Anywhere ALLOW 10.225.153.13

 

All servers are set up this way.

 

What am I doing wrong?

 

Thanks

 

Chris

chris-wustl
Beginner

Hello

 

 I do not have passwordless SSH set up. I never read that it was a requirement. Please advise if it's necessary.

 

Interconnect - TCP/IP Networking

 

Dell PowerEdge C6520

 

Network

Product - BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller

Driver - 4b:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)

 

Right now, the group has multiple jobs running in the queue using all resources, but I will get you the output as soon as possible.

 

Thanks

 

Chris

 

 

chris-wustl
Beginner

I had some test time and turned off the firewalls on all the nodes. The issue persists. I re-ran your last test script; the output is attached.

 

Chris

AishwaryaCV_Intel
Moderator

Hi,

 

The crash is happening inside libucs, and I'm not sure whether that library needs to be called at all. Could you please let us know whether it is required for a specific feature? It's possible that the Ethernet card requires libucs. I suggest the following steps:

 

  1. Ensure that you are using the latest version of libucs.
  2. Try running with I_MPI_TUNING_BIN="" and explicitly setting I_MPI_OFI_PROVIDER=tcp
  3. Could you please let us know whether Slurm jobs run successfully with other MPI implementations, such as Open MPI or MPICH?
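For steps 1 and 2, the following sketch shows one way to see which UCX libraries a binary or provider plugin resolves to, plus the two settings as they would appear in a batch script. The helper libs_matching is hypothetical, and which file to inspect with ldd is an assumption: libucs may be pulled in by a plugin loaded at run time rather than by the application binary itself.

```shell
#!/bin/bash
# Filter `ldd`-style output (read from stdin) down to the libraries whose
# names match a pattern, printing "name resolved-path" for each match.
# libs_matching is a hypothetical helper name.
libs_matching() {
    local pattern="$1"
    awk -v pat="$pattern" '$1 ~ pat { print $1, $3 }'
}

# On a compute node, using the paths shown in the thread's backtrace:
#   ldd /software/lammps/build/lmp | libs_matching 'libuc[sp]'
# To find the package that owns the library on Ubuntu (the /usr/lib/...
# form of the path may be needed instead):
#   dpkg -S /lib/x86_64-linux-gnu/libucs.so.0
#
# Settings from step 2, as they would appear in the batch script:
#   export I_MPI_TUNING_BIN=""
#   export I_MPI_OFI_PROVIDER=tcp
```

Because the function reads from stdin, it can also be run on saved ldd output collected from each node, which makes it easy to compare nodes side by side.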


Thanks And Regards,

Aishwarya


chris-wustl
Beginner

Hello

 

 I don't know anything about libucs or whether it is needed. I do have ProSupport from Dell which includes Ubuntu support and could pose any question that would help. Please send me what I should ask Dell. I do have the latest version. 

 

I did re-run the script with the environment variables from step 2 set. That output is attached.

 

No other implementations of MPI are installed. Professor would like to stick with an all Intel solution.

 

Chris

 

 

AishwaryaCV_Intel
Moderator

Hi,


The backtrace shows that the segfault originates in libucs, which is not part of our software stack. This suggests that something is wrong with your driver installation. Please check which version of libucs is compatible with the network interface you are using.
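When reading a long backtrace like the one in the first post, it can help to collapse it into the sequence of shared objects it passes through, innermost frame first. The sketch below does that with awk, assuming the frame format seen in the error log ("N /path/to/lib.so(symbol+offset) [address]", with a line of "=" characters closing each trace); summarize_backtrace is a hypothetical helper name.

```shell
#!/bin/bash
# Collapse a backtrace into one line per trace, listing the basename of each
# shared object and merging consecutive frames from the same object.
# Expects frame lines such as:
#   4 /software/.../librxm-fi.so(+0x2e886) [0x1530cc22e886]
# summarize_backtrace is a hypothetical helper name.
summarize_backtrace() {
    awk '
        /^[0-9]+ \// {
            path = $2
            sub(/\(.*/, "", path)          # drop "(symbol+offset)"
            n = split(path, parts, "/")
            lib = parts[n]                 # basename of the object
            if (lib != last) {
                printf "%s%s", sep, lib
                sep = " <- "
                last = lib
            }
        }
        /^=+$/ {                           # separator line ends one trace
            if (sep != "") print ""
            sep = ""; last = ""
        }
        END { if (sep != "") print "" }
    '
}

# Usage, with the job error file written by the batch script:
#   summarize_backtrace < error-<jobid>.err
```

Each output line reads from the innermost frame on the left to the program entry point on the right, so the objects involved and the order in which they appear are visible at a glance.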


Could you please reach out to your supplier to confirm that your installation is up to date? Additionally, if possible, consider testing an alternative MPI implementation (Open MPI or MPICH). This will help us determine whether the issue comes from Intel MPI or from your installation itself.


Thanks And Regards,

Aishwarya


AishwaryaCV_Intel
Moderator

Hi,  

 

We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.

 

Thank you and best regards,

Aishwarya

 
