Hi,
I am running into a problem when using Intel MPI with Slurm.
cpuinfo:
===== Processor composition =====
Processor name : Intel(R) Xeon(R) E5-2650 v2
Packages(sockets) : 2
Cores : 16
Processors(CPUs) : 32
Cores per package : 8
Threads per core : 2
Slurm:
Slurm is configured with 30 CPUs.
Starting Intel MPI through Slurm:
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --time=0-24:00
#SBATCH --ntasks-per-node=30
#SBATCH --exclusive
mpiexec -genv I_MPI_PIN_PROCESSOR_LIST=all:map=scatter -genv I_MPI_DEBUG=16 -genv I_MPI_PIN=1 ./executable file
Result:
[root@head test_slurm]# cat slurm-108.out
Removing mpi version 2021.1.1
Loading mpi version 2021.1.1
The following have been reloaded with a version change:
1) mpi/2021.1.1 => mpi/latest
[0] MPI startup(): Intel(R) MPI Library, Version 2021.1 Build 20201112 (id: b9c9d2fc5)
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
Assertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport.c at line 389: llc_id >= 0
/opt/intel/oneapi/mpi/2021.1.1/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7f412715259c]
/opt/intel/oneapi/mpi/2021.1.1/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f4126a625f1]
/opt/intel/oneapi/mpi/2021.1.1/lib/release/libmpi.so.12(+0x45531e) [0x7f41269dc31e]
/opt/intel/oneapi/mpi/2021.1.1/lib/release/libmpi.so.12(+0x87b982) [0x7f4126e02982]
/opt/intel/oneapi/mpi/2021.1.1/lib/release/libmpi.so.12(+0x6925ab) [0x7f4126c195ab]
/opt/intel/oneapi/mpi/2021.1.1/lib/release/libmpi.so.12(+0x789186) [0x7f4126d10186]
/opt/intel/oneapi/mpi/2021.1.1/lib/release/libmpi.so.12(+0x1a7432) [0x7f412672e432]
/opt/intel/oneapi/mpi/2021.1.1/lib/release/libmpi.so.12(+0x43e666) [0x7f41269c5666]
/opt/intel/oneapi/mpi/2021.1.1/lib/release/libmpi.so.12(MPI_Init+0xcb) [0x7f41269c4c7b]
/opt/intel/oneapi/mpi/2021.1.1/lib/libmpifort.so.12(MPI_INIT+0x1b) [0x7f41279f6d9b]
/home/hysintelmpi/NPB3.4-MZ/NPB3.4-MZ-MPI/bin/bt-mz.C.x() [0x42b4b1]
/home/hysintelmpi/NPB3.4-MZ/NPB3.4-MZ-MPI/bin/bt-mz.C.x() [0x404512]
/home/hysintelmpi/NPB3.4-MZ/NPB3.4-MZ-MPI/bin/bt-mz.C.x() [0x4044a2]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x7f41254326a3]
/home/hysintelmpi/NPB3.4-MZ/NPB3.4-MZ-MPI/bin/bt-mz.C.x() [0x4043ae]
Abort(1) on node 16: Internal error
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 207600 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 207601 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 207602 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 207603 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 4 PID 207604 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 5 PID 207605 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 6 PID 207606 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 7 PID 207607 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 8 PID 207608 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 9 PID 207609 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 10 PID 207610 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
Hi,
Could you tell us more about your environment, such as the interconnect and the provider you are using?
Also, let us know the value currently set for the I_MPI_PMI_LIBRARY environment variable. If it is empty, try pointing I_MPI_PMI_LIBRARY at the Slurm Process Management Interface (PMI) library:
e.g.: export I_MPI_PMI_LIBRARY=/path/to/slurm/pmi/library/libpmi.so
Could you also try launching with srun instead of mpiexec? A sketch of such a job script is shown below.
Finally, please check whether you face the same error when submitting an interactive job.
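For reference, a minimal job script along these lines might look as follows (the libpmi.so path, task count, and executable name are placeholders that depend on your installation and application):

#!/bin/bash
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=30
#SBATCH --exclusive

# Placeholder path: adjust to wherever your Slurm installation provides libpmi.so
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
export I_MPI_DEBUG=16

# Launch through Slurm's PMI instead of mpiexec's Hydra bootstrap
srun -n 30 ./executable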
Regards
Prasanth
Hi,
We haven't heard back from you.
Have you tried the suggestions given above?
If you are still facing the issue, try launching with the following command:
mpirun -bootstrap slurm -n <num_procs> a.out
Let us know if it helps; a sketch of a full batch script using this command follows below.
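As an illustration only (the task count and executable name are taken from the original post's output and may differ for your runs), the batch script would then look roughly like this:

#!/bin/bash
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=30
#SBATCH --exclusive

# Let Intel MPI's Hydra process manager use Slurm as its bootstrap mechanism
mpirun -bootstrap slurm -n 30 ./bt-mz.C.x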
Regards
Prasanth
Hi,
We are closing this thread assuming your issue has been resolved.
We will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be treated as community support only.
Regards
Prasanth
Hi,
Thanks for the confirmation.
As your issue has been resolved, we are closing this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.
Regards
Prasanth