侯玉山
Novice
230 Views

Problem using Intel MPI with Slurm

Hi,

I have a problem running Intel MPI under Slurm.

cpuinfo:

===== Processor composition =====
Processor name : Intel(R) Xeon(R) E5-2650 v2
Packages(sockets) : 2
Cores : 16
Processors(CPUs) : 32
Cores per package : 8
Threads per core : 2

 

slurm:

Slurm is configured with 30 CPUs.
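
For reference, the cpuinfo figures above multiply out as shown in this small arithmetic sketch. The values are copied from this post; the variable names are only illustrative.

```shell
# Values copied from the cpuinfo output in this post.
packages=2
cores_per_package=8
threads_per_core=2
slurm_cpus=30   # CPUs configured in Slurm, as stated in this post

physical_cores=$((packages * cores_per_package))
logical_cpus=$((physical_cores * threads_per_core))

echo "physical cores: $physical_cores"
echo "logical CPUs:   $logical_cpus"
echo "logical CPUs not configured in Slurm: $((logical_cpus - slurm_cpus))"
```

So the 30 tasks requested per node are more than the 16 physical cores, and two fewer than the 32 logical CPUs that cpuinfo reports.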

 

Starting Intel MPI with Slurm:

#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --time=0-24:00
#SBATCH --ntasks-per-node=30
#SBATCH --exclusive

mpiexec -genv I_MPI_PIN_PROCESSOR_LIST=all:map=scatter -genv I_MPI_DEBUG=16 -genv I_MPI_PIN=1 ./executable file

 

result:

[root@head test_slurm]# cat slurm-108.out
Removing mpi version 2021.1.1
Loading mpi version 2021.1.1

The following have been reloaded with a version change:
1) mpi/2021.1.1 => mpi/latest

[0] MPI startup(): Intel(R) MPI Library, Version 2021.1 Build 20201112 (id: b9c9d2fc5)
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
Assertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport.c at line 389: llc_id >= 0
/opt/intel/oneapi/mpi/2021.1.1/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7f412715259c]
/opt/intel/oneapi/mpi/2021.1.1/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f4126a625f1]
/opt/intel/oneapi/mpi/2021.1.1/lib/release/libmpi.so.12(+0x45531e) [0x7f41269dc31e]
/opt/intel/oneapi/mpi/2021.1.1/lib/release/libmpi.so.12(+0x87b982) [0x7f4126e02982]
/opt/intel/oneapi/mpi/2021.1.1/lib/release/libmpi.so.12(+0x6925ab) [0x7f4126c195ab]
/opt/intel/oneapi/mpi/2021.1.1/lib/release/libmpi.so.12(+0x789186) [0x7f4126d10186]
/opt/intel/oneapi/mpi/2021.1.1/lib/release/libmpi.so.12(+0x1a7432) [0x7f412672e432]
/opt/intel/oneapi/mpi/2021.1.1/lib/release/libmpi.so.12(+0x43e666) [0x7f41269c5666]
/opt/intel/oneapi/mpi/2021.1.1/lib/release/libmpi.so.12(MPI_Init+0xcb) [0x7f41269c4c7b]
/opt/intel/oneapi/mpi/2021.1.1/lib/libmpifort.so.12(MPI_INIT+0x1b) [0x7f41279f6d9b]
/home/hysintelmpi/NPB3.4-MZ/NPB3.4-MZ-MPI/bin/bt-mz.C.x() [0x42b4b1]
/home/hysintelmpi/NPB3.4-MZ/NPB3.4-MZ-MPI/bin/bt-mz.C.x() [0x404512]
/home/hysintelmpi/NPB3.4-MZ/NPB3.4-MZ-MPI/bin/bt-mz.C.x() [0x4044a2]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x7f41254326a3]
/home/hysintelmpi/NPB3.4-MZ/NPB3.4-MZ-MPI/bin/bt-mz.C.x() [0x4043ae]
Abort(1) on node 16: Internal error

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 207600 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 207601 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 207602 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 207603 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 4 PID 207604 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 5 PID 207605 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 6 PID 207606 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 7 PID 207607 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 8 PID 207608 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 9 PID 207609 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 10 PID 207610 RUNNING AT c1
= KILLED BY SIGNAL: 9 (Killed)


Accepted Solutions
PrasanthD_intel
Moderator
189 Views

Hi,


Please tell us more about your environment, such as the interconnect and the provider you are using.


What value is set for the I_MPI_PMI_LIBRARY environment variable? If it is empty, try setting it to the Slurm Process Management Interface (PMI) library:

e.g.: export I_MPI_PMI_LIBRARY=/path/to/slurm/pmi/library/libpmi.so


Could you also try launching with srun instead of mpiexec?

Please also check whether you see the same error when submitting an interactive job.
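
A minimal sketch of how the I_MPI_PMI_LIBRARY check could be done from the batch script before launching with srun. The function name check_pmi_library is hypothetical, and the library path is left as the same placeholder used in this post, since the real location depends on the Slurm installation.

```shell
# Report whether I_MPI_PMI_LIBRARY is set and points at a readable file.
# Returns 0 if it looks usable, 1 otherwise.
# check_pmi_library is a hypothetical helper, not part of Intel MPI or Slurm.
check_pmi_library() {
    if [ -z "${I_MPI_PMI_LIBRARY:-}" ]; then
        echo "I_MPI_PMI_LIBRARY is empty or unset"
        return 1
    fi
    if [ ! -f "$I_MPI_PMI_LIBRARY" ] || [ ! -r "$I_MPI_PMI_LIBRARY" ]; then
        echo "I_MPI_PMI_LIBRARY=$I_MPI_PMI_LIBRARY is not a readable file"
        return 1
    fi
    echo "I_MPI_PMI_LIBRARY=$I_MPI_PMI_LIBRARY"
    return 0
}

# Possible use in the job script (placeholder path, adjust to your site):
#   export I_MPI_PMI_LIBRARY=/path/to/slurm/pmi/library/libpmi.so
#   check_pmi_library && srun ./a.out
```

The check only confirms that the variable is set and the file is readable; it does not verify that the file is actually Slurm's PMI library.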


Regards

Prasanth



5 Replies
PrasanthD_intel
Moderator
162 Views

Hi,


We haven't heard back from you.

Have you tried the given alternatives?

If you are still facing the issue, try this command:

mpirun -bootstrap slurm -n <num_procs> a.out
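
As a rough sketch of how <num_procs> and a.out might be filled in for the job from the original post, the helper below only prints the launch line so it can be inspected before being pasted into the batch script. The name launch_line is hypothetical; the value 30 is taken from --ntasks-per-node=30 on a single node in the original script.

```shell
# Print (do not run) the suggested launch line for a given rank count
# and executable. launch_line is a hypothetical helper name.
launch_line() {
    if [ "$#" -ne 2 ]; then
        echo "usage: launch_line <num_procs> <executable>" >&2
        return 2
    fi
    printf 'mpirun -bootstrap slurm -n %s %s\n' "$1" "$2"
}

# 30 ranks, matching --ntasks-per-node=30 with --nodes=1.
launch_line 30 ./a.out
```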


Let us know if it helps.

Regards

Prasanth


PrasanthD_intel
Moderator
136 Views

Hi,


We are closing this thread assuming your issue has been resolved.

We will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.


Regards

Prasanth


侯玉山
Novice
116 Views
Hello! Thank you for your help. The problem has been resolved.
PrasanthD_intel
Moderator
106 Views

Hi,

 

Thanks for the confirmation.

As your issue has been resolved, we are closing this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

 

Regards

Prasanth

 
