Intel® MPI Library

SLURM and oneAPI problems

j0e
New Contributor I

After installing oneAPI on a small cluster, I get the following errors when I launch an MPI program under SLURM with srun (requesting just 2 tasks here, with I_MPI_DEBUG=100 set):

 

MPI startup(): Pinning environment could not be initialized correctly. Intel MPI process pinning will not be used.
               Possible reason: Using Slurm's srun or other job submission commands from other job schedulers to launch your MPI job. In this case, job scheduler specified pinning will be used.
MPIR_pmi_virtualization(): MPI startup(): PMI calls are forwarded to /usr/lib64/libpmi2.so
Abort(1091087) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138): 
MPID_Init(996).......: 
MPIR_pmi_init(168)...: PMI2_Job_GetId returned 14
MPI startup(): Pinning environment could not be initialized correctly. Intel MPI process pinning will not be used.
               Possible reason: Using Slurm's srun or other job submission commands from other job schedulers to launch your MPI job. In this case, job scheduler specified pinning will be used.
MPIR_pmi_virtualization(): MPI startup(): PMI calls are forwarded to /usr/lib64/libpmi2.so
Abort(1091087) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138): 
MPID_Init(996).......: 
MPIR_pmi_init(168)...: PMI2_Job_GetId returned 14
srun: error: node2: tasks 0-1: Exited with exit code 1

 

 lscpu returns:

 

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                20
On-line CPU(s) list:   0-19
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping:              4
CPU MHz:               800.024
CPU max MHz:           3000.0000
CPU min MHz:           800.0000
BogoMIPS:              4400.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              14080K
NUMA node0 CPU(s):     0-19
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke

BTW, I posted about this a year ago, but I don't know what happened. I must have moved on to another project, and the few responses I got from this forum must have gone to my spam box. Sorry about that.

 

j0e
New Contributor I

UPDATE

I have yet to get SLURM srun to work. However, based on this link, https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/running-applications/job-schedulers-support.html, if I set I_MPI_PIN_RESPECT_CPUSET=0, remove I_MPI_PMI_LIBRARY (I think Intel may set this), and launch with mpirun instead of srun, the SLURM job runs without generating errors or warnings.

 

It seems I don't even need to set I_MPI_PIN_RESPECT_CPUSET=0; I can just launch:

unset I_MPI_PMI_LIBRARY

mpirun ./executable

 

OK, almost. If I don't provide a -machinefile to mpirun, it doesn't seem to get the right number of CPUs on each node (some nodes have fewer CPUs). So I need

mpirun -machinefile ./nodes.host ./executable

where nodes.host is a text file like

node1:10
node2:20
node3:20
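A machinefile in this host:count form can also be generated rather than written by hand. The sketch below is only an illustration: the function names expand_counts and make_machinefile are hypothetical, and it assumes the per-node counts are given in SLURM's compressed count[(xrepeat)] notation, the format used by variables such as SLURM_JOB_CPUS_PER_NODE. It has not been tried on the cluster discussed in this thread.

```shell
# Hypothetical helpers (not part of SLURM or Intel MPI).

# Expand a compressed count spec such as "10,20(x2)" into one count per line.
expand_counts() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r item; do
        case $item in
            *"(x"*")")
                count=${item%%"("*}     # text before "("
                reps=${item#*"(x"}      # text after "(x"
                reps=${reps%")"}        # drop the trailing ")"
                ;;
            *)
                count=$item
                reps=1
                ;;
        esac
        i=0
        while [ "$i" -lt "$reps" ]; do
            printf '%s\n' "$count"
            i=$((i + 1))
        done
    done
}

# $1: file with one host name per line, $2: compressed count spec.
# Prints host:count lines suitable for mpirun -machinefile.
make_machinefile() {
    expand_counts "$2" | paste -d: "$1" -
}

# Possible use inside a job (assumption: run within a SLURM allocation):
#   scontrol show hostnames "$SLURM_JOB_NODELIST" > hosts.txt
#   make_machinefile hosts.txt "$SLURM_JOB_CPUS_PER_NODE" > nodes.host
```

Whether the CPU count or the task count per node is the right number to put in the file depends on how the job was requested, so the generated file is worth checking against the hand-written one above.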

 

 

VarshaS_Intel
Moderator

Hi,


Thanks for posting in Intel Communities.


Could you please let us know the commands you are using to request nodes from SLURM, and which version of Intel MPI you are using?


Could you also provide the sample reproducer code you are using and the steps to reproduce the issue?


Please provide the complete debug log by setting I_MPI_DEBUG=30.


Thanks & Regards,

Varsha



j0e
New Contributor I

Hi Varsha,

Thanks for looking into this. Here are the items you requested.

Simple Fortran MPI program (MPI_Slurm_problem.f90)

   program MPI_SlURM
      use mpi
      implicit none
      integer i
      ! mpi variables
      integer nameLen, noProc, mpierr, myRank
      character (len=MPI_MAX_PROCESSOR_NAME) nodeName
      
      ! Begin
      ! Initialize MPI
      call MPI_INIT( mpierr )
      call MPI_COMM_RANK(MPI_COMM_WORLD, myRank, mpierr) ! get rank of this process in world        
      
      call MPI_COMM_SIZE(MPI_COMM_WORLD, noProc, mpiErr)
      if (myRank == 0) write(*,'(a,i0,a)') 'Running program with ',noProc, ' processes'      
      
      call MPI_Barrier(MPI_COMM_WORLD, mpiErr)
      call MPI_GET_PROCESSOR_NAME(nodeName, nameLen, mpiErr)
      do i=0,noProc-1
         if (i == myRank) then
            write(*,'(a,i4,2a)') 'Process ', myRank, ' is running on node: ', trim(nodeName)
         end if
         call MPI_Barrier(MPI_COMM_WORLD, mpiErr)
      end do  
      call MPI_Barrier(MPI_COMM_WORLD, mpiErr) 
      call MPI_FINALIZE(mpierr)          
   end program MPI_SLURM

 makefile

# Intel Fortran MPI compiler
FC = mpiifort
FFLAGSCPU = -O3 -qmkl -xHost -qno-openmp -ipo -qopt-matmul -o slurmTest

# Main program source files
# (make variable names are case-sensitive, so the definition must match $(DRVSRC))
DRVSRC = ./MPI_Slurm_problem.f90

CPU: $(DRVSRC)
	$(FC) $(FFLAGSCPU) $(DRVSRC)

SLURM file

#!/bin/bash
#
#SBATCH --job-name=Run1_slurmTest
#SBATCH --output=Run1_slurmTest.log
#
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --partition=my-mep

# These variables allow system load to show a decrease as processes finish their jobs
export I_MPI_THREAD_YIELD=3
export I_MPI_THREAD_SLEEP=100
export I_MPI_DEBUG=30 # use this for MPI debugging
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so

srun ./slurmTest

Output from Run1_slurmTest.log

MPI startup(): Pinning environment could not be initialized correctly. Intel MPI process pinning will not be used.
               Possible reason: Using Slurm's srun or other job submission commands from other job schedulers to launch your MPI job. In this case, job scheduler specified pinning will be used.
MPI startup(): Pinning environment could not be initialized correctly. Intel MPI process pinning will not be used.
               Possible reason: Using Slurm's srun or other job submission commands from other job schedulers to launch your MPI job. In this case, job scheduler specified pinning will be used.
MPIR_pmi_virtualization(): MPI startup(): PMI calls are forwarded to /usr/lib64/libpmi2.so
MPIR_pmi_virtualization(): MPI startup(): PMI calls are forwarded to /usr/lib64/libpmi2.so
Abort(1091087) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138): 
MPID_Init(996).......: 
MPIR_pmi_init(168)...: PMI2_Job_GetId returned 14
Abort(1091087) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138): 
MPID_Init(996).......: 
MPIR_pmi_init(168)...: PMI2_Job_GetId returned 14
srun: error: node2: tasks 0-1: Exited with exit code 1

MPI version: Intel(R) MPI Library, Version 2021.4 Build 20210831 (id: 758087adf)

j0e
New Contributor I

Any ideas on this problem? If not, I'll start a ticket on support.

VarshaS_Intel
Moderator

Hi,


Thanks for providing the required files.


Could you please tell us the command you use to submit the job (for example, sbatch with the SLURM script)?


Could you also provide the complete debug log, along with the libfabric provider you are using (mlx/psm2/tcp)?


Thanks & Regards,

Varsha


j0e
New Contributor I

Hi Varsha,

 

The slurm submission is just

$ sbatch slurmTest.slurm # slurmTest.slurm is the bash file above

 

The fabric is just TCP.

 

Other than the log file written by SLURM, whose output is given above, which file contains the debug log?

 

cheers,

-joe

VarshaS_Intel
Moderator

Hi,

 

Thanks for providing the details.

 

We are working on your issue. Meanwhile, could you please provide the details of the cluster you are using by running the command below?

clck -F health_user

Thanks & Regards,

Varsha

 

j0e
New Contributor I

Hi Varsha, attached are the log files from 

>clck -F health_user --nodefile nodes.txt

VarshaS_Intel
Moderator

Hi,

 

Could you please try adding the option "--mpi=pmi2" to the srun command in your SLURM script:

export I_MPI_PMI_LIBRARY=<path-to-libpmi2.so>/libpmi2.so 
srun --mpi=pmi2 ./myprog 

 

Thanks & Regards,

Varsha

 

j0e
New Contributor I

That works! Thank you very much, Varsha!

cheers,

-joe
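For reference, the accepted fix applied to the batch script posted earlier in the thread might look like the sketch below. The job name, partition, executable name and the /usr/lib64/libpmi2.so path are copied from that earlier post and will differ on other clusters. The only change to the launch line is the added --mpi=pmi2 option; the check for srun is an addition for illustration, so that the script says so plainly when it is run somewhere SLURM is not installed.

```shell
#!/bin/bash
#
#SBATCH --job-name=Run1_slurmTest
#SBATCH --output=Run1_slurmTest.log
#
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --partition=my-mep

export I_MPI_DEBUG=30   # verbose MPI startup output, as in the original script

# Point Intel MPI at SLURM's PMI2 library (path taken from the earlier post).
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so

# Ask srun to use its PMI2 plugin so that it matches the library above.
if command -v srun >/dev/null 2>&1; then
    srun --mpi=pmi2 ./slurmTest
else
    echo "srun not found; submit this script with sbatch on the cluster" >&2
fi
```

The script is submitted as before with sbatch slurmTest.slurm. Running srun --mpi=list shows which MPI plugin types a given SLURM installation offers, which is a quick way to confirm that pmi2 is available.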

VarshaS_Intel
Moderator

Hi,


Thanks for accepting the solution. Glad to know that your issue is resolved. If you need any additional information, please post a new question, as this thread will no longer be monitored by Intel.


Also, you are eligible for priority support: you can open a support ticket in the Online Service Center (https://www.intel.com/content/www/us/en/developer/get-help/priority-support.html) for direct 1:1 support.


Thanks & Regards,

Varsha

