After installing oneAPI on a small cluster, when I try to run an MPI job under Slurm with srun, I get the following errors (just requesting 2 tasks here, with I_MPI_DEBUG=100 set):
MPI startup(): Pinning environment could not be initialized correctly. Intel MPI process pinning will not be used.
Possible reason: Using Slurm's srun or other job submission commands from other job schedulers to launch your MPI job. In this case, job scheduler specified pinning will be used.
MPIR_pmi_virtualization(): MPI startup(): PMI calls are forwarded to /usr/lib64/libpmi2.so
Abort(1091087) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138):
MPID_Init(996).......:
MPIR_pmi_init(168)...: PMI2_Job_GetId returned 14
MPI startup(): Pinning environment could not be initialized correctly. Intel MPI process pinning will not be used.
Possible reason: Using Slurm's srun or other job submission commands from other job schedulers to launch your MPI job. In this case, job scheduler specified pinning will be used.
MPIR_pmi_virtualization(): MPI startup(): PMI calls are forwarded to /usr/lib64/libpmi2.so
Abort(1091087) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138):
MPID_Init(996).......:
MPIR_pmi_init(168)...: PMI2_Job_GetId returned 14
srun: error: node2: tasks 0-1: Exited with exit code 1
lscpu returns:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping: 4
CPU MHz: 800.024
CPU max MHz: 3000.0000
CPU min MHz: 800.0000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K
NUMA node0 CPU(s): 0-19
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke
BTW, I posted about this a year ago, but I don't know what happened. I must have moved on to another project, and the few responses I got from this forum must have gone to my spam folder. Sorry about that.
UPDATE
While I have yet to get Slurm's srun to work, based on this link, https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/running-applications/job-schedulers-support.html, if I set I_MPI_PIN_RESPECT_CPUSET=0 and unset I_MPI_PMI_LIBRARY (I think Intel may set this), I can use mpirun to run under Slurm without generating errors or warnings.
It seems I don't even need to set I_MPI_PIN_RESPECT_CPUSET=0; I just launch
unset I_MPI_PMI_LIBRARY
mpirun ./executable
OK, almost. If I don't provide a -machinefile to mpirun, it doesn't seem to get the right number of CPUs on each node (some nodes have fewer CPUs). So I need (a full batch-script sketch follows the node list below):
mpirun -machinefile ./nodes.host ./executable
where nodes.host is a text file like
node1:10
node2:20
node3:20
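Putting this together, here is a minimal sketch of a batch script for the mpirun workaround (the #SBATCH values are placeholders, and it assumes the same slurmTest executable that is built later in this thread):
#!/bin/bash
#SBATCH --job-name=mpirun_workaround
#SBATCH --output=mpirun_workaround.log
#SBATCH --ntasks=50
# Intel oneAPI may have set this; unset it so mpirun uses its own process manager
unset I_MPI_PMI_LIBRARY
# nodes.host lists hostname:slots, since some nodes have fewer CPUs
mpirun -machinefile ./nodes.host ./slurmTest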
Hi,
Thanks for posting in Intel Communities.
Could you please let us know the commands you are using with Slurm (to allocate nodes) and which version of Intel MPI you are using?
And also, could you please provide us with the sample reproducer code you are using and the steps to reproduce the issue?
Please provide the complete debug log by setting I_MPI_DEBUG=30.
Thanks & Regards,
Varsha
Hi Varsha,
Thanks for looking into this. Here are the items you requested.
Simple Fortran MPI program (MPI_Slurm_problem.f90)
program MPI_SLURM
   use mpi
   implicit none
   integer i
   ! mpi variables
   integer nameLen, noProc, mpierr, myRank
   character (len=MPI_MAX_PROCESSOR_NAME) nodeName
   ! Begin
   ! Initialize MPI
   call MPI_INIT( mpierr )
   call MPI_COMM_RANK(MPI_COMM_WORLD, myRank, mpierr) ! get rank of this process in world
   call MPI_COMM_SIZE(MPI_COMM_WORLD, noProc, mpiErr)
   if (myRank == 0) write(*,'(a,i0,a)') 'Running program with ', noProc, ' processes'
   call MPI_Barrier(MPI_COMM_WORLD, mpiErr)
   call MPI_GET_PROCESSOR_NAME(nodeName, nameLen, mpiErr)
   do i = 0, noProc-1
      if (i == myRank) then
         write(*,'(a,i4,2a)') 'Process ', myRank, ' is running on node: ', trim(nodeName)
      end if
      call MPI_Barrier(MPI_COMM_WORLD, mpiErr)
   end do
   call MPI_Barrier(MPI_COMM_WORLD, mpiErr)
   call MPI_FINALIZE(mpierr)
end program MPI_SLURM
makefile
# Intel Fortran MPI compiler
FC = mpiifort
FFLAGSCPU = -O3 -qmkl -xHost -qno-openmp -ipo -qopt-matmul -o slurmTest
# Main program source files
DRVSRC = ./MPI_Slurm_problem.f90
CPU: $(DRVSRC)
	$(FC) $(FFLAGSCPU) $(DRVSRC)
SLURM file
#!/bin/bash
#
#SBATCH --job-name=Run1_slurmTest
#SBATCH --output=Run1_slurmTest.log
#
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --partition=my-mep
# These variables allow system load to show a decrease as processes finish their jobs
export I_MPI_THREAD_YIELD=3
export I_MPI_THREAD_SLEEP=100
export I_MPI_DEBUG=30 # use this for MPI debugging
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
srun ./slurmTest
Output from Run1_slurmTest.log
MPI startup(): Pinning environment could not be initialized correctly. Intel MPI process pinning will not be used.
Possible reason: Using Slurm's srun or other job submission commands from other job schedulers to launch your MPI job. In this case, job scheduler specified pinning will be used.
MPI startup(): Pinning environment could not be initialized correctly. Intel MPI process pinning will not be used.
Possible reason: Using Slurm's srun or other job submission commands from other job schedulers to launch your MPI job. In this case, job scheduler specified pinning will be used.
MPIR_pmi_virtualization(): MPI startup(): PMI calls are forwarded to /usr/lib64/libpmi2.so
MPIR_pmi_virtualization(): MPI startup(): PMI calls are forwarded to /usr/lib64/libpmi2.so
Abort(1091087) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138):
MPID_Init(996).......:
MPIR_pmi_init(168)...: PMI2_Job_GetId returned 14
Abort(1091087) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138):
MPID_Init(996).......:
MPIR_pmi_init(168)...: PMI2_Job_GetId returned 14
srun: error: node2: tasks 0-1: Exited with exit code 1
MPI version: Intel(R) MPI Library, Version 2021.4 Build 20210831 (id: 758087adf)
Any ideas on this problem? If not, I'll start a ticket on support.
Hi,
Thanks for providing the required files.
Could you please provide us with the commands you use to submit the job, whether via sbatch or by running the Slurm script directly?
Also, could you please provide us with the complete debug log, along with the libfabric provider you are using (mlx/psm2/tcp)?
Thanks & Regards,
Varsha
Hi Varsha,
The slurm submission is just
$ sbatch slurmTest.slurm # slurmTest.slurm is the bash file above
The fabric is just TCP.
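(For reference, a quick sketch of how the selected provider can be confirmed from the Intel MPI debug output; FI_PROVIDER is a standard libfabric variable and only needed if you want to force a specific provider:
export I_MPI_DEBUG=5
export FI_PROVIDER=tcp   # optional: force the tcp provider
mpirun ./slurmTest 2>&1 | grep -i "libfabric provider")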
Other than the log file from the Slurm output given above, which file has the Slurm debug log?
cheers,
-joe
Hi,
Thanks for providing the details.
We are working on your issue. Meanwhile, could you please provide the details of the cluster you are using by running the command below?
clck -F health_user
Thanks & Regards,
Varsha
Hi Varsha, attached are the log files from
>clck -F health_user --nodefile nodes.txt
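For reference, nodes.txt here is just a plain-text list of the cluster's hostnames, one per line, e.g. (matching the node names used elsewhere in this thread):
node1
node2
node3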
Hi,
Could you please try adding the option "--mpi=pmi2" at the time of running the slurm file:
export I_MPI_PMI_LIBRARY=<path-to-libpmi2.so>/libpmi2.so
srun --mpi=pmi2 ./myprog
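For example, applied to the slurmTest batch script posted earlier in this thread (libpmi2.so is in /usr/lib64 according to your debug output), the last two lines would become:
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
srun --mpi=pmi2 ./slurmTest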
Thanks & Regards,
Varsha
That works! Thank you very much, Varsha!
cheers,
-joe
Hi,
Thanks for accepting the solution. Glad to know that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Also, since you are eligible for priority support, you can open a support ticket in the Online Service Center (https://www.intel.com/content/www/us/en/developer/get-help/priority-support.html) for direct 1:1 support.
Thanks & Regards,
Varsha