Using oneAPI MPI 2021.11 on RHEL 9 with Slurm 23 (firewalld disabled on the compute nodes), running a job throws these errors:
[1708098932.837910] [2402-node004:311701:0] select.c:629 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
[1708098932.837927] [2402-node004:311702:0] select.c:629 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
[1708098932.837923] [2402-node004:311703:0] select.c:629 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
[1708098932.837953] [2402-node004:311700:0] select.c:629 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
Abort(1614479) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Unknown error class, error stack:
MPIR_Init_thread(189)............:
MPID_Init(1561)..................:
MPIDI_OFI_mpi_init_hook(1674)....:
insert_addr_table_roots_only(472): OFI get address vector map failed
Abort(1614479) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Unknown error class, error stack:
MPIR_Init_thread(189)............:
MPID_Init(1561)..................:
MPIDI_OFI_mpi_init_hook(1674)....:
insert_addr_table_roots_only(472): OFI get address vector map failed
Abort(1614479) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Unknown error class, error stack:
MPIR_Init_thread(189)............:
MPID_Init(1561)..................:
MPIDI_OFI_mpi_init_hook(1674)....:
insert_addr_table_roots_only(472): OFI get address vector map failed
[1708098932.837663] [2402-node005:351396:0] select.c:629 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
[1708098932.837669] [2402-node005:351397:0] select.c:629 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
[1708098932.837661] [2402-node005:351398:0] select.c:629 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
Simple sbatch script:
#!/bin/bash
#SBATCH --account=us          # The account name for the job.
#SBATCH --job-name=HelloWorld # The job name.
#SBATCH -c 1                  # The number of CPU cores to use.
#SBATCH --time=1:00           # The time the job will take to run (here, 1 minute).
#SBATCH --mem-per-cpu=1gb     # The memory the job will use per CPU core.
echo "Hello World"
sleep 10
date
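For reference, a variant of this script that actually launches the compiled MPI program (a sketch only; the node/task counts and the binary name hello_mpi are assumptions) could look like:
#!/bin/bash
#SBATCH --account=us            # The account name for the job.
#SBATCH --job-name=HelloMPI     # The job name.
#SBATCH --nodes=2               # Two nodes, matching the error logs above (assumed).
#SBATCH --ntasks-per-node=4     # Four MPI ranks per node (assumed).
#SBATCH --time=5:00             # Walltime.
#SBATCH --mem-per-cpu=1gb       # Memory per core.
ml oneapi/mpi                   # Load Intel MPI inside the job.
mpirun -n $SLURM_NTASKS ./hello_mpi   # Hydra picks up the Slurm allocation.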
Different errors:
[1708352680.858176533] ourclsuester:rank0.hello_mpi: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
[1708352680.858195346] ourclsuester:rank0.hello_mpi: Unable to allocate UD send buffer pool
[1708352680.863279697] ourclsuester:rank1.hello_mpi: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
[1708352680.863298469] ourclsuester:rank1.hello_mpi: Unable to allocate UD send buffer pool
[1708352680.866519379] ourclsuester:rank0.hello_mpi: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
[1708352680.866528039] ourclsuester:rank0.hello_mpi: Unable to allocate UD send buffer pool
[1708352680.871564853] ourclsuester:rank1.hello_mpi: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
[1708352680.871573228] ourclsuester:rank1.hello_mpi: Unable to allocate UD send buffer pool
[1708352680.874123954] ourclsuester:rank0.hello_mpi: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
[1708352680.874132497] ourclsuester:rank0.hello_mpi: Unable to allocate UD send buffer pool
[1708352680.879124687] ourclsuester:rank1.hello_mpi: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
[1708352680.879133269] ourclsuester:rank1.hello_mpi: Unable to allocate UD send buffer pool
[1708352680.888233115] ourclsuester:rank0.hello_mpi: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
[1708352680.888249283] ourclsuester:rank0.hello_mpi: Unable to allocate UD send buffer pool
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: ourcluster
Location: mtl_ofi_component.c:513
Error: Invalid argument (22)
--------------------------------------------------------------------------
[1708352680.892790816] ourclsuester:rank1.hello_mpi: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
[1708352680.892807646] ourclsuester:rank1.hello_mpi: Unable to allocate UD send buffer pool
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: ourcluster
Location: mtl_ofi_component.c:513
Error: Invalid argument (22)
--------------------------------------------------------------------------
[2402-login-001:1028107] common_ucx.c:162 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
[2402-login-001:1028108] common_ucx.c:162 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
Hello from rank 0 of 1
Hello from rank 0 of 1
The error messages are from Open MPI; we do not provide support for Open MPI.
If you are trying to use Intel MPI, your environment is probably not set up correctly.
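One quick way to see which MPI stack a job actually picks up (a sketch; the binary name ./a.out is an assumption) is:
which mpirun mpicc            # Show which launcher and wrapper are first on PATH.
mpirun --version              # Intel MPI and Open MPI print clearly different banners.
ldd ./a.out | grep -i mpi     # Show which MPI library the binary is linked against.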
Can you help troubleshoot the installation of the oneAPI Base Toolkit? I am using this simple Intel tutorial:
ml oneapi/mpi/
Loading modulefiles version 2021.11
mpiicc -cc=icx sample.c
/path/to/oneapi/mpi/2021.11/bin/mpiicx: line 539: icx: command not found
These env vars are set:
printenv|grep -i mpi
__MODULES_LMALTNAME=oneapi/mpi/2021.11&as|oneapi/mpi/default&as|oneapi/mpi/latest
I_MPI_ROOT=/path/to/oneapi/mpi/2021.11
FI_PROVIDER_PATH=/path/to/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/prov:/usr/lib64/libfabric
MANPATH=/path/to/oneapi/mpi/2021.11/share/man:/usr/share/man:/usr/share/lmod/lmod/share/man:
I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS=--external-launcher
__MODULES_LMCONFLICT=oneapi/mpi/2021.11&modulefiles32&modulefiles
CPLUS_INCLUDE_PATH=:/usr/lib64/openmpi/include/paraview:/usr/include/openmpi-x86_64/paraview/paraview
LIBRARY_PATH=/path/to/oneapi/mpi/2021.11/opt/mpi/libfabric/lib:/path/to/oneapi/mpi/2021.11/lib::/usr/lib64/openmpi/lib:/usr/lib64/openmpi/lib/paraview
LOADEDMODULES=oneapi/mpi/2021.11
CLASSPATH=/path/to/oneapi/mpi/2021.11/share/java/mpi.jar
LD_LIBRARY_PATH=/path/to/oneapi/mpi/2021.11/opt/mpi/libfabric/lib:/path/to/oneapi/mpi/2021.11/lib::/usr/lib64/openmpi/lib:/usr/lib64/openmpi/lib/paraview
PATH=/path/to/oneapi/mpi/2021.11/opt/mpi/libfabric/bin:/path/to/oneapi/mpi/2021.11/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/lib64/openmpi/bin
_LMFILES_=/path/to/modulefiles/oneapi/mpi/2021.11
C_INCLUDE_PATH=:/usr/lib64/openmpi/include/paraview:/usr/include/openmpi-x86_64/paraview/paraview
OMPI_MCA_plm_slurm_args=--external-launcher
CPATH=/path/to/oneap
What causes this? This thread seems relevant. Is something wrong with the modulefile?
No. In the other thread the user tried to use icc, which we no longer ship.
You are loading the module file for MPI but no module file for the compiler, hence the compiler is not found.
Assuming
ml oneapi/mpi
is how you load the MPI module file, you may try
ml oneapi/compiler
as well. Otherwise, please reach out to your system administrator and ask how the module system is set up. Usually you have something like "module avail" to print all the module files available on your system.
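As a sketch, assuming a module setup like the one described in this thread, that would be:
module avail oneapi              # List the oneAPI module files installed on the system.
ml oneapi/compiler oneapi/mpi    # Load the compiler together with MPI.
which icx                        # icx should now resolve, so the mpiicc wrapper can find it.
mpiicc -cc=icx sample.c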
Yes, the modules were installed using the offline installer and are all there, so what else could be causing this? oneapi/compiler does not have any MPI commands.
modulefiles -------
MATLAB/R2023b
Mathematica/14.0
anaconda/2023.09
oneapi/advisor/2024.0
oneapi/ccl/2021.11.2
oneapi/compiler-rt/2024.0.2
oneapi/compiler-rt32/2024.0.2
oneapi/compiler/2024.0.2
oneapi/compiler32/2024.0.2
oneapi/dal/2024.0.0
oneapi/debugger/2024.0.0
oneapi/dev-utilities/2024.0.0
oneapi/dnnl/3.3.0
oneapi/dpct/2024.0.0
oneapi/dpl/2022.3
oneapi/hpctoolkit/compiler-rt/2024.0.2
oneapi/hpctoolkit/compiler-rt32/2024.0.2
oneapi/hpctoolkit/compiler/2024.0.2
oneapi/hpctoolkit/compiler32/2024.0.2
oneapi/hpctoolkit/debugger/2024.0.0
oneapi/hpctoolkit/dev-utilities/2024.0.0
oneapi/hpctoolkit/dpl/2022.3
oneapi/hpctoolkit/ifort/2024.0.2
oneapi/hpctoolkit/ifort32/2024.0.2
oneapi/hpctoolkit/inspector/2024.0
oneapi/hpctoolkit/itac/2022.0
oneapi/hpctoolkit/mpi/2021.11
oneapi/hpctoolkit/oclfpga/2024.0.0
oneapi/hpctoolkit/tbb/2021.11
oneapi/hpctoolkit/tbb32/2021.11
oneapi/ifort/2024.0.2
oneapi/ifort32/2024.0.2
oneapi/intel_ipp_ia32/2021.10
oneapi/intel_ipp_intel64/2021.10
oneapi/intel_ippcp_ia32/2021.9
oneapi/intel_ippcp_intel64/2021.9
oneapi/mkl/2024.0
oneapi/mkl32/2024.0
oneapi/mpi/2021.11
oneapi/oclfpga/2024.0.0
oneapi/tbb/2021.11
oneapi/tbb32/2021.11
oneapi/vtune/2024.0
openmpi5/5.0.2
Please try to load oneapi/compiler/2024.0.2 and/or oneapi/hpctoolkit/compiler/2024.0.2.
Again, you need to load all the modules you want to work with, e.g. MPI for mpirun, the compilers for compilation, etc.
OK, that got me further, but now it just hangs. See the verbose output below:
mpirun -v -n 4 ./a.out
[mpiexec@server] Launch arguments: /usr/bin/srun -N 1 -n 1 --ntasks-per-node 1 --external-launcher --nodelist server --input none /cluster/shared/apps/oneapi/mpi/2021.11/bin//hydra_bstrap_proxy --upstream-host server --upstream-port 46325 --pgid 0 --launcher slurm --launcher-number 1 --base-path /cluster/shared/apps/oneapi/mpi/2021.11/bin/ --tree-width 16 --tree-level 1 --time-left -1 --launch-type 0 --debug /cluster/shared/apps/oneapi/mpi/2021.11/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[proxy:0:0@server] pmi cmd from fd 4: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:0@server] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@server] pmi cmd from fd 4: cmd=get_maxes
[proxy:0:0@server] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:0@server] pmi cmd from fd 4: cmd=get_appnum
[proxy:0:0@server] PMI response: cmd=appnum appnum=0
[proxy:0:0@server] pmi cmd from fd 4: cmd=get_my_kvsname
[proxy:0:0@server] PMI response: cmd=my_kvsname kvsname=kvs_1045579_0
[proxy:0:0@server] pmi cmd from fd 4: cmd=get kvsname=kvs_1045579_0 key=PMI_process_mapping
[proxy:0:0@server] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,1,4))
[proxy:0:0@server] pmi cmd from fd 5: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:0@server] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@server] pmi cmd from fd 5: cmd=get_maxes
[proxy:0:0@server] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:0@server] pmi cmd from fd 5: cmd=get_appnum
[proxy:0:0@server] PMI response: cmd=appnum appnum=0
[proxy:0:0@server] pmi cmd from fd 5: cmd=get_my_kvsname
[proxy:0:0@server] PMI response: cmd=my_kvsname kvsname=kvs_1045579_0
[proxy:0:0@server] pmi cmd from fd 5: cmd=get kvsname=kvs_1045579_0 key=PMI_process_mapping
[proxy:0:0@server] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,1,4))
[proxy:0:0@server] pmi cmd from fd 6: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:0@server] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@server] pmi cmd from fd 9: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:0@server] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@server] pmi cmd from fd 6: cmd=get_maxes
[proxy:0:0@server] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:0@server] pmi cmd from fd 9: cmd=get_maxes
[proxy:0:0@server] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:0@server] pmi cmd from fd 6: cmd=get_appnum
[proxy:0:0@server] PMI response: cmd=appnum appnum=0
[proxy:0:0@server] pmi cmd from fd 9: cmd=get_appnum
[proxy:0:0@server] PMI response: cmd=appnum appnum=0
[proxy:0:0@server] pmi cmd from fd 6: cmd=get_my_kvsname
[proxy:0:0@server] PMI response: cmd=my_kvsname kvsname=kvs_1045579_0
[proxy:0:0@server] pmi cmd from fd 9: cmd=get_my_kvsname
[proxy:0:0@server] PMI response: cmd=my_kvsname kvsname=kvs_1045579_0
[proxy:0:0@server] pmi cmd from fd 9: cmd=get kvsname=kvs_1045579_0 key=PMI_process_mapping
[proxy:0:0@server] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,1,4))
[proxy:0:0@server] pmi cmd from fd 4: cmd=put kvsname=kvs_1045579_0 key=-bcast-1-0 value=2F6465762F73686D2F496E74656C5F4D50495F4D5352434E49
[proxy:0:0@server] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0@server] pmi cmd from fd 4: cmd=barrier_in
[proxy:0:0@server] pmi cmd from fd 5: cmd=barrier_in
[proxy:0:0@server] pmi cmd from fd 6: cmd=get kvsname=kvs_1045579_0 key=PMI_process_mapping
[proxy:0:0@server] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,1,4))
[proxy:0:0@server] pmi cmd from fd 9: cmd=barrier_in
[proxy:0:0@server] pmi cmd from fd 6: cmd=barrier_in
[proxy:0:0@server] PMI response: cmd=barrier_out
[proxy:0:0@server] PMI response: cmd=barrier_out
[proxy:0:0@server] PMI response: cmd=barrier_out
[proxy:0:0@server] PMI response: cmd=barrier_out
[proxy:0:0@server] pmi cmd from fd 5: cmd=get kvsname=kvs_1045579_0 key=-bcast-1-0
[proxy:0:0@server] PMI response: cmd=get_result rc=0 msg=success value=2F6465762F73686D2F496E74656C5F4D50495F4D5352434E49
[proxy:0:0@server] pmi cmd from fd 6: cmd=get kvsname=kvs_1045579_0 key=-bcast-1-0
[proxy:0:0@server] PMI response: cmd=get_result rc=0 msg=success value=2F6465762F73686D2F496E74656C5F4D50495F4D5352434E49
[proxy:0:0@server] pmi cmd from fd 9: cmd=get kvsname=kvs_1045579_0 key=-bcast-1-0
[proxy:0:0@server] PMI response: cmd=get_result rc=0 msg=success value=2F6465762F73686D2F496E74656C5F4D50495F4D5352434E49
[proxy:0:0@server] pmi cmd from fd 6: cmd=barrier_in
[proxy:0:0@server] pmi cmd from fd 5: cmd=barrier_in
[proxy:0:0@server] pmi cmd from fd 4: cmd=put kvsname=kvs_1045579_0 key=bc-0 value=mpi#03C073001B00000000000000000080FEA2694A0003D23FB80000000000000000$
[proxy:0:0@server] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0@server] pmi cmd from fd 4: cmd=barrier_in
[proxy:0:0@server] pmi cmd from fd 9: cmd=barrier_in
[proxy:0:0@server] PMI response: cmd=barrier_out
[proxy:0:0@server] PMI response: cmd=barrier_out
[proxy:0:0@server] PMI response: cmd=barrier_out
[proxy:0:0@server] PMI response: cmd=barrier_out
[proxy:0:0@server] pmi cmd from fd 9: cmd=get kvsname=kvs_1045579_0 key=bc-0
[proxy:0:0@server] PMI response: cmd=get_result rc=0 msg=success value=mpi#03C073001B00000000000000000080FEA2694A0003D23FB80000000000000000$
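For a hang like this, two standard debug switches may show where the launch stalls (a sketch; both are documented Intel MPI / libfabric variables, and ./a.out is the binary from above):
I_MPI_DEBUG=5 mpirun -n 4 ./a.out        # Intel MPI prints the selected provider, pinning, and node map.
FI_LOG_LEVEL=debug mpirun -n 4 ./a.out   # libfabric logs the provider negotiation in detail.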
Did you verify that your Slurm installation works without problems?
For example, can you run a simple
srun hostname
inside your Slurm script and get the hostnames of all nodes associated with your job?
Can you run an MPI job outside of Slurm between the nodes?
Please also check the instructions on how to integrate Intel MPI and Slurm.
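A minimal sketch of that sanity check (the node count is an assumption) could be:
#!/bin/bash
#SBATCH --job-name=SlurmCheck
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=1:00
srun hostname    # Each task should print its node's name; a missing node points to a Slurm problem.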
It turns out there is an error in the example's for loop that causes it to run infinitely.
This:
for (i = start; i < end; i = i++)
should be this:
for (i = start; i < end; i++)
(i = i++ is undefined behavior in C; with typical compilers the assignment writes the old value back, so i never advances.)
Someone might want to fix the example.
Would there be a disadvantage to sourcing setvars.sh instead? Could it affect speed? What are the reasons for sourcing that script versus using the modulefiles?
Thanks for reporting this; I will get it fixed.
No, there is no functional difference between using the modules and sourcing the environment script. I prefer the modules because you can unload them and have more fine-grained control, whereas the script simply loads all components.
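For reference, the two approaches side by side (the install prefix /opt/intel/oneapi is an assumption; use the actual install path):
source /opt/intel/oneapi/setvars.sh              # Sets up every installed component in one shot.
ml oneapi/compiler/2024.0.2 oneapi/mpi/2021.11   # Loads only the selected components.
module unload oneapi/mpi/2021.11                 # Individual components can be unloaded again.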