Using oneAPI MPI 2021.11 on RHEL 9 with Slurm 23 (firewalld disabled on the compute nodes), running a job throws these errors:
[1708098932.837910] [2402-node004:311701:0] select.c:629 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
[1708098932.837927] [2402-node004:311702:0] select.c:629 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
[1708098932.837923] [2402-node004:311703:0] select.c:629 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
[1708098932.837953] [2402-node004:311700:0] select.c:629 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
Abort(1614479) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Unknown error class, error stack:
MPIR_Init_thread(189)............:
MPID_Init(1561)..................:
MPIDI_OFI_mpi_init_hook(1674)....:
insert_addr_table_roots_only(472): OFI get address vector map failed
Abort(1614479) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Unknown error class, error stack:
MPIR_Init_thread(189)............:
MPID_Init(1561)..................:
MPIDI_OFI_mpi_init_hook(1674)....:
insert_addr_table_roots_only(472): OFI get address vector map failed
Abort(1614479) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Unknown error class, error stack:
MPIR_Init_thread(189)............:
MPID_Init(1561)..................:
MPIDI_OFI_mpi_init_hook(1674)....:
insert_addr_table_roots_only(472): OFI get address vector map failed
[1708098932.837663] [2402-node005:351396:0] select.c:629 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
[1708098932.837669] [2402-node005:351397:0] select.c:629 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
[1708098932.837661] [2402-node005:351398:0] select.c:629 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
Simple sbatch script:
#!/bin/bash
#SBATCH --account=us          # The account name for the job.
#SBATCH --job-name=HelloWorld # The job name.
#SBATCH -c 1                  # The number of CPU cores to use.
#SBATCH --time=1:00           # The time the job will take to run (here, 1 minute).
#SBATCH --mem-per-cpu=1gb     # The memory the job will use per CPU core.
echo "Hello World"
sleep 10
date
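For reference, a variant of this script that actually launches the compiled MPI program (a sketch only; the node/task counts and the binary name hello_mpi are assumptions) could look like:
#!/bin/bash
#SBATCH --account=us            # The account name for the job.
#SBATCH --job-name=HelloMPI     # The job name.
#SBATCH --nodes=2               # Two nodes, matching the error logs above (assumed).
#SBATCH --ntasks-per-node=4     # Four MPI ranks per node (assumed).
#SBATCH --time=5:00             # Walltime.
#SBATCH --mem-per-cpu=1gb       # Memory per core.
ml oneapi/mpi                   # Load Intel MPI inside the job.
mpirun -n $SLURM_NTASKS ./hello_mpi   # Hydra picks up the Slurm allocation.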
Different errors:
[1708352680.858176533] ourclsuester:rank0.hello_mpi: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
[1708352680.858195346] ourclsuester:rank0.hello_mpi: Unable to allocate UD send buffer pool
[1708352680.863279697] ourclsuester:rank1.hello_mpi: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
[1708352680.863298469] ourclsuester:rank1.hello_mpi: Unable to allocate UD send buffer pool
[1708352680.866519379] ourclsuester:rank0.hello_mpi: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
[1708352680.866528039] ourclsuester:rank0.hello_mpi: Unable to allocate UD send buffer pool
[1708352680.871564853] ourclsuester:rank1.hello_mpi: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
[1708352680.871573228] ourclsuester:rank1.hello_mpi: Unable to allocate UD send buffer pool
[1708352680.874123954] ourclsuester:rank0.hello_mpi: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
[1708352680.874132497] ourclsuester:rank0.hello_mpi: Unable to allocate UD send buffer pool
[1708352680.879124687] ourclsuester:rank1.hello_mpi: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
[1708352680.879133269] ourclsuester:rank1.hello_mpi: Unable to allocate UD send buffer pool
[1708352680.888233115] ourclsuester:rank0.hello_mpi: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
[1708352680.888249283] ourclsuester:rank0.hello_mpi: Unable to allocate UD send buffer pool
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: ourcluster
Location: mtl_ofi_component.c:513
Error: Invalid argument (22)
--------------------------------------------------------------------------
[1708352680.892790816] ourclsuester:rank1.hello_mpi: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
[1708352680.892807646] ourclsuester:rank1.hello_mpi: Unable to allocate UD send buffer pool
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: ourcluster
Location: mtl_ofi_component.c:513
Error: Invalid argument (22)
--------------------------------------------------------------------------
[2402-login-001:1028107] common_ucx.c:162 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
[2402-login-001:1028108] common_ucx.c:162 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
Hello from rank 0 of 1
Hello from rank 0 of 1
The error messages are from Open MPI; we do not provide support for Open MPI.
If you are trying to use Intel MPI, your environment is probably not set up correctly.
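One quick way to see which MPI stack a job actually picks up (a sketch; the binary name ./a.out is an assumption) is:
which mpirun mpicc            # Show which launcher and wrapper are first on PATH.
mpirun --version              # Intel MPI and Open MPI print clearly different banners.
ldd ./a.out | grep -i mpi     # Show which MPI library the binary is linked against.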
Can you help troubleshoot the installation of the oneAPI Base Toolkit? I am using this simple Intel tutorial:
ml oneapi/mpi/
Loading modulefiles version 2021.11
mpiicc -cc=icx sample.c
/path/to/oneapi/mpi/2021.11/bin/mpiicx: line 539: icx: command not found
These env vars are set:
printenv|grep -i mpi
__MODULES_LMALTNAME=oneapi/mpi/2021.11&as|oneapi/mpi/default&as|oneapi/mpi/latest
I_MPI_ROOT=/path/to/oneapi/mpi/2021.11
FI_PROVIDER_PATH=/path/to/oneapi/mpi/2021.11/opt/mpi/libfabric/lib/prov:/usr/lib64/libfabric
MANPATH=/path/to/oneapi/mpi/2021.11/share/man:/usr/share/man:/usr/share/lmod/lmod/share/man:
I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS=--external-launcher
__MODULES_LMCONFLICT=oneapi/mpi/2021.11&modulefiles32&modulefiles
CPLUS_INCLUDE_PATH=:/usr/lib64/openmpi/include/paraview:/usr/include/openmpi-x86_64/paraview/paraview
LIBRARY_PATH=/path/to/oneapi/mpi/2021.11/opt/mpi/libfabric/lib:/path/to/oneapi/mpi/2021.11/lib::/usr/lib64/openmpi/lib:/usr/lib64/openmpi/lib/paraview
LOADEDMODULES=oneapi/mpi/2021.11
CLASSPATH=/path/to/oneapi/mpi/2021.11/share/java/mpi.jar
LD_LIBRARY_PATH=/path/to/oneapi/mpi/2021.11/opt/mpi/libfabric/lib:/path/to/oneapi/mpi/2021.11/lib::/usr/lib64/openmpi/lib:/usr/lib64/openmpi/lib/paraview
PATH=/path/to/oneapi/mpi/2021.11/opt/mpi/libfabric/bin:/path/to/oneapi/mpi/2021.11/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/lib64/openmpi/bin
_LMFILES_=/path/to/modulefiles/oneapi/mpi/2021.11
C_INCLUDE_PATH=:/usr/lib64/openmpi/include/paraview:/usr/include/openmpi-x86_64/paraview/paraview
OMPI_MCA_plm_slurm_args=--external-launcher
CPATH=/path/to/oneap
What causes this? This thread seems relevant. Is something wrong with the modulefile?
No. In the other thread the user tried to use icc, which we no longer ship.
You are loading the module file for MPI but no module file for the compiler, hence the compiler is not found.
Assuming
ml oneapi/mpi
is how you load the MPI module file, you may try
ml oneapi/compiler
as well. Otherwise, please reach out to your system administrator and ask how the module system is set up. Usually you have something like "module avail" to print all the module files available on your system.
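As a sketch, assuming a module setup like the one described in this thread, that would be:
module avail oneapi              # List the oneAPI module files installed on the system.
ml oneapi/compiler oneapi/mpi    # Load the compiler together with MPI.
which icx                        # icx should now resolve, so the mpiicc wrapper can find it.
mpiicc -cc=icx sample.c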
Yes, the modules were installed using the offline installer and are all there, so what else could be causing this? oneapi/compiler does not have any MPI commands.
modulefiles -------
MATLAB/R2023b
Mathematica/14.0
anaconda/2023.09
oneapi/advisor/2024.0
oneapi/ccl/2021.11.2
oneapi/compiler-rt/2024.0.2
oneapi/compiler-rt32/2024.0.2
oneapi/compiler/2024.0.2
oneapi/compiler32/2024.0.2
oneapi/dal/2024.0.0
oneapi/debugger/2024.0.0
oneapi/dev-utilities/2024.0.0
oneapi/dnnl/3.3.0
oneapi/dpct/2024.0.0
oneapi/dpl/2022.3
oneapi/hpctoolkit/compiler-rt/2024.0.2
oneapi/hpctoolkit/compiler-rt32/2024.0.2
oneapi/hpctoolkit/compiler/2024.0.2
oneapi/hpctoolkit/compiler32/2024.0.2
oneapi/hpctoolkit/debugger/2024.0.0
oneapi/hpctoolkit/dev-utilities/2024.0.0
oneapi/hpctoolkit/dpl/2022.3
oneapi/hpctoolkit/ifort/2024.0.2
oneapi/hpctoolkit/ifort32/2024.0.2
oneapi/hpctoolkit/inspector/2024.0
oneapi/hpctoolkit/itac/2022.0
oneapi/hpctoolkit/mpi/2021.11
oneapi/hpctoolkit/oclfpga/2024.0.0
oneapi/hpctoolkit/tbb/2021.11
oneapi/hpctoolkit/tbb32/2021.11
oneapi/ifort/2024.0.2
oneapi/ifort32/2024.0.2
oneapi/intel_ipp_ia32/2021.10
oneapi/intel_ipp_intel64/2021.10
oneapi/intel_ippcp_ia32/2021.9
oneapi/intel_ippcp_intel64/2021.9
oneapi/mkl/2024.0
oneapi/mkl32/2024.0
oneapi/mpi/2021.11
oneapi/oclfpga/2024.0.0
oneapi/tbb/2021.11
oneapi/tbb32/2021.11
oneapi/vtune/2024.0
openmpi5/5.0.2
Please try to load oneapi/compiler/2024.0.2 and/or oneapi/hpctoolkit/compiler/2024.0.2.
Again, you need to load all the modules you want to work with, e.g. MPI for mpirun, the compilers for compilation, etc.
OK, that got me further, but now it just hangs. See the verbose output below:
mpirun -v -n 4 ./a.out
[mpiexec@server] Launch arguments: /usr/bin/srun -N 1 -n 1 --ntasks-per-node 1 --external-launcher --nodelist server --input none /cluster/shared/apps/oneapi/mpi/2021.11/bin//hydra_bstrap_proxy --upstream-host server --upstream-port 46325 --pgid 0 --launcher slurm --launcher-number 1 --base-path /cluster/shared/apps/oneapi/mpi/2021.11/bin/ --tree-width 16 --tree-level 1 --time-left -1 --launch-type 0 --debug /cluster/shared/apps/oneapi/mpi/2021.11/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[proxy:0:0@server] pmi cmd from fd 4: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:0@server] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@server] pmi cmd from fd 4: cmd=get_maxes
[proxy:0:0@server] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:0@server] pmi cmd from fd 4: cmd=get_appnum
[proxy:0:0@server] PMI response: cmd=appnum appnum=0
[proxy:0:0@server] pmi cmd from fd 4: cmd=get_my_kvsname
[proxy:0:0@server] PMI response: cmd=my_kvsname kvsname=kvs_1045579_0
[proxy:0:0@server] pmi cmd from fd 4: cmd=get kvsname=kvs_1045579_0 key=PMI_process_mapping
[proxy:0:0@server] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,1,4))
[proxy:0:0@server] pmi cmd from fd 5: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:0@server] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@server] pmi cmd from fd 5: cmd=get_maxes
[proxy:0:0@server] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:0@server] pmi cmd from fd 5: cmd=get_appnum
[proxy:0:0@server] PMI response: cmd=appnum appnum=0
[proxy:0:0@server] pmi cmd from fd 5: cmd=get_my_kvsname
[proxy:0:0@server] PMI response: cmd=my_kvsname kvsname=kvs_1045579_0
[proxy:0:0@server] pmi cmd from fd 5: cmd=get kvsname=kvs_1045579_0 key=PMI_process_mapping
[proxy:0:0@server] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,1,4))
[proxy:0:0@server] pmi cmd from fd 6: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:0@server] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@server] pmi cmd from fd 9: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:0@server] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:0@server] pmi cmd from fd 6: cmd=get_maxes
[proxy:0:0@server] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:0@server] pmi cmd from fd 9: cmd=get_maxes
[proxy:0:0@server] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:0@server] pmi cmd from fd 6: cmd=get_appnum
[proxy:0:0@server] PMI response: cmd=appnum appnum=0
[proxy:0:0@server] pmi cmd from fd 9: cmd=get_appnum
[proxy:0:0@server] PMI response: cmd=appnum appnum=0
[proxy:0:0@server] pmi cmd from fd 6: cmd=get_my_kvsname
[proxy:0:0@server] PMI response: cmd=my_kvsname kvsname=kvs_1045579_0
[proxy:0:0@server] pmi cmd from fd 9: cmd=get_my_kvsname
[proxy:0:0@server] PMI response: cmd=my_kvsname kvsname=kvs_1045579_0
[proxy:0:0@server] pmi cmd from fd 9: cmd=get kvsname=kvs_1045579_0 key=PMI_process_mapping
[proxy:0:0@server] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,1,4))
[proxy:0:0@server] pmi cmd from fd 4: cmd=put kvsname=kvs_1045579_0 key=-bcast-1-0 value=2F6465762F73686D2F496E74656C5F4D50495F4D5352434E49
[proxy:0:0@server] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0@server] pmi cmd from fd 4: cmd=barrier_in
[proxy:0:0@server] pmi cmd from fd 5: cmd=barrier_in
[proxy:0:0@server] pmi cmd from fd 6: cmd=get kvsname=kvs_1045579_0 key=PMI_process_mapping
[proxy:0:0@server] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,1,4))
[proxy:0:0@server] pmi cmd from fd 9: cmd=barrier_in
[proxy:0:0@server] pmi cmd from fd 6: cmd=barrier_in
[proxy:0:0@server] PMI response: cmd=barrier_out
[proxy:0:0@server] PMI response: cmd=barrier_out
[proxy:0:0@server] PMI response: cmd=barrier_out
[proxy:0:0@server] PMI response: cmd=barrier_out
[proxy:0:0@server] pmi cmd from fd 5: cmd=get kvsname=kvs_1045579_0 key=-bcast-1-0
[proxy:0:0@server] PMI response: cmd=get_result rc=0 msg=success value=2F6465762F73686D2F496E74656C5F4D50495F4D5352434E49
[proxy:0:0@server] pmi cmd from fd 6: cmd=get kvsname=kvs_1045579_0 key=-bcast-1-0
[proxy:0:0@server] PMI response: cmd=get_result rc=0 msg=success value=2F6465762F73686D2F496E74656C5F4D50495F4D5352434E49
[proxy:0:0@server] pmi cmd from fd 9: cmd=get kvsname=kvs_1045579_0 key=-bcast-1-0
[proxy:0:0@server] PMI response: cmd=get_result rc=0 msg=success value=2F6465762F73686D2F496E74656C5F4D50495F4D5352434E49
[proxy:0:0@server] pmi cmd from fd 6: cmd=barrier_in
[proxy:0:0@server] pmi cmd from fd 5: cmd=barrier_in
[proxy:0:0@server] pmi cmd from fd 4: cmd=put kvsname=kvs_1045579_0 key=bc-0 value=mpi#03C073001B00000000000000000080FEA2694A0003D23FB80000000000000000$
[proxy:0:0@server] PMI response: cmd=put_result rc=0 msg=success
[proxy:0:0@server] pmi cmd from fd 4: cmd=barrier_in
[proxy:0:0@server] pmi cmd from fd 9: cmd=barrier_in
[proxy:0:0@server] PMI response: cmd=barrier_out
[proxy:0:0@server] PMI response: cmd=barrier_out
[proxy:0:0@server] PMI response: cmd=barrier_out
[proxy:0:0@server] PMI response: cmd=barrier_out
[proxy:0:0@server] pmi cmd from fd 9: cmd=get kvsname=kvs_1045579_0 key=bc-0
[proxy:0:0@server] PMI response: cmd=get_result rc=0 msg=success value=mpi#03C073001B00000000000000000080FEA2694A0003D23FB80000000000000000$
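For a hang like this, two standard debug switches may show where the launch stalls (a sketch; both are documented Intel MPI / libfabric variables, and ./a.out is the binary from above):
I_MPI_DEBUG=5 mpirun -n 4 ./a.out        # Intel MPI prints the selected provider, pinning, and node map.
FI_LOG_LEVEL=debug mpirun -n 4 ./a.out   # libfabric logs the provider negotiation in detail.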
Did you verify that your Slurm installation works without problems?
For example, can you run a simple
srun hostname
inside your Slurm script and get the hostnames of all nodes associated with your job?
Can you run an MPI job outside of Slurm between the nodes?
Please also check the instructions on how to integrate Intel MPI and Slurm.
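A minimal sketch of that sanity check (the node count is an assumption) could be:
#!/bin/bash
#SBATCH --job-name=SlurmCheck
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=1:00
srun hostname    # Each task should print its node's name; a missing node points to a Slurm problem.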
It turns out there is an error in the example's for loop that causes it to run infinitely.
This:
for (i = start; i < end; i = i++)
should be this:
for (i = start; i < end; i++)
(i = i++ is undefined behavior in C; with typical compilers the assignment writes the old value back, so i never advances.)
Someone might want to fix the example.
Would there be a disadvantage to sourcing setvars.sh instead? Could it affect speed? What are the reasons for sourcing that script versus using the modulefiles?
Thanks for reporting this; I will get it fixed.
No, there is no functional difference between using the modules and sourcing the environment script. I prefer the modules because you can unload them and have more fine-grained control, whereas the script simply loads all components.
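For reference, the two approaches side by side (the install prefix /opt/intel/oneapi is an assumption; use the actual install path):
source /opt/intel/oneapi/setvars.sh              # Sets up every installed component in one shot.
ml oneapi/compiler/2024.0.2 oneapi/mpi/2021.11   # Loads only the selected components.
module unload oneapi/mpi/2021.11                 # Individual components can be unloaded again.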