Intel® Fortran Compiler

UCX error with Coarray Distributed Mode

SamM
Novice
842 Views

I've been trying to get a coarray program to run in distributed mode on my university cluster and have been running into a few problems. I run the following code:

program hello_world
  use iso_fortran_env
  implicit none
  character(len=32) ::  hostname

  call get_environment_variable('HOSTNAME',hostname)
  write(*,'(2(a,i0),a)') "Hello from image: ", this_image(), " of ", &
                          num_images(), " on host: " // trim(hostname)
end program

 

and the submission script to the batch scheduler is:

#!/bin/bash
#SBATCH -t 00:10:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=10
#SBATCH --ntasks-per-core=1

module load intel/2020.2

export I_MPI_REMOVED_VAR_WARNING=0
export I_MPI_VAR_CHECK_SPELLING=0

ifort hello_world_intel.f90 -coarray=distributed -coarray-config-file=caf_config.txt -o hwi

echo -genvall -genv I_MPI_FABRICS=shm:tcp -envall -n 20 ./hwi > caf_config.txt

./hwi > hwi.std.out
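
Editor's note: as far as Intel's documentation describes it, the file named by the coarray config option holds the arguments handed to the Intel MPI launcher. The following is a minimal sketch of writing that file a little more defensively, assuming bash. The helper name `write_caf_config` is hypothetical, and the fallback of 20 images is an assumption taken from the 2 nodes x 10 tasks requested above.

```shell
#!/bin/bash
# Sketch: write the coarray config file with a single, explicit image count.
# write_caf_config is a hypothetical helper, not part of Intel's tooling.
write_caf_config() {
  local outfile=$1 exe=$2
  # Use the Slurm task count when available; 20 matches 2 nodes x 10 tasks.
  local nimages=${SLURM_NTASKS:-20}
  # printf with a quoted string sidesteps echo's own option parsing.
  printf '%s\n' "-genvall -genv I_MPI_FABRICS=shm:tcp -envall -n ${nimages} ${exe}" > "$outfile"
}

write_caf_config caf_config.txt ./hwi
cat caf_config.txt
```

Taking the image count from `SLURM_NTASKS` keeps the config file in step with the `#SBATCH` lines if the node or task counts change.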

 

I get a bunch of errors that look like the following:

[1605965784.490902] [bhp0001:10682:0]          ib_md.c:438  UCX  ERROR ibv_reg_mr(address=0x1000, length=9223372036854771712, access=0xf) failed: Cannot allocate memory

[1605965784.490919] [bhp0001:10682:0]         ucp_mm.c:110  UCX  ERROR failed to register address 0x1000 length 9223372036854771711 on md[1]=ib/mlx4_0: Input/output error
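
Editor's note: when a job emits many of these lines, it can help to condense them before comparing runs. The sketch below counts the UCX error lines and lists the memory domains they name, using the two lines above as sample input. The helper name `summarize_ucx_errors` is hypothetical, and the patterns are assumptions based only on the log format shown here.

```shell
#!/bin/bash
# Sketch: summarize UCX errors found in a job's output.
# summarize_ucx_errors is a hypothetical helper; it reads log text on stdin.
summarize_ucx_errors() {
  local log count
  log=$(cat)
  # Count lines tagged "UCX ERROR" (the tag may contain repeated spaces).
  count=$(printf '%s\n' "$log" | grep -Ec 'UCX +ERROR')
  echo "UCX error lines: ${count}"
  # List the memory domains named in "on md[N]=..." fragments, if any.
  printf '%s\n' "$log" | grep -Eo 'md\[[0-9]+\]=[^:]+' | sort -u
}

summarize_ucx_errors <<'EOF'
[1605965784.490902] [bhp0001:10682:0]          ib_md.c:438  UCX  ERROR ibv_reg_mr(address=0x1000, length=9223372036854771712, access=0xf) failed: Cannot allocate memory
[1605965784.490919] [bhp0001:10682:0]         ucp_mm.c:110  UCX  ERROR failed to register address 0x1000 length 9223372036854771711 on md[1]=ib/mlx4_0: Input/output error
EOF
```

On the cluster the same function could read the job output directly, for example `summarize_ucx_errors < hwi.std.out` if the errors were captured in that file.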

 

I've tried it with both Intel 2020.2 and 2019.2. Version 2019.2 prints the hello world messages, but 2020.2 does not. Both versions emit the UCX errors.

Any ideas what the fix is?

1 Solution
SamM
Novice
802 Views

It turns out that the default Intel MPI inter-node communication fabric wasn't supported on my system. After experimenting with the runtime settings, I found that setting the I_MPI_OFI_PROVIDER environment variable to select the right fabric solved the issue. The docs here were useful.

#!/bin/bash
#SBATCH -t 00:10:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=10
#SBATCH --ntasks-per-core=1

module load intel/2020.2

export I_MPI_REMOVED_VAR_WARNING=0
export I_MPI_VAR_CHECK_SPELLING=0

ifort hello_world_intel.f90 -coarray=distributed -coarray-config-file=caf_config.txt -o hwi

echo -genvall -genv I_MPI_FABRICS=shm:tcp -envall -n 20 ./hwi > caf_config.txt

# Non-working version(s)
# export I_MPI_OFI_PROVIDER=MLX
# export I_MPI_OFI_PROVIDER=PSM2

# Working version(s). Verbs is preferred here
# export I_MPI_OFI_PROVIDER=TCP
export I_MPI_OFI_PROVIDER=Verbs

./hwi > hwi.std.out
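
Editor's note: the trial and error described above can be sketched as a loop over the candidate providers. This is only an illustration, assuming bash. The helper name `first_working_provider` is hypothetical, the provider spellings are copied from the script above, and treating "exit status 0 and no UCX error line" as success is an assumption rather than an Intel-documented test.

```shell
#!/bin/bash
# Sketch: try each candidate OFI provider until one run looks clean.
# first_working_provider is a hypothetical helper. "Clean" here is an
# assumption: exit status 0 and no "UCX ERROR" line in the combined output.
first_working_provider() {
  local provider output
  for provider in Verbs TCP MLX PSM2; do
    # Set the variable for this one command only; "$@" is the program to run.
    if output=$(I_MPI_OFI_PROVIDER=$provider "$@" 2>&1) &&
       ! grep -Eq 'UCX +ERROR' <<<"$output"; then
      echo "$provider"
      return 0
    fi
  done
  return 1
}

# Usage on the cluster (not executed here): first_working_provider ./hwi
```

The function prints the first provider whose run passes the check and returns non-zero if none does, so it can be used in an `if` before the real run.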

 


4 Replies
Steve_Lionel
Black Belt Retired Employee
826 Views

Given the error messages, you are not building against Intel MPI but rather OpenUCX.

As such, the configuration is not supported by Intel. You could ask in the OpenUCX mailing list. I'd see first if you can get this running in shared mode using the provided Intel MPI library.


Orion_P_
New Contributor I
676 Views

Thanks for this.  I was seeing the same thing with 2020.4 on our Scientific Linux 7.9 machines.

SamM
Novice
655 Views

I also found another working configuration later on, after some extensive digging. I installed UCX 1.9 via Spack (spack.io), and that allowed me to use MLX over InfiniBand. I don't think the minimum UCX version required by Intel MPI (1.6 in the manual, I think) is high enough. All I know is that it took a while to get a version that worked right and used InfiniBand!

#!/bin/bash
#SBATCH -t 00:10:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=10
#SBATCH --ntasks-per-core=1

module load intel/2020.4

export I_MPI_REMOVED_VAR_WARNING=0
export I_MPI_VAR_CHECK_SPELLING=0

ifort hello_world_intel.f90 -coarray=distributed -coarray-config-file=caf_config.txt -o hwi

# Make the config file; note the ucx settings
echo -genvall -genv UCX_TLS=rc,ud,cma,self -genv I_MPI_FABRICS=shm:ofi -envall -n ${SLURM_NTASKS} ./hwi > caf_config.txt

# Check ucx settings
ucx_info -v

# MPI Settings
export I_MPI_DEBUG=100
export I_MPI_FALLBACK=0
export FI_PROVIDER=mlx

./hwi > hwi.std.out
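
Editor's note: since the UCX version turned out to matter here, a job script could check it before launching. The sketch below is one way to do that in bash. The helper name `ucx_at_least` is hypothetical, 1.9 is simply the version reported to work in this post (not a documented requirement), and taking the first dotted number in the `ucx_info -v` output as the version is an assumption, because that output differs between UCX releases.

```shell
#!/bin/bash
# Sketch: check whether a UCX version string meets a minimum.
# ucx_at_least is a hypothetical helper; 1.9 is the version the poster
# reported working, not a documented requirement.
ucx_at_least() {
  local minimum=$1 text=$2 found
  # Take the first dotted number in the text as the version (an assumption
  # about the shape of `ucx_info -v` output, which varies between releases).
  found=$(grep -Eo '[0-9]+\.[0-9]+(\.[0-9]+)?' <<<"$text" | head -n1)
  [ -n "$found" ] || return 2
  # sort -V orders version strings; the minimum must sort first (or tie).
  [ "$(printf '%s\n%s\n' "$minimum" "$found" | sort -V | head -n1)" = "$minimum" ]
}

# Only query UCX when the tool is actually installed.
if command -v ucx_info >/dev/null 2>&1; then
  if ucx_at_least 1.9 "$(ucx_info -v 2>/dev/null)"; then
    echo "UCX is at least 1.9"
  else
    echo "UCX is older than 1.9 or its version could not be parsed"
  fi
fi
```

The comparison relies on `sort -V` from GNU coreutils, so it handles cases such as 1.10 being newer than 1.9.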

 
