Beginner

UCX error with Coarray Distributed Mode


I've been trying to get a coarray distributed program to work on my local university cluster and have run into a few problems. I'm running the following code:

program hello_world
  use iso_fortran_env
  implicit none
  character(len=32) ::  hostname

  call get_environment_variable('HOSTNAME',hostname)
  write(*,'(2(a,i0),a)') "Hello from image: ", this_image(), " of ", &
                          num_images(), " on host: " // trim(hostname)
end program

 

and the submission script for the batch scheduler (Slurm) is:

#!/bin/bash
#SBATCH -t 00:10:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=10
#SBATCH --ntasks-per-core=1

module load intel/2020.2

export I_MPI_REMOVED_VAR_WARNING=0
export I_MPI_VAR_CHECK_SPELLING=0

ifort hello_world_intel.f90 -coarray=distributed -coarray-config-file=caf_config.txt -o hwi

# caf_config.txt holds the mpiexec-style launch options that the coarray binary reads at run time
echo -genvall -genv I_MPI_FABRICS=shm:tcp -envall -n 20 ./hwi > caf_config.txt

./hwi > hwi.std.out

 

I get a bunch of errors that look like the following:

[1605965784.490902] [bhp0001:10682:0]          ib_md.c:438  UCX  ERROR ibv_reg_mr(address=0x1000, length=9223372036854771712, access=0xf) failed: Cannot allocate memory

[1605965784.490919] [bhp0001:10682:0]         ucp_mm.c:110  UCX  ERROR failed to register address 0x1000 length 9223372036854771711 on md[1]=ib/mlx4_0: Input/output error

 

I've tried it with both Intel 2020.2 and 2019.2. Version 2019.2 prints the hello-world message, but 2020.2 does not. Both versions emit the UCX errors.

Any ideas what the fix is?



2 Replies
Black Belt Retired Employee

Given the error messages, you are not building against Intel MPI but rather OpenUCX.

As such, the configuration is not supported by Intel. You could ask on the OpenUCX mailing list. I'd first see whether you can get this running in shared-memory mode (-coarray=shared) with the provided Intel MPI library.
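
For reference, a minimal sketch of that shared-memory test, reusing the file and binary names from the post above (the module name and image count are assumptions; adjust them for your cluster):

# Single-node sanity check: build the same program with the shared-memory
# coarray mode instead of distributed mode, so no inter-node fabric is involved.
module load intel/2020.2

ifort hello_world_intel.f90 -coarray=shared -coarray-num-images=10 -o hwi_shared

# Runs all images on the local node; no caf_config.txt or batch scheduler needed.
./hwi_shared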

Accepted Solution
Beginner

It turns out that the default Intel MPI inter-node communication fabric wasn't supported on my system. I experimented with the runtime settings, and what ended up solving the issue was setting the I_MPI_OFI_PROVIDER environment variable to select the right fabric. The Intel MPI documentation on fabric selection was useful.

#!/bin/bash
#SBATCH -t 00:10:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=10
#SBATCH --ntasks-per-core=1

module load intel/2020.2

export I_MPI_REMOVED_VAR_WARNING=0
export I_MPI_VAR_CHECK_SPELLING=0

ifort hello_world_intel.f90 -coarray=distributed -coarray-config-file=caf_config.txt -o hwi

echo -genvall -genv I_MPI_FABRICS=shm:tcp -envall -n 20 ./hwi > caf_config.txt

# Non-working version(s)
# export I_MPI_OFI_PROVIDER=MLX
# export I_MPI_OFI_PROVIDER=PSM2

# Working version(s). Verbs is preferred here
# export I_MPI_OFI_PROVIDER=TCP
export I_MPI_OFI_PROVIDER=Verbs

./hwi > hwi.std.out
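
In case it helps anyone else chasing fabric problems: the lines below are a rough sketch, not part of my actual job script, showing how you could check which libfabric provider Intel MPI picks at run time (this assumes the libfabric fi_info utility is installed on the compute nodes):

# List the libfabric providers available on this node.
fi_info -l

# Have Intel MPI report the provider it selects when the job starts.
export I_MPI_DEBUG=5
./hwi > hwi.std.out 2> hwi.std.err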

 
