I've been trying to get a coarray distributed program to work on my local university cluster and have been running into a few problems. I run the following code:
program hello_world
  use iso_fortran_env
  implicit none
  character(len=32) :: hostname
  call get_environment_variable('HOSTNAME', hostname)
  write(*,'(2(a,i0),a)') "Hello from image: ", this_image(), " of ", &
     num_images(), " on host: " // trim(hostname)
end program
and the submission script to the batch scheduler is:
#!/bin/bash
#SBATCH -t 00:10:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=10
#SBATCH --ntasks-per-core=1
module load intel/2020.2
export I_MPI_REMOVED_VAR_WARNING=0
export I_MPI_VAR_CHECK_SPELLING=0
ifort hello_world_intel.f90 -coarray=distributed -coarray-config-file=caf_config.txt -o hwi
echo -genvall -genv I_MPI_FABRICS=shm:tcp -envall -n 20 ./hwi > caf_config.txt
./hwi > hwi.std.out
I get a bunch of errors that look like the following:
[1605965784.490902] [bhp0001:10682:0] ib_md.c:438 UCX ERROR ibv_reg_mr(address=0x1000, length=9223372036854771712, access=0xf) failed: Cannot allocate memory
[1605965784.490919] [bhp0001:10682:0] ucp_mm.c:110 UCX ERROR failed to register address 0x1000 length 9223372036854771711 on md[1]=ib/mlx4_0: Input/output error
I've tried it with both Intel 2020.2 and 2019.2. Version 2019.2 prints the hello-world message, but 2020.2 does not. Both versions spit out the UCX errors.
Any ideas what the fix is?
Given the error messages, you are not building against Intel MPI but rather OpenUCX.
As such, the configuration is not supported by Intel. You could ask on the OpenUCX mailing list. I'd first see whether you can get this running in shared-memory mode using the provided Intel MPI library.
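For what it's worth, a single-node shared-memory build is the quickest sanity check; a rough sketch using the file names from your post (the image count is just an example):
# Build with shared-memory coarrays (single node, no inter-node fabric involved)
ifort hello_world_intel.f90 -coarray=shared -o hws
# Choose the number of images at run time; otherwise it typically defaults to the core count
export FOR_COARRAY_NUM_IMAGES=8
./hws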
It turns out that the default Intel MPI inter-node communication fabric wasn't supported on my system. I experimented with the runtime settings, and what ended up solving the issue was setting the I_MPI_OFI_PROVIDER environment variable to select the right fabric. The Intel MPI documentation on OFI providers was useful.
#!/bin/bash
#SBATCH -t 00:10:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=10
#SBATCH --ntasks-per-core=1
module load intel/2020.2
export I_MPI_REMOVED_VAR_WARNING=0
export I_MPI_VAR_CHECK_SPELLING=0
ifort hello_world_intel.f90 -coarray=distributed -coarray-config-file=caf_config.txt -o hwi
echo -genvall -genv I_MPI_FABRICS=shm:tcp -envall -n 20 ./hwi > caf_config.txt
# Non-working version(s) on this cluster
# export I_MPI_OFI_PROVIDER=MLX
# export I_MPI_OFI_PROVIDER=PSM2
# Working version(s). Verbs is preferred here
# export I_MPI_OFI_PROVIDER=TCP
export I_MPI_OFI_PROVIDER=Verbs
./hwi > hwi.std.out
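For anyone doing the same trial and error, two diagnostics may help narrow down which providers exist on a node and which one the runtime actually picks (a side note, not part of the fix above):
# List the libfabric providers available on this node (fi_info ships with libfabric/Intel MPI)
fi_info -l
# With this set, Intel MPI reports the selected libfabric provider at start-up
export I_MPI_DEBUG=5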
Thanks for this. I was seeing the same thing with 2020.4 on our Scientific Linux 7.9 machines.
I also found another working configuration later on, after some extensive digging. I ended up installing UCX 1.9 via Spack (spack.io), which allowed me to use MLX over InfiniBand. I don't think the minimum UCX version Intel MPI asks for (1.6, in the manual I think) is high enough. All I know is that it took a while to get a version that worked right and actually used InfiniBand!
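Roughly, the Spack side of it looked like this (a sketch from memory; the exact spec, compiler, and module setup will differ by site):
# Build and load UCX 1.9 through Spack
spack install ucx@1.9.0
spack load ucx@1.9.0
# Confirm which UCX version the tools now resolve to
ucx_info -v
With that in place, the batch script became: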
#!/bin/bash
#SBATCH -t 00:10:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=10
#SBATCH --ntasks-per-core=1
module load intel/2020.4
export I_MPI_REMOVED_VAR_WARNING=0
export I_MPI_VAR_CHECK_SPELLING=0
ifort hello_world_intel.f90 -coarray=distributed -coarray-config-file=caf_config.txt -o hwi
# Make the config file; note the UCX settings
echo -genvall -genv UCX_TLS=rc,ud,cma,self -genv I_MPI_FABRICS=shm:ofi -envall -n ${SLURM_NTASKS} ./hwi > caf_config.txt
# Check ucx settings
ucx_info -v
# MPI Settings
export I_MPI_DEBUG=100
export I_MPI_FALLBACK=0
export FI_PROVIDER=mlx
./hwi > hwi.std.out