Intel® MPI Library

Errors using Intel MPI distributed coarrays over InfiniBand with MLX

as14
Beginner

Hi, I am having problems using mlx to communicate over InfiniBand with Intel Coarray Fortran, as described here: https://www.intel.com/content/www/us/en/developer/articles/technical/improve-performance-and-stability-with-intel-mpi-library-on-infiniband.html

 

My code works for I_MPI_OFI_PROVIDER=verbs, but hangs when using I_MPI_OFI_PROVIDER=mlx. It also hangs when using FI_PROVIDER=mlx. 
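As a sanity check that the mlx provider is even visible to the bundled libfabric, something like the following can be run (fi_info is a standard libfabric utility; whether and where it is installed depends on the environment):

fi_info -p mlx | head   # should list the mlx provider if UCX support is present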

 

I'm using intel-oneapi-mpi/2021.4.0, intel-oneapi-compilers/2022.0.2 and ucx/1.12.1.


Setup:

echo '-n 2 ./a.out' > config.caf
ifort -coarray=distributed -coarray-config-file=config.caf -o a.out main.f90
 
Using SLURM, I execute:
ucx_info -v
ucx_info -d | grep Transport
ibv_devinfo
lspci | grep Mellanox
export I_MPI_OFI_PROVIDER=mlx  # same result for FI_PROVIDER=mlx
./a.out
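For illustration: since the crash below happens in for_rtl_ICAF_INIT, i.e. during coarray runtime start-up before any user code runs, even a minimal coarray program along these lines should hit the same path. This is only a sketch, not the exact code I am running:

program caf_min
  implicit none
  integer :: img, nimg
  integer :: x[*]            ! trivial scalar coarray; its communication window is set up at start-up

  img  = this_image()
  nimg = num_images()
  x    = img                 ! each image stores its own index
  sync all                   ! make remote values visible
  if (img == 1) print '(a,i0,a,i0)', 'value on image ', nimg, ': ', x[nimg]
end program caf_min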
 
I saw it mentioned somewhere that this is a known bug with some workarounds. Is this still the case? Is there anything else I am doing wrong?
 
Using I_MPI_DEBUG=100, I get:
 

# UCT version=1.12.1 revision dc92435
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --without-knem --with-rdmacm --without-rocm --without-xpmem --without-fuse3 --without-ugni
# Transport: posix
# Transport: sysv
# Transport: self
# Transport: tcp
# Transport: tcp
# Transport: tcp
# Transport: rc_verbs
# Transport: rc_mlx5
# Transport: dc_mlx5
# Transport: ud_verbs
# Transport: ud_mlx5
# Transport: cma
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 20.32.1010
node_guid: 08c0:eb03:002c:f98c
sys_image_guid: 08c0:eb03:002c:f98c
vendor_id: 0x02c9
vendor_part_id: 4123
hw_ver: 0x0
board_id: MT_0000000223
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 91
port_lid: 67
port_lmc: 0x00
link_layer: InfiniBand

a1:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
IPL WARN> Not all cpus are available, switch to I_MPI_PIN_ORDER=compact. (Total: 256 Available: 1)
[0] MPI startup(): Run 'pmi_process_mapping' nodemap algorithm
[0] MPI startup(): Intel(R) MPI Library, Version 2021.4 Build 20210831 (id: 758087adf)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.IPL WARN> Not all cpus are available, switch to I_MPI_PIN_ORDER=compact. (Total: 256 Available: 1)

[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.0-impi
libfabric:1779943:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:1779943:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:1779943:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:1779943:core:core:ofi_register_provider():474<info> registering provider: sockets (113.0)
libfabric:1779943:core:core:ofi_register_provider():502<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:1779943:core:core:ofi_register_provider():474<info> registering provider: psm2 (113.0)
libfabric:1779943:core:core:ofi_register_provider():502<info> "psm2" filtered by provider include/exclude list, skipping
libfabric:1779943:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.0)
libfabric:1779943:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:1779943:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:1779943:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:1779943:core:core:ofi_register_provider():474<info> registering provider: tcp (113.0)
libfabric:1779943:core:core:ofi_register_provider():502<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:1779943:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:1779943:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:1779943:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:1779943:core:core:ofi_register_provider():474<info> registering provider: shm (113.0)
libfabric:1779943:core:core:ofi_register_provider():502<info> "shm" filtered by provider include/exclude list, skipping
libfabric:1779943:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:1779943:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:1779943:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:1779943:core:core:ofi_register_provider():474<info> registering provider: verbs (113.0)
libfabric:1779943:core:core:ofi_register_provider():502<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:1779943:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:1779943:psm3:core:fi_prov_ini():680<info> build options: VERSION=1101.0=11.1.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:1779943:core:core:ofi_register_provider():474<info> registering provider: psm3 (1101.0)
libfabric:1779943:core:core:ofi_register_provider():502<info> "psm3" filtered by provider include/exclude list, skipping
libfabric:1779943:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.0)
libfabric:1779943:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:1779943:core:core:fi_getinfo_():1138<info> Found provider with the highest priority mlx, must_use_util_prov = 0
[0] MPI startup(): libfabric provider: mlx
libfabric:1779943:core:core:fi_fabric_():1423<info> Opened fabric: mlx
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 64, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrnamelen: 1024
[0] MPI startup(): File "" not found
[0] MPI startup(): Load tuning file: "/gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-mpi-2021.4.0-2e7zm7zu5t7iqbzr7xhjkwivxg3ry5bh/mpi/2021.4.0/etc/tuning_generic_shm-ofi.dat"
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 1779943 n3071-003 {0}
[0] MPI startup(): 1 3984445 n3071-004 {0}
[0] MPI startup(): I_MPI_ROOT=/gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-mpi-2021.4.0-2e7zm7zu5t7iqbzr7xhjkwivxg3ry5bh/mpi/2021.4.0
[0] MPI startup(): I_MPI_FAULT_CONTINUE=1
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_PIN_DOMAIN=1
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=100
[0] MPI startup(): I_MPI_REMOVED_VAR_WARNING=0
[0] MPI startup(): I_MPI_VAR_CHECK_SPELLING=0
[0] MPI startup(): I_MPI_SPIN_COUNT=1
[0] MPI startup(): I_MPI_THREAD_YIELD=2
[0] MPI startup(): I_MPI_SILENT_ABORT=1
[n3071-003:1779943:0:1779943] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x31)
[n3071-004:3984445:0:3984445] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x31)
==== backtrace (tid:1779943) ====
0 /lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x14deacde64fc]
==== backtrace (tid:3984445) ====
1 /lib64/libucs.so.0(+0x2a6dc) [0x14deacde66dc]
0 /lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x150673aa34fc]
2 /lib64/libucs.so.0(+0x2a8aa) [0x14deacde68aa]
1 /lib64/libucs.so.0(+0x2a6dc) [0x150673aa36dc]
3 /lib64/libpthread.so.0(+0x12c20) [0x14deb399ec20]
2 /lib64/libucs.so.0(+0x2a8aa) [0x150673aa38aa]
4 /lib64/ucx/libuct_ib.so.0(+0x294cd) [0x14deac3e24cd]
3 /lib64/libpthread.so.0(+0x12c20) [0x15067a65bc20]
5 /lib64/ucx/libuct_ib.so.0(+0x29918) [0x14deac3e2918]
4 /lib64/ucx/libuct_ib.so.0(+0x294cd) [0x15067309f4cd]
6 /lib64/ucx/libuct_ib.so.0(+0x2266d) [0x14deac3db66d]
5 /lib64/ucx/libuct_ib.so.0(+0x29918) [0x15067309f918]
7 /lib64/ucx/libuct_ib.so.0(+0x22ca8) [0x14deac3dbca8]
6 /lib64/ucx/libuct_ib.so.0(+0x2266d) [0x15067309866d]
8 /lib64/libucs.so.0(ucs_rcache_get+0x2a6) [0x14deacdeca66]
7 /lib64/ucx/libuct_ib.so.0(+0x22ca8) [0x150673098ca8]
9 /lib64/ucx/libuct_ib.so.0(+0x230d8) [0x14deac3dc0d8]
8 /lib64/libucs.so.0(ucs_rcache_get+0x2a6) [0x150673aa9a66]
10 /lib64/libucp.so.0(ucp_mem_rereg_mds+0x31f) [0x14dead495bcf]
11 /lib64/libucp.so.0(+0x2f432) [0x14dead496432]
12 /lib64/libucp.so.0(ucp_mem_map+0x13e) [0x14dead4967ee]
13 /gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-mpi-2021.4.0-2e7zm7zu5t7iqbzr7xhjkwivxg3ry5bh/mpi/2021.4.0//libfabric/lib/prov/libmlx-fi.so(+0x940d) [0x14dead72240d]
14 /gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-mpi-2021.4.0-2e7zm7zu5t7iqbzr7xhjkwivxg3ry5bh/mpi/2021.4.0//libfabric/lib/prov/libmlx-fi.so(+0x94e8) [0x14dead7224e8]
15 /gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-mpi-2021.4.0-2e7zm7zu5t7iqbzr7xhjkwivxg3ry5bh/mpi/2021.4.0//libfabric/lib/prov/libmlx-fi.so(+0x952a) [0x14dead72252a]
16 /gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-mpi-2021.4.0-2e7zm7zu5t7iqbzr7xhjkwivxg3ry5bh/mpi/2021.4.0//lib/release/libmpi.so.12(+0x614ee4) [0x14deb20e6ee4]
17 /gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-mpi-2021.4.0-2e7zm7zu5t7iqbzr7xhjkwivxg3ry5bh/mpi/2021.4.0//lib/release/libmpi.so.12(+0x614773) [0x14deb20e6773]
18 /gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-mpi-2021.4.0-2e7zm7zu5t7iqbzr7xhjkwivxg3ry5bh/mpi/2021.4.0//lib/release/libmpi.so.12(+0x233483) [0x14deb1d05483]
19 /gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-mpi-2021.4.0-2e7zm7zu5t7iqbzr7xhjkwivxg3ry5bh/mpi/2021.4.0//lib/release/libmpi.so.12(+0x2502e2) [0x14deb1d222e2]
20 /gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-mpi-2021.4.0-2e7zm7zu5t7iqbzr7xhjkwivxg3ry5bh/mpi/2021.4.0//lib/release/libmpi.so.12(MPI_Win_create+0x3c2) [0x14deb2278c32]
21 /gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-compilers-2022.0.2-yzi4tsud2tqh4s6ykg2ulr7pp7guyiej/compiler/2022.0.2/linux/compiler/lib/intel64_lin/libicaf.so(for_rtl_ICAF_INIT+0xba2) [0x14deb4101162]
22 ./a.out() [0x407ee4]
23 ./a.out() [0x40441d]
24 /lib64/libc.so.6(__libc_start_main+0xf3) [0x14deb35ea493]
25 ./a.out() [0x40432e]
=================================
9 /lib64/ucx/libuct_ib.so.0(+0x230d8) [0x1506730990d8]
10 /lib64/libucp.so.0(ucp_mem_rereg_mds+0x31f) [0x150674152bcf]
11 /lib64/libucp.so.0(+0x2f432) [0x150674153432]
12 /lib64/libucp.so.0(ucp_mem_map+0x13e) [0x1506741537ee]
13 /gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-mpi-2021.4.0-2e7zm7zu5t7iqbzr7xhjkwivxg3ry5bh/mpi/2021.4.0//libfabric/lib/prov/libmlx-fi.so(+0x940d) [0x1506743df40d]
14 /gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-mpi-2021.4.0-2e7zm7zu5t7iqbzr7xhjkwivxg3ry5bh/mpi/2021.4.0//libfabric/lib/prov/libmlx-fi.so(+0x94e8) [0x1506743df4e8]
15 /gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-mpi-2021.4.0-2e7zm7zu5t7iqbzr7xhjkwivxg3ry5bh/mpi/2021.4.0//libfabric/lib/prov/libmlx-fi.so(+0x952a) [0x1506743df52a]
16 /gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-mpi-2021.4.0-2e7zm7zu5t7iqbzr7xhjkwivxg3ry5bh/mpi/2021.4.0//lib/release/libmpi.so.12(+0x614ee4) [0x150678da3ee4]
17 /gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-mpi-2021.4.0-2e7zm7zu5t7iqbzr7xhjkwivxg3ry5bh/mpi/2021.4.0//lib/release/libmpi.so.12(+0x614773) [0x150678da3773]
18 /gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-mpi-2021.4.0-2e7zm7zu5t7iqbzr7xhjkwivxg3ry5bh/mpi/2021.4.0//lib/release/libmpi.so.12(+0x233483) [0x1506789c2483]
19 /gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-mpi-2021.4.0-2e7zm7zu5t7iqbzr7xhjkwivxg3ry5bh/mpi/2021.4.0//lib/release/libmpi.so.12(+0x2502e2) [0x1506789df2e2]
20 /gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-mpi-2021.4.0-2e7zm7zu5t7iqbzr7xhjkwivxg3ry5bh/mpi/2021.4.0//lib/release/libmpi.so.12(MPI_Win_create+0x3c2) [0x150678f35c32]
21 /gpfs/opt/sw/spack-0.17.1/opt/spack/linux-almalinux8-zen3/gcc-11.2.0/intel-oneapi-compilers-2022.0.2-yzi4tsud2tqh4s6ykg2ulr7pp7guyiej/compiler/2022.0.2/linux/compiler/lib/intel64_lin/libicaf.so(for_rtl_ICAF_INIT+0xba2) [0x15067adbe162]
22 ./a.out() [0x407ee4]
23 ./a.out() [0x40441d]
24 /lib64/libc.so.6(__libc_start_main+0xf3) [0x15067a2a7493]
25 ./a.out() [0x40432e]
=================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 1779943 RUNNING AT n3071-003
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 3984445 RUNNING AT n3071-004
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

SantoshY_Intel
Moderator

Hi,

 

Thanks for posting in the Intel communities.

 

From your debug log, we can see that you are using Intel MPI 2021.4 and trying to run your application on two nodes using the MLX fabric provider.

 

Could you please provide the following details to help us investigate your issue?

  1. Operating system & its version
  2. CPU details
  3. Sample reproducer code
  4. The expected output, or the complete output log when FI_PROVIDER=verbs.

 

>>"I saw it mentioned somewhere that this is a known bug with some workarounds. Is this still the case? Is there anything else I am doing wrong?"

Could you please provide the link you are referring to, or tell us the exact workaround you mean?

Since most of the known issues in Intel MPI 2021.4 were fixed in later releases, could you please try the latest Intel MPI 2021.8 and get back to us if the problem persists?
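To confirm which release is actually picked up at run time, something like the following can be used (exact paths depend on your module/Spack setup):

mpirun -V            # reports the Intel MPI Library version in use
echo $I_MPI_ROOT     # install prefix the job environment points at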

 

Thanks & Regards,

Santosh

 

 

 

as14
Beginner

Thanks for getting back to me!

 

>>"1. Operating system & its version"

cat /etc/os-release

NAME="AlmaLinux"
VERSION="8.5 (Arctic Sphynx)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.5"
PLATFORM_ID="platform:el8"
PRETTY_NAME="AlmaLinux 8.5 (Arctic Sphynx)"
ANSI_COLOR="0;34"
CPE_NAME="cpe:/o:almalinux:almalinux:8::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"

ALMALINUX_MANTISBT_PROJECT="AlmaLinux-8"
ALMALINUX_MANTISBT_PROJECT_VERSION="8.5"

 

>>"2. CPU details"

$ lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7713 64-Core Processor
Stepping: 1
CPU MHz: 2000.000
CPU max MHz: 3720.7029
CPU min MHz: 1500.0000
BogoMIPS: 4000.16
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-15,128-143
NUMA node1 CPU(s): 16-31,144-159
NUMA node2 CPU(s): 32-47,160-175
NUMA node3 CPU(s): 48-63,176-191
NUMA node4 CPU(s): 64-79,192-207
NUMA node5 CPU(s): 80-95,208-223
NUMA node6 CPU(s): 96-111,224-239
NUMA node7 CPU(s): 112-127,240-255
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca sme sev sev_es

 

>>"3. Sample reproducer code"

Attached.

 

>>"4. Expected output or provide us the complete output log when FI_PROVIDER=verbs."

The output log when using verbs (which shows the behavior I also expect from mlx) is attached.

 

>>"Workaround link"

https://blog.hpc.qmul.ac.uk/intel-release-2020_4.html, under the section "Known issues using Fortran coarrays".

 

Unfortunately, I cannot test with a newer version of Intel MPI, as it is not installed on our cluster.

 

Thanks very much again for the help!

as14
Beginner

Additionally, I am attaching the output file from running with

 

export FI_PROVIDER=mlx
export I_MPI_OFI_PROVIDER=mlx
 
instead of "export I_MPI_OFI_PROVIDER=verbs".
 
Thanks again!!
SantoshY_Intel
Moderator

Hi,


Thank you for your inquiry. We can only offer direct support for Intel hardware platforms that the Intel® oneAPI product supports. Intel provides instructions on how to compile oneAPI code for both CPUs and a wide range of GPU accelerators: https://intel.github.io/llvm-docs/GetStartedGuide.html


Thanks & Regards,

Santosh


as14
Beginner

Hi Santosh,

 

That is not actually the problem; the issue is Intel's unreliable support for coarrays over InfiniBand interconnects.

 

Intel officially supports MPI over InfiniBand. See here: https://www.intel.com/content/www/us/en/developer/articles/technical/improve-performance-and-stability-with-intel-mpi-library-on-infiniband.html

and here: https://www.intel.com/content/www/us/en/developer/articles/technical/mpi-compatibility-nvidia-mellanox-ofed-infiniband.html

 

However, a glitch in the coarray implementation has been found, with the suggested fix being to set MPIR_CVAR_CH4_OFI_ENABLE_RMA=0 (mentioned, for example, here: https://blog.hpc.qmul.ac.uk/intel-release-2020_4.html). This does allow the communication to work, but unusably slowly. Firstly, could you let me know why this setting makes it work, why it becomes so much slower, and what it actually means?
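Concretely, the workaround amounts to setting the variable before the run, e.g.:

export I_MPI_OFI_PROVIDER=mlx
export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0   # workaround from the QMUL post; runs, but unusably slowly
./a.out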

 

Does Intel plan to properly fix this bug in a future release? If not, I will have to rewrite my whole code in MPI, as it cannot be ported anywhere beyond Omni-Path interconnects (which are being superseded by InfiniBand interconnects on HPC systems).

 

Thanks for any help,

James
