Intel® MPI Library

MLX provider not working with oneAPI 2022.2/MPI 2021.6

Antonio_D
New Contributor I

Hello,

I have an MLX provider issue with Intel MPI 2021.6 with all code built with oneAPI 2022.2.  My script:

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export MKL_DYNAMIC=FALSE
export UCX_TLS=sm,rc_mlx5,dc_mlx5,ud_mlx5,self
export LD_PRELOAD=$I_MPI_ROOT/lib/libmpi_shm_heap_proxy.so
export I_MPI_HYDRA_BOOTSTRAP=lsf
export I_MPI_HYDRA_RMK=lsf
export I_MPI_HYDRA_TOPOLIB=hwloc
export I_MPI_HYDRA_IFACE=ib0
export I_MPI_PLATFORM=clx-ap
export I_MPI_EXTRA_FILESYSTEM=1
export I_MPI_EXTRA_FILESYSTEM_FORCE=gpfs
export I_MPI_FABRICS=shm:ofi
export I_MPI_SHM=clx-ap
export I_MPI_SHM_HEAP=1
export I_MPI_OFI_PROVIDER=mlx
export I_MPI_PIN_CELL=core
export I_MPI_DEBUG=6
mpirun -n 96 ./executable

 The output:

[0] MPI startup(): Intel(R) MPI Library, Version 2021.6 Build 20220227 (id: 28877f3f32)
[0] MPI startup(): Copyright (C) 2003-2022 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
impi_shm_mbind_local(): mbind(p=0x14ad3ea72000, size=4294967296) error=1 "Operation not permitted"

//SNIP//

impi_shm_mbind_local(): mbind(p=0x1458ca7f7000, size=4294967296) error=1 "Operation not permitted"

[0] MPI startup(): libfabric version: 1.13.2rc1-impi
Abort(2139535) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(178)........:
MPID_Init(1532)..............:
MPIDI_OFI_mpi_init_hook(1512):
open_fabric(2566)............:
find_provider(2684)..........: OFI fi_getinfo() failed (ofi_init.c:2684:find_provider:No data available)
Abort(2139535) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(178)........:
MPID_Init(1532)..............:
MPIDI_OFI_mpi_init_hook(1512):
open_fabric(2566)............:
find_provider(2684)..........: OFI fi_getinfo() failed (ofi_init.c:2684:find_provider:No data available)

 

I do have Mellanox UCX Framework v1.8 installed and it is recognized:

[dipasqua@ec-hub1-sc1 ~]$ ucx_info -v
# UCT version=1.8.0 revision
# configured with: --prefix=/apps/rocs/2020.08/cascadelake/software/UCX/1.8.0-GCCcore-9.3.0 --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --with-rdmacm=/apps/rocs/2020.08/prefix/usr --with-verbs=/apps/rocs/2020.08/prefix/usr --with-knem=/apps/rocs/2020.08/prefix/usr --enable-optimizations --enable-cma --enable-mt --without-java --disable-doxygen-doc
[dipasqua@ec-hub1-sc1 ~]$ fi_info -l
psm2:
version: 113.20
mlx:
version: 1.4
psm3:
version: 1102.0
ofi_rxm:
version: 113.20
verbs:
version: 113.20
tcp:
version: 113.20
sockets:
version: 113.20
shm:
version: 114.0
ofi_hook_noop:
version: 113.20
[dipasqua@ec-hub1-sc1 ~]$ ucx_info -d | grep Transport
# Transport: posix
# Transport: sysv
# Transport: self
# Transport: tcp
# Transport: tcp
# Transport: rc_verbs
# Transport: rc_mlx5
# Transport: dc_mlx5
# Transport: ud_verbs
# Transport: ud_mlx5
# Transport: rc_verbs
# Transport: rc_mlx5
# Transport: ud_verbs
# Transport: ud_mlx5
# Transport: rc_verbs
# Transport: rc_mlx5
# Transport: dc_mlx5
# Transport: ud_verbs
# Transport: ud_mlx5
# Transport: cma
# Transport: knem 
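
A scripted form of the provider check above may be handy when comparing environments. The sketch below reads `fi_info -l` output on stdin and reports whether a provider name is listed; `has_provider` is a hypothetical helper, not part of libfabric. Note that a listed provider can still be rejected at `fi_getinfo()` time, which is the failure seen here.

```shell
# has_provider NAME: read `fi_info -l` output on stdin and succeed
# if a provider called NAME is listed (provider lines look like "mlx:").
has_provider() {
    # Strip leading/trailing blanks so indented listings also match.
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' | grep -qx "$1:"
}

# Example with a fragment of the listing shown above.
listing='psm2:
version: 113.20
mlx:
version: 1.4
verbs:
version: 113.20'

if printf '%s\n' "$listing" | has_provider mlx; then
    echo "mlx is listed"
else
    echo "mlx is not listed"
fi
```

On a real node the input would come from the tool itself, for example `fi_info -l | has_provider mlx`.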

 

However, everything works fine with oneAPI 2022.1 (Intel MPI 2021.5) with all settings the same.  Any ideas, or is this a bug?

 

Regards,

Antonio

ShivaniK_Intel
Moderator
3,167 Views

Hi,


Could you please raise the memory limit in a test job?


Example, line #5 in fhibench.sh:

Before:

#BSUB -R rusage[mem=4G]

After:

#BSUB -R rusage[mem=10G]

This is just to check whether the issue is related to the memory binding of Intel MPI.


Please let us know the output after the changes.
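
For a quick test, the same change can be applied with a one-line substitution instead of editing the file by hand. This is only a sketch: `bump_mem` is a hypothetical helper, and it assumes the directive is written exactly as `#BSUB -R rusage[mem=...]` at the start of a line.

```shell
# bump_mem NEW: rewrite the rusage memory request read on stdin to NEW (e.g. 10G).
bump_mem() {
    sed "s/^#BSUB -R rusage\[mem=[^]]*]/#BSUB -R rusage[mem=$1]/"
}

# Example on two directive lines; only the rusage line is changed.
printf '%s\n' '#BSUB -n 32' '#BSUB -R rusage[mem=4G]' | bump_mem 10G
```

To produce a modified copy of the job script, something like `bump_mem 10G < fhibench.sh > fhibench_10G.sh` could be used (the output file name is made up here).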


Thanks & Regards

Shivani


Antonio_D
New Contributor I

Hello,

After changing #BSUB -R rusage[mem=4G] to #BSUB -R rusage[mem=10G], the output is the same:

[0] MPI startup(): Intel(R) MPI Library, Version 2021.8 Build 20221129 (id: 339ec755a1)
[0] MPI startup(): Copyright (C) 2003-2022 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[20] impi_shm_mbind_local(): mbind(p=0x15080b28e000, size=4294967296) error=1 "Operation not permitted"

[31] impi_shm_mbind_local(): mbind(p=0x14f1ea4c6000, size=4294967296) error=1 "Operation not permitted"

[25] impi_shm_mbind_local(): mbind(p=0x153c82356000, size=4294967296) error=1 "Operation not permitted"

[27] impi_shm_mbind_local(): mbind(p=0x14a90a295000, size=4294967296) error=1 "Operation not permitted"

[0] impi_shm_mbind_local(): mbind(p=0x1486a2f93000, size=4294967296) error=1 "Operation not permitted"

[1] impi_shm_mbind_local(): mbind(p=0x14b3bb69f000, size=4294967296) error=1 "Operation not permitted"

[2] impi_shm_mbind_local(): mbind(p=0x151e5966e000, size=4294967296) error=1 "Operation not permitted"

[3] impi_shm_mbind_local(): mbind(p=0x153683af2000, size=4294967296) error=1 "Operation not permitted"

[4] impi_shm_mbind_local(): mbind(p=0x1517179ce000, size=4294967296) error=1 "Operation not permitted"

[5] impi_shm_mbind_local(): mbind(p=0x152a17766000, size=4294967296) error=1 "Operation not permitted"

[6] impi_shm_mbind_local(): mbind(p=0x1500b0773000, size=4294967296) error=1 "Operation not permitted"

[7] impi_shm_mbind_local(): mbind(p=0x14682fde6000, size=4294967296) error=1 "Operation not permitted"

[8] impi_shm_mbind_local(): mbind(p=0x1456666e4000, size=4294967296) error=1 "Operation not permitted"

[9] impi_shm_mbind_local(): mbind(p=0x14f1d6a73000, size=4294967296) error=1 "Operation not permitted"

[10] impi_shm_mbind_local(): mbind(p=0x14a841642000, size=4294967296) error=1 "Operation not permitted"

[11] impi_shm_mbind_local(): mbind(p=0x152523e55000, size=4294967296) error=1 "Operation not permitted"

[12] impi_shm_mbind_local(): mbind(p=0x14c5b945b000, size=4294967296) error=1 "Operation not permitted"

[13] impi_shm_mbind_local(): mbind(p=0x1528bb6e0000, size=4294967296) error=1 "Operation not permitted"

[14] impi_shm_mbind_local(): mbind(p=0x146a92686000, size=4294967296) error=1 "Operation not permitted"

[15] impi_shm_mbind_local(): mbind(p=0x15190567f000, size=4294967296) error=1 "Operation not permitted"

[16] impi_shm_mbind_local(): mbind(p=0x154205d8c000, size=4294967296) error=1 "Operation not permitted"

[17] impi_shm_mbind_local(): mbind(p=0x153a7dedb000, size=4294967296) error=1 "Operation not permitted"

[18] impi_shm_mbind_local(): mbind(p=0x145faf6bb000, size=4294967296) error=1 "Operation not permitted"

[19] impi_shm_mbind_local(): mbind(p=0x152839ca1000, size=4294967296) error=1 "Operation not permitted"

[21] impi_shm_mbind_local(): mbind(p=0x147daa1e6000, size=4294967296) error=1 "Operation not permitted"

[22] impi_shm_mbind_local(): mbind(p=0x146bed2d7000, size=4294967296) error=1 "Operation not permitted"

[23] impi_shm_mbind_local(): mbind(p=0x1500c5fd2000, size=4294967296) error=1 "Operation not permitted"

[24] impi_shm_mbind_local(): mbind(p=0x148aff1c1000, size=4294967296) error=1 "Operation not permitted"

[26] impi_shm_mbind_local(): mbind(p=0x14d5f5295000, size=4294967296) error=1 "Operation not permitted"

[28] impi_shm_mbind_local(): mbind(p=0x14796e2ae000, size=4294967296) error=1 "Operation not permitted"

[29] impi_shm_mbind_local(): mbind(p=0x14ba2d020000, size=4294967296) error=1 "Operation not permitted"

[30] impi_shm_mbind_local(): mbind(p=0x153999332000, size=4294967296) error=1 "Operation not permitted"

[0] mbind_interleave(): mbind(p=0x14a6a726c000, size=201981952) error=1 "Operation not permitted"

[0] mbind_interleave(): mbind(p=0x14a6b330c000, size=110047232) error=1 "Operation not permitted"

[1] mbind_interleave(): mbind(p=0x14d2d230b000, size=110047232) error=1 "Operation not permitted"

[2] mbind_interleave(): mbind(p=0x153c76bcd000, size=110047232) error=1 "Operation not permitted"

[3] mbind_interleave(): mbind(p=0x1553a7944000, size=110047232) error=1 "Operation not permitted"

[4] mbind_interleave(): mbind(p=0x153342113000, size=110047232) error=1 "Operation not permitted"

[5] mbind_interleave(): mbind(p=0x15454879e000, size=110047232) error=1 "Operation not permitted"

[6] mbind_interleave(): mbind(p=0x151ae809e000, size=110047232) error=1 "Operation not permitted"

[7] mbind_interleave(): mbind(p=0x14816e004000, size=110047232) error=1 "Operation not permitted"

[8] mbind_interleave(): mbind(p=0x146eab1f5000, size=110047232) error=1 "Operation not permitted"

[9] mbind_interleave(): mbind(p=0x150921e77000, size=110047232) error=1 "Operation not permitted"

[10] mbind_interleave(): mbind(p=0x14be93339000, size=110047232) error=1 "Operation not permitted"

[11] mbind_interleave(): mbind(p=0x153a7c43f000, size=110047232) error=1 "Operation not permitted"

[12] mbind_interleave(): mbind(p=0x14da18338000, size=110047232) error=1 "Operation not permitted"

[13] mbind_interleave(): mbind(p=0x153c20eb0000, size=110047232) error=1 "Operation not permitted"

[14] mbind_interleave(): mbind(p=0x147cfe749000, size=110047232) error=1 "Operation not permitted"

[15] mbind_interleave(): mbind(p=0x152a78035000, size=110047232) error=1 "Operation not permitted"

[16] mbind_interleave(): mbind(p=0x15527f035000, size=110047232) error=1 "Operation not permitted"

[17] mbind_interleave(): mbind(p=0x1549fda77000, size=110047232) error=1 "Operation not permitted"

[18] mbind_interleave(): mbind(p=0x146e35b4a000, size=110047232) error=1 "Operation not permitted"

[19] mbind_interleave(): mbind(p=0x1535c6a23000, size=110047232) error=1 "Operation not permitted"

[20] mbind_interleave(): mbind(p=0x15149e903000, size=110047232) error=1 "Operation not permitted"

[21] mbind_interleave(): mbind(p=0x14894414e000, size=110047232) error=1 "Operation not permitted"

[22] mbind_interleave(): mbind(p=0x14768db32000, size=110047232) error=1 "Operation not permitted"

[23] mbind_interleave(): mbind(p=0x150a6d120000, size=110047232) error=1 "Operation not permitted"

[24] mbind_interleave(): mbind(p=0x1493acc02000, size=110047232) error=1 "Operation not permitted"

[25] mbind_interleave(): mbind(p=0x15443668a000, size=110047232) error=1 "Operation not permitted"

[26] mbind_interleave(): mbind(p=0x14dcafebc000, size=110047232) error=1 "Operation not permitted"

[27] mbind_interleave(): mbind(p=0x14aecb7af000, size=110047232) error=1 "Operation not permitted"

[28] mbind_interleave(): mbind(p=0x147e360bb000, size=110047232) error=1 "Operation not permitted"

[29] mbind_interleave(): mbind(p=0x14bdfb720000, size=110047232) error=1 "Operation not permitted"

[30] mbind_interleave(): mbind(p=0x153c6e325000, size=110047232) error=1 "Operation not permitted"

[31] mbind_interleave(): mbind(p=0x14f3c5dac000, size=110047232) error=1 "Operation not permitted"

[0] MPI startup(): libfabric version: 1.13.2rc1-impi
Abort(2139535) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(176)........:
MPID_Init(1525)..............:
MPIDI_OFI_mpi_init_hook(1516):
open_fabric(2570)............:
find_provider(2692)..........: OFI fi_getinfo() failed (ofi_init.c:2692:find_provider:No data available)

ShivaniK_Intel
Moderator

Hi,


Thank you for your patience.


Could you please provide us with the output of the below commands?


$ df -h 


$ df -h /dev/shm


Could you please let us know whether the IMB run really used the same parameters? The IMB experiments shown in the community link are missing the submission script.


Could you run the program on a single node after removing all OFI provider-related variables (MPI might work better with defaults) and provide us with the output?


LSBATCH: User input
#!/bin/bash
#BSUB -J fhibench
#BSUB -n 32
#BSUB -q preempt
#BSUB -R rusage[mem=4G]
#BSUB -R span[block=32]
#BSUB -R "model == HPE_APOLLO2000_64"
#BSUB -R affinity[core(1):cpubind=core:membind=localonly:distribute=pack]
#BSUB -o fhibench.o%J
#BSUB -e fhibench.e%J
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export MKL_DYNAMIC=FALSE
###export UCX_TLS=sm,rc_mlx5,dc_mlx5,ud_mlx5,self
export LD_PRELOAD=$I_MPI_ROOT/lib/libmpi_shm_heap_proxy.so
export I_MPI_HYDRA_BOOTSTRAP=lsf
export I_MPI_HYDRA_RMK=lsf
export I_MPI_HYDRA_TOPOLIB=hwloc
###export I_MPI_HYDRA_IFACE=ib0
###export I_MPI_PLATFORM=clx-ap
###export I_MPI_FABRICS=shm:ofi # this is default anyhow!
###export I_MPI_SHM=clx-ap
export I_MPI_SHM_HEAP=1
###export I_MPI_OFI_PROVIDER=mlx
export I_MPI_PIN_CELL=core
export I_MPI_DEBUG=6
mpirun -n 32 /projects/site/gred/smpg/software/FHI-aims/bin/aims.220117.scalapack.mpi.x 2>&1 | tee FHIaims.out
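
When testing interactively rather than through an edited job script, the variables commented out above can be cleared in the current shell before calling mpirun. A minimal sketch, assuming a POSIX shell; the function name `clear_provider_vars` is made up for illustration, and the variable list is just the set disabled in the script above.

```shell
# Unset the fabric/provider-related variables that are commented out
# in the job script, so Intel MPI falls back to its defaults.
clear_provider_vars() {
    for var in UCX_TLS I_MPI_HYDRA_IFACE I_MPI_PLATFORM \
               I_MPI_FABRICS I_MPI_SHM I_MPI_OFI_PROVIDER; do
        unset "$var"
    done
}

clear_provider_vars

# Show any of them that are still set; prints nothing if all were cleared.
# The trailing "=" keeps I_MPI_SHM_HEAP from matching I_MPI_SHM.
env | grep -E '^(UCX_TLS|I_MPI_HYDRA_IFACE|I_MPI_PLATFORM|I_MPI_FABRICS|I_MPI_SHM|I_MPI_OFI_PROVIDER)=' || true
```

Variables that the script keeps, such as `I_MPI_SHM_HEAP` and `I_MPI_DEBUG`, are left untouched.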



Thanks & Regards

Shivani


Antonio_D
New Contributor I

From a representative compute node:

[dipasqua@ec-hub1-sc1 ~]$ ssh sc1nc076is02
Last login: Sat Aug 27 17:52:07 2022 from 10.164.24.27
[dipasqua@sc1nc076is02 ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        252G     0  252G   0% /dev
tmpfs           252G  2.8M  252G   1% /run
/dev/nvme0n1p2  400G   25G  376G   7% /
tmpfs           252G  2.8G  250G   2% /dev/shm
/dev/nvme0n1p1  100M     0  100M   0% /boot/efi
/dev/nvme0n1p3   20G   33M   20G   1% /var/tmp
/dev/nvme0n1p5  5.4T   37M  5.4T   1% /tmp
sc1groups        13P   11P  2.3P  83% /sc1/groups
apps            201T  7.1T  193T   4% /apps
homefs          201T   11T  190T   6% /gpfs/homefs
projectsfs01    401T  207T  194T  52% /projects
scratchfs01     142T   96T   46T  68% /gpfs/scratchfs01
datafs           12P  6.3P  5.5P  54% /gpfs/datafs
sc1             800T  715T   85T  90% /gpfs/sc1
tmpfs            51G  4.0K   51G   1% /run/user/718664
[dipasqua@sc1nc076is02 ~]$ df -h /dev/shm
Filesystem Size Used Avail Use% Mounted on
tmpfs 252G 2.8G 250G 2% /dev/shm
[dipasqua@sc1nc076is02 ~]$
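
If this check needs repeating across several nodes, the available space can be read in a script-friendly form. A small sketch, assuming a POSIX `df` and `awk`; `avail_kib` is a hypothetical helper name.

```shell
# avail_kib PATH: print the free space, in KiB, of the filesystem holding PATH.
avail_kib() {
    # -P forces the portable one-line-per-filesystem format; -k reports KiB.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Report /dev/shm only where it exists.
if [ -d /dev/shm ]; then
    avail_kib /dev/shm
fi
```

The printed number corresponds to the Avail column above, expressed in KiB rather than in human-readable units.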

 

See the attached file for the output of the IMB run, submitted through the LSF scheduler exactly like the program and without any of the OFI provider-related variables.

 

Running the program on a single node after removing all OFI provider-related variables:

 

[0] MPI startup(): Intel(R) MPI Library, Version 2021.8 Build 20221129 (id: 339ec755a1)
[0] MPI startup(): Copyright (C) 2003-2022 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] impi_shm_mbind_local(): mbind(p=0x1509ba02f000, size=4294967296) error=1 "Operation not permitted"

[1] impi_shm_mbind_local(): mbind(p=0x14edc88f2000, size=4294967296) error=1 "Operation not permitted"

[2] impi_shm_mbind_local(): mbind(p=0x14c10f3c1000, size=4294967296) error=1 "Operation not permitted"

[3] impi_shm_mbind_local(): mbind(p=0x152fb7d8b000, size=4294967296) error=1 "Operation not permitted"

[4] impi_shm_mbind_local(): mbind(p=0x14a2224c8000, size=4294967296) error=1 "Operation not permitted"

[5] impi_shm_mbind_local(): mbind(p=0x1474d0ce6000, size=4294967296) error=1 "Operation not permitted"

[6] impi_shm_mbind_local(): mbind(p=0x1480ad347000, size=4294967296) error=1 "Operation not permitted"

[7] impi_shm_mbind_local(): mbind(p=0x14b54530f000, size=4294967296) error=1 "Operation not permitted"

[8] impi_shm_mbind_local(): mbind(p=0x15193c8c7000, size=4294967296) error=1 "Operation not permitted"

[9] impi_shm_mbind_local(): mbind(p=0x1526d48f1000, size=4294967296) error=1 "Operation not permitted"

[10] impi_shm_mbind_local(): mbind(p=0x153aca16b000, size=4294967296) error=1 "Operation not permitted"

[11] impi_shm_mbind_local(): mbind(p=0x147bfac8d000, size=4294967296) error=1 "Operation not permitted"

[12] impi_shm_mbind_local(): mbind(p=0x153ba8db6000, size=4294967296) error=1 "Operation not permitted"

[13] impi_shm_mbind_local(): mbind(p=0x153ac7ab1000, size=4294967296) error=1 "Operation not permitted"

[14] impi_shm_mbind_local(): mbind(p=0x14c831fe4000, size=4294967296) error=1 "Operation not permitted"

[15] impi_shm_mbind_local(): mbind(p=0x14c6b7a39000, size=4294967296) error=1 "Operation not permitted"

[16] impi_shm_mbind_local(): mbind(p=0x14950ae28000, size=4294967296) error=1 "Operation not permitted"

[17] impi_shm_mbind_local(): mbind(p=0x151928dce000, size=4294967296) error=1 "Operation not permitted"

[18] impi_shm_mbind_local(): mbind(p=0x150a20dd6000, size=4294967296) error=1 "Operation not permitted"

[19] impi_shm_mbind_local(): mbind(p=0x144b2b493000, size=4294967296) error=1 "Operation not permitted"

[20] impi_shm_mbind_local(): mbind(p=0x1533bb12d000, size=4294967296) error=1 "Operation not permitted"

[21] impi_shm_mbind_local(): mbind(p=0x151c172eb000, size=4294967296) error=1 "Operation not permitted"

[22] impi_shm_mbind_local(): mbind(p=0x150ae80b3000, size=4294967296) error=1 "Operation not permitted"

[23] impi_shm_mbind_local(): mbind(p=0x147803da5000, size=4294967296) error=1 "Operation not permitted"

[24] impi_shm_mbind_local(): mbind(p=0x148732bff000, size=4294967296) error=1 "Operation not permitted"

[25] impi_shm_mbind_local(): mbind(p=0x147a26f35000, size=4294967296) error=1 "Operation not permitted"

[26] impi_shm_mbind_local(): mbind(p=0x15014f7b9000, size=4294967296) error=1 "Operation not permitted"

[27] impi_shm_mbind_local(): mbind(p=0x151a0b14f000, size=4294967296) error=1 "Operation not permitted"

[28] impi_shm_mbind_local(): mbind(p=0x14c1ba7ca000, size=4294967296) error=1 "Operation not permitted"

[29] impi_shm_mbind_local(): mbind(p=0x14748dc36000, size=4294967296) error=1 "Operation not permitted"

[30] impi_shm_mbind_local(): mbind(p=0x152665c36000, size=4294967296) error=1 "Operation not permitted"

[31] impi_shm_mbind_local(): mbind(p=0x147794583000, size=4294967296) error=1 "Operation not permitted"

[0] MPI startup(): libfabric version: 1.13.2rc1-impi
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(176)........:
MPID_Init(1525)..............:
MPIDI_OFI_mpi_init_hook(1516):
open_fabric(2573)............: OFI fi_getinfo() failed (ofi_init.c:2573:open_fabric:No data available)

------------------------------------------------------------
Sender: LSF System <lsfadmin@sc1nc097is10>
Subject: Job 19222504: <fhibench> in cluster <sc1> Done

Job <fhibench> was submitted from host <sc1nc076is02> by user <dipasqua> in cluster <sc1> at Wed Mar 15 09:20:24 2023
Job was executed on host(s) <32*sc1nc097is10>, in queue <short>, as user <dipasqua> in cluster <sc1> at Wed Mar 15 09:20:24 2023
</home/dipasqua> was used as the home directory.
</projects/site/gred/smpg/test/fhiaims_test> was used as the working directory.
Started at Wed Mar 15 09:20:24 2023
Terminated at Wed Mar 15 09:20:29 2023
Results reported at Wed Mar 15 09:20:29 2023

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
#!/bin/bash
#BSUB -J fhibench
#BSUB -n 32
#BSUB -q short
#BSUB -R rusage[mem=4G]
#BSUB -R span[hosts=1]
#BSUB -R affinity[core(1):cpubind=core:membind=localonly:distribute=pack]
#BSUB -o fhibench.o%J
#BSUB -e fhibench.e%J
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export MKL_DYNAMIC=FALSE
# export UCX_TLS=sm,rc_verbs,rc_mlx5_2,dc_verbs,dc_mlx5_2,ud_verbs,ud_mlx5_2,self
export LD_PRELOAD=$I_MPI_ROOT/lib/libmpi_shm_heap_proxy.so
export I_MPI_HYDRA_BOOTSTRAP=lsf
export I_MPI_HYDRA_RMK=lsf
export I_MPI_HYDRA_TOPOLIB=hwloc
# export I_MPI_HYDRA_IFACE=ib0
# export FI_SOCKETS_IFACE=ib0
# export FI_PROVIDER_PATH=/projects/site/gred/smpg/software/oneAPI/2023/mpi/2021.8.0/libfabric/lib/prov:/usr/lib64/
# export FI_PROVIDER=mlx
# export I_MPI_PLATFORM=clx-ap
# export I_MPI_EXTRA_FILESYSTEM=1
# export I_MPI_EXTRA_FILESYSTEM_FORCE=gpfs
# export I_MPI_FABRICS=shm:ofi
# export I_MPI_SHM=clx-ap
export I_MPI_SHM_HEAP=1
# export I_MPI_OFI_PROVIDER=mlx
export I_MPI_PIN_CELL=core
export I_MPI_DEBUG=6
# export FI_LOG_LEVEL=debug
mpirun -n 32 ./aims.x.openAPI2022.3 2>&1 | tee FHIaims.out
------------------------------------------------------------

Successfully completed.

Resource usage summary:

CPU time : 25.00 sec.
Max Memory : 1082 MB
Average Memory : 812.00 MB
Total Requested Memory : 131072.00 MB
Delta Memory : 129990.00 MB
Max Swap : -
Max Processes : 9
Max Threads : 11
Run time : 6 sec.
Turnaround time : 5 sec.

The output (if any) is above this job summary.




PS:

Read file <fhibench.e19222504> for stderr output of this job.

 

 

 

ShivaniK_Intel
Moderator


Hi,


Could you please build the aims.x.openAPI2022.3 code on a different machine with the interpreter in the expected location (/lib64/ld-linux-x86-64.so.2)?


Could you also please build IMB-MPI1 on the system where you build your executable and check whether IMB still runs?


To build IMB:

a. Copy $I_MPI_ROOT/benchmarks/imb to your home directory.

b. $ cd imb

c. $ export I_MPI_CXX=icpc
   $ export I_MPI_CC=icc

d. $ make IMB-MPI1

   This produces the IMB-MPI1 executable.

e. Check whether it uses the custom interpreter:

   $ file ./IMB-MPI1
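
Collected into one script, the steps above might look like the following. This is a sketch, not an official build recipe: it assumes `I_MPI_ROOT` is set by the oneAPI environment, defaults the copy destination to `$HOME/imb`, and adds a hypothetical `DRY_RUN` switch that prints the commands instead of running them.

```shell
# build_imb [DEST]: copy the IMB sources, build IMB-MPI1 with icc/icpc and
# report its ELF interpreter. Set DRY_RUN=1 to print the commands only.
# The body runs in a subshell so the caller's directory is not changed.
build_imb() (
    run=${DRY_RUN:+echo}
    src="${I_MPI_ROOT:?I_MPI_ROOT is not set}/benchmarks/imb"
    dest=${1:-$HOME/imb}

    export I_MPI_CXX=icpc I_MPI_CC=icc      # step c
    $run cp -r "$src" "$dest" || exit 1     # step a
    if [ -z "$run" ]; then
        cd "$dest" || exit 1                # step b
    else
        echo cd "$dest"
    fi
    $run make IMB-MPI1 || exit 1            # step d
    $run file ./IMB-MPI1                    # step e
)

# Dry run; the fallback path is a placeholder used only if I_MPI_ROOT is unset.
DRY_RUN=1 I_MPI_ROOT=${I_MPI_ROOT:-/opt/intel/oneapi/mpi/latest} build_imb
```

Dropping `DRY_RUN=1` runs the commands for real. Note that `cp -r` nests the copy inside the destination if that directory already exists.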


Thanks & Regards

Shivani


ShivaniK_Intel
Moderator

Hi,


As we did not hear back from you, could you please respond to my previous post?


Thanks & Regards

Shivani


Antonio_D
New Contributor I

I have been away from my email for the past week and will begin working on this today and get back to you.

 

Regards,

Antonio

Antonio_D
New Contributor I

Okay, I compiled IMB-MPI1 on my machine and it fails to run with the same errors as I see in my program:

 

[dipasqua@ec-hub1-sc1 imb]$ file ./IMB-MPI1
./IMB-MPI1: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /apps/rocs/2020.08/prefix//lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=41daf3009e47bbd1316486d31cc4a209c06a2964, with debug_info, not stripped
[dipasqua@ec-hub1-sc1 imb]$ ./IMB-MPI1
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(176)........:
MPID_Init(1525)..............:
MPIDI_OFI_mpi_init_hook(1516):
open_fabric(2573)............: OFI fi_getinfo() failed (ofi_init.c:2573:open_fabric:No data available)
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1615247
:
system msg for write_line failure : Bad file descriptor
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(176)........:
MPID_Init(1525)..............:
MPIDI_OFI_mpi_init_hook(1516):
open_fabric(2573)............: OFI fi_getinfo() failed (ofi_init.c:2573:open_fabric:No data available)
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1615247
:
system msg for write_line failure : Bad file descriptor
Segmentation fault (core dumped)
[dipasqua@ec-hub1-sc1 imb]$

It looks like it is using the custom interpreter (/apps/rocs/2020.08/prefix//lib64/ld-linux-x86-64.so.2) rather than the system one in /lib64, which may be the source of the issue.  I will play around with my modules and environment to see if I can fix this.
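
The interpreter comparison can be automated so that it runs after every build. A rough sketch: `interp_from_file_output` is a hypothetical helper that pulls the interpreter path out of `file` output on stdin, and the expected path is the one suggested earlier in this thread.

```shell
EXPECTED_INTERP=/lib64/ld-linux-x86-64.so.2

# Print the ELF interpreter named in `file` output read from stdin.
interp_from_file_output() {
    sed -n 's/.*, interpreter \([^,]*\),.*/\1/p'
}

# The line reported for IMB-MPI1 above, shortened to the relevant fields.
sample='./IMB-MPI1: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /apps/rocs/2020.08/prefix//lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, not stripped'

found=$(printf '%s\n' "$sample" | interp_from_file_output)
if [ "$found" = "$EXPECTED_INTERP" ]; then
    echo "interpreter ok: $found"
else
    echo "unexpected interpreter: $found"
fi
# prints: unexpected interpreter: /apps/rocs/2020.08/prefix//lib64/ld-linux-x86-64.so.2
```

Against a real binary the input would be piped in, for example `file ./IMB-MPI1 | interp_from_file_output`.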

 

Regards,

Antonio 

Antonio_D
New Contributor I

I figured out the solution to my problem.  The incorrect interpreter was being used, so I forced the correct interpreter by adding the following:

-dynamic-linker=/lib64/ld-linux-x86-64.so.2

to my FFLAGS and CFLAGS line before compiling.
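
In a build script this could look roughly like the following. It is a sketch under assumptions: the flag is copied verbatim from above, `append_flag` is a hypothetical helper, and whether the build system picks up `CFLAGS` and `FFLAGS` from the environment depends on the project.

```shell
SYSTEM_INTERP=/lib64/ld-linux-x86-64.so.2
INTERP_FLAG="-dynamic-linker=${SYSTEM_INTERP}"

# append_flag VALUE FLAG: print VALUE with FLAG appended, unless already present.
append_flag() {
    case " $1 " in
        *" $2 "*) printf '%s\n' "$1" ;;
        *)        printf '%s\n' "${1:+$1 }$2" ;;
    esac
}

CFLAGS=$(append_flag "${CFLAGS:-}" "$INTERP_FLAG")
FFLAGS=$(append_flag "${FFLAGS:-}" "$INTERP_FLAG")
export CFLAGS FFLAGS

echo "CFLAGS=$CFLAGS"
echo "FFLAGS=$FFLAGS"
```

After rebuilding, `file` on the new executable should report the /lib64 interpreter rather than the one under /apps.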

Thanks for all the help! 

ShivaniK_Intel
Moderator

Hi,


Thanks for accepting our solution. If you need any additional information, please post a new question, as this thread will no longer be monitored by Intel.


Thanks & Regards

Shivani

