I discovered a performance issue with RMA, described as follows:
When my window size exceeds 2 GB, the performance of MPI_PUT and MPI_GET becomes very low with provider=mlx (compared to verbs or psm3).
My test code is listed below:
module mpi_data
   ! shared MPI state: rank, process count, error code, and the RMA window handle
   integer :: rank, np, ierr, winData
end module

program main
   use mpi_data
   implicit none
   include 'mpif.h'
   ! init_mpi/finish_mpi in the original post are assumed wrappers;
   ! standard calls are substituted here so the listing compiles stand-alone
   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world, rank, ierr)
   call mpi_comm_size(mpi_comm_world, np, ierr)
   call mpi_main
   call mpi_finalize(ierr)
end program

subroutine mpi_main()
   use mpi_data
   implicit none
   include 'mpif.h'
   complex, allocatable :: cdata(:), data_tmp(:)
   integer*8 :: n8, s8
   integer(kind=MPI_ADDRESS_KIND) :: d8   ! the target displacement must be of MPI_ADDRESS_KIND
   integer :: repeat, i
   n8 = int(0.2d0*1024*1024*1024, 8)      ! ~2.1e8 complex elements, i.e. a ~1.7 GB window here;
                                          ! the 0.2 factor was presumably scaled up to exceed 2 GB
   repeat = 100000
   d8 = 0
   s8 = 1000
   if (rank == 2) then
      ! rank 2 exposes the large window; every other rank targets it (requires >= 3 ranks)
      allocate(cdata(n8), data_tmp(s8), STAT=ierr)
      cdata(1:n8) = 0.0
      call MPI_Win_create(cdata, int(8*n8, MPI_ADDRESS_KIND), 8, MPI_INFO_NULL, mpi_comm_world, winData, ierr)
   else
      allocate(cdata(1), data_tmp(s8), STAT=ierr)
      call MPI_Win_create(cdata, int(8, MPI_ADDRESS_KIND), 8, MPI_INFO_NULL, mpi_comm_world, winData, ierr)
   end if
   ! fence epochs bracket the loop; the loop itself uses passive-target lock/unlock
   call MPI_Win_fence(0, winData, ierr)
   do i = 1, repeat
      write(*,*) i, repeat
      ! cycle through get / put / accumulate, each in its own passive-target epoch
      if (mod(i,3) == 0) call data_gpa(data_tmp, d8, s8, 0)
      if (mod(i,3) == 1) call data_gpa(data_tmp, d8, s8, 1)
      if (mod(i,3) == 2) call data_gpa(data_tmp, d8, s8, 2)
   end do
   call MPI_Win_fence(0, winData, ierr)
   deallocate(cdata, data_tmp)
   call MPI_Win_free(winData, ierr)
end subroutine

subroutine data_gpa(data_tmp, d8, s8, type0)
   ! one lock/op/unlock epoch targeting rank 2: type0 = 0 get, 1 put, 2 accumulate
   use mpi_data
   implicit none
   include 'mpif.h'
   integer :: type0
   integer*8 :: s8
   integer(kind=MPI_ADDRESS_KIND) :: d8
   complex :: data_tmp(*)
   if (rank .ne. 2) then
      if (type0 == 0) call MPI_Win_lock(MPI_LOCK_SHARED, 2, 0, winData, ierr)
      if (type0 == 1) call MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 2, 0, winData, ierr)
      if (type0 == 2) call MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 2, 0, winData, ierr)
      ! counts are default integers in these Fortran bindings, hence int(s8)
      if (type0 == 0) call mpi_get(data_tmp, int(s8), MPI_COMPLEX, 2, d8, int(s8), MPI_COMPLEX, winData, ierr)
      if (type0 == 1) call mpi_put(data_tmp, int(s8), MPI_COMPLEX, 2, d8, int(s8), MPI_COMPLEX, winData, ierr)
      if (type0 == 2) call mpi_accumulate(data_tmp, int(s8), MPI_COMPLEX, 2, d8, int(s8), MPI_COMPLEX, mpi_sum, winData, ierr)
      call MPI_Win_unlock(2, winData, ierr)
   end if
end subroutine
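For reference, here is a minimal, hypothetical sketch of how a loop like the one above could be instrumented with MPI_Wtime to quantify the per-provider difference (this mirrors the timing approach described later in the thread; the program name time_rma_get and the parameters nrep and cnt are illustrative, not part of the original reproducer):

program time_rma_get
   implicit none
   include 'mpif.h'
   integer :: rank, np, ierr, win, i
   integer, parameter :: nrep = 10000, cnt = 1000
   integer(kind=MPI_ADDRESS_KIND) :: winsize, disp
   complex, allocatable :: base(:), buf(:)
   double precision :: t0, t1

   call mpi_init(ierr)
   call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
   call mpi_comm_size(MPI_COMM_WORLD, np, ierr)   ! run with at least 2 ranks

   ! rank 0 exposes a small window; all other ranks read from it
   allocate(base(cnt), buf(cnt))
   base = (0.0, 0.0)
   winsize = 8_MPI_ADDRESS_KIND*cnt               ! 8 bytes per default complex
   call mpi_win_create(base, winsize, 8, MPI_INFO_NULL, MPI_COMM_WORLD, win, ierr)

   disp = 0
   call mpi_barrier(MPI_COMM_WORLD, ierr)
   t0 = mpi_wtime()
   if (rank /= 0) then
      ! same access pattern as the reproducer: one lock/get/unlock epoch per iteration
      do i = 1, nrep
         call mpi_win_lock(MPI_LOCK_SHARED, 0, 0, win, ierr)
         call mpi_get(buf, cnt, MPI_COMPLEX, 0, disp, cnt, MPI_COMPLEX, win, ierr)
         call mpi_win_unlock(0, win, ierr)
      end do
   end if
   call mpi_barrier(MPI_COMM_WORLD, ierr)
   t1 = mpi_wtime()

   if (rank == 1) write(*,'(a,f10.4,a)') 'elapsed: ', t1 - t0, ' s'

   call mpi_win_free(win, ierr)
   call mpi_finalize(ierr)
end program

Running the same binary once per FI_PROVIDER setting (mlx, verbs, psm3) and comparing the elapsed times would isolate the provider effect from any other variable.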
Hi,
Thank you for posting in Intel Community.
Could you please provide the following details so that we can reproduce the issue at our end:
- OS and hardware details.
- CPU details.
- Intel MPI version.
- Compiler used to build the test code.
- Steps followed to run the test code.
- The method you used to assess the performance of mlx in comparison to the other providers.
Thanks and regards,
Aishwarya
OS: CentOS 7.6

#run.sh
export UCX_NET_DEVICES=mlx5_0:1
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=verbs    # switched to mlx / psm3 in the other run scripts
mpirun -np 120 -machinefile ./host9_11 ./main

host9_11:
comput9
comput10
comput11

Compiler: Intel 2021.3.0
MPI: Intel MPI 2021.10.0 or Intel MPI 2021.3.0
Compiler options:
ifort_flags=-g -Wall -O3 -fp-model precise -qopenmp -c
lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7543 32-Core Processor
Stepping: 1
CPU MHz: 2800.000
CPU max MHz: 2800.0000
CPU min MHz: 1500.0000
BogoMIPS: 5600.05
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
NUMA node4 CPU(s): 32-39
NUMA node5 CPU(s): 40-47
NUMA node6 CPU(s): 48-55
NUMA node7 CPU(s): 56-63
Hi,
We used the code you provided and added MPI_Wtime() calls to obtain timings for comparing performance between the different providers. Please find the attached zip file for the code.
We compiled the code with Intel MPI 2021.10 as follows:
mpiifort -g -Wall -O3 -fp-model precise -qopenmp mpi_putget.f90 -o putget_new.out
bash run1.sh
Could you please let us know whether you followed the same steps and method to compare performance between the providers? If not, please let us know the method you used for the comparison.
Thanks and regards,
Aishwarya
I used the three scripts in the attachment (run_with_mlx.rar, run_with_verbs.rar, run_with_psm3.rar) for the performance comparison. At large process counts (>=100 or >=300), there is a huge performance difference between the three scripts.
The hosts are listed below.
Hi,
We have informed the concerned team about your issue and are working on it internally. We will get back to you soon.
Thanks and regards,
Aishwarya