I discovered a performance issue with RMA, described as follows:
When my window size exceeds 2 GB, the performance of MPI_PUT and MPI_GET becomes very low with provider=mlx (compared to verbs or psm3).
My test code is listed below:
module mpi_data
   ! shared MPI state: rank, process count, error code, and the RMA window handle
   integer :: rank, np, ierr, winData
end module

program main
   use mpi_data
   implicit none
   include 'mpif.h'
   ! init_mpi/finish_mpi in the original post are assumed wrappers;
   ! standard calls are substituted here so the listing compiles stand-alone
   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world, rank, ierr)
   call mpi_comm_size(mpi_comm_world, np, ierr)
   call mpi_main
   call mpi_finalize(ierr)
end program

subroutine mpi_main()
   use mpi_data
   implicit none
   include 'mpif.h'
   complex, allocatable :: cdata(:), data_tmp(:)
   integer*8 :: n8, s8
   integer(kind=MPI_ADDRESS_KIND) :: d8   ! the target displacement must be of MPI_ADDRESS_KIND
   integer :: repeat, i
   n8 = int(0.2d0*1024*1024*1024, 8)      ! ~2.1e8 complex elements, i.e. a ~1.7 GB window here;
                                          ! the 0.2 factor was presumably scaled up to exceed 2 GB
   repeat = 100000
   d8 = 0
   s8 = 1000
   if (rank == 2) then
      ! rank 2 exposes the large window; every other rank targets it (requires >= 3 ranks)
      allocate(cdata(n8), data_tmp(s8), STAT=ierr)
      cdata(1:n8) = 0.0
      call MPI_Win_create(cdata, int(8*n8, MPI_ADDRESS_KIND), 8, MPI_INFO_NULL, mpi_comm_world, winData, ierr)
   else
      allocate(cdata(1), data_tmp(s8), STAT=ierr)
      call MPI_Win_create(cdata, int(8, MPI_ADDRESS_KIND), 8, MPI_INFO_NULL, mpi_comm_world, winData, ierr)
   end if
   ! fence epochs bracket the loop; the loop itself uses passive-target lock/unlock
   call MPI_Win_fence(0, winData, ierr)
   do i = 1, repeat
      write(*,*) i, repeat
      ! cycle through get / put / accumulate, each in its own passive-target epoch
      if (mod(i,3) == 0) call data_gpa(data_tmp, d8, s8, 0)
      if (mod(i,3) == 1) call data_gpa(data_tmp, d8, s8, 1)
      if (mod(i,3) == 2) call data_gpa(data_tmp, d8, s8, 2)
   end do
   call MPI_Win_fence(0, winData, ierr)
   deallocate(cdata, data_tmp)
   call MPI_Win_free(winData, ierr)
end subroutine

subroutine data_gpa(data_tmp, d8, s8, type0)
   ! one lock/op/unlock epoch targeting rank 2: type0 = 0 get, 1 put, 2 accumulate
   use mpi_data
   implicit none
   include 'mpif.h'
   integer :: type0
   integer*8 :: s8
   integer(kind=MPI_ADDRESS_KIND) :: d8
   complex :: data_tmp(*)
   if (rank .ne. 2) then
      if (type0 == 0) call MPI_Win_lock(MPI_LOCK_SHARED, 2, 0, winData, ierr)
      if (type0 == 1) call MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 2, 0, winData, ierr)
      if (type0 == 2) call MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 2, 0, winData, ierr)
      ! counts are default integers in these Fortran bindings, hence int(s8)
      if (type0 == 0) call mpi_get(data_tmp, int(s8), MPI_COMPLEX, 2, d8, int(s8), MPI_COMPLEX, winData, ierr)
      if (type0 == 1) call mpi_put(data_tmp, int(s8), MPI_COMPLEX, 2, d8, int(s8), MPI_COMPLEX, winData, ierr)
      if (type0 == 2) call mpi_accumulate(data_tmp, int(s8), MPI_COMPLEX, 2, d8, int(s8), MPI_COMPLEX, mpi_sum, winData, ierr)
      call MPI_Win_unlock(2, winData, ierr)
   end if
end subroutine
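For reference, here is a minimal, hypothetical sketch of how a loop like the one above could be instrumented with MPI_Wtime to quantify the per-provider difference (this mirrors the timing approach described later in the thread; the program name time_rma_get and the parameters nrep and cnt are illustrative, not part of the original reproducer):

program time_rma_get
   implicit none
   include 'mpif.h'
   integer :: rank, np, ierr, win, i
   integer, parameter :: nrep = 10000, cnt = 1000
   integer(kind=MPI_ADDRESS_KIND) :: winsize, disp
   complex, allocatable :: base(:), buf(:)
   double precision :: t0, t1

   call mpi_init(ierr)
   call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
   call mpi_comm_size(MPI_COMM_WORLD, np, ierr)   ! run with at least 2 ranks

   ! rank 0 exposes a small window; all other ranks read from it
   allocate(base(cnt), buf(cnt))
   base = (0.0, 0.0)
   winsize = 8_MPI_ADDRESS_KIND*cnt               ! 8 bytes per default complex
   call mpi_win_create(base, winsize, 8, MPI_INFO_NULL, MPI_COMM_WORLD, win, ierr)

   disp = 0
   call mpi_barrier(MPI_COMM_WORLD, ierr)
   t0 = mpi_wtime()
   if (rank /= 0) then
      ! same access pattern as the reproducer: one lock/get/unlock epoch per iteration
      do i = 1, nrep
         call mpi_win_lock(MPI_LOCK_SHARED, 0, 0, win, ierr)
         call mpi_get(buf, cnt, MPI_COMPLEX, 0, disp, cnt, MPI_COMPLEX, win, ierr)
         call mpi_win_unlock(0, win, ierr)
      end do
   end if
   call mpi_barrier(MPI_COMM_WORLD, ierr)
   t1 = mpi_wtime()

   if (rank == 1) write(*,'(a,f10.4,a)') 'elapsed: ', t1 - t0, ' s'

   call mpi_win_free(win, ierr)
   call mpi_finalize(ierr)
end program

Running the same binary once per FI_PROVIDER setting (mlx, verbs, psm3) and comparing the elapsed times would isolate the provider effect from any other variable.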
Hi,
Thank you for posting in Intel Community.
Could you please provide the following details so that we can reproduce the issue at our end:
- OS and hardware details.
- CPU details.
- Intel MPI version.
- Compiler used to build the test code.
- Steps followed to run the test code.
- The method you used to assess the performance of mlx in comparison to the other providers.
Thanks and regards,
Aishwarya
OS: CentOS 7.6

#run.sh
export UCX_NET_DEVICES=mlx5_0:1
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=verbs    # switched to mlx / psm3 in the other run scripts
mpirun -np 120 -machinefile ./host9_11 ./main

host9_11:
comput9
comput10
comput11

Compiler: Intel 2021.3.0
MPI: Intel MPI 2021.10.0 or Intel MPI 2021.3.0
Compiler options:
ifort_flags=-g -Wall -O3 -fp-model precise -qopenmp -c
lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7543 32-Core Processor
Stepping: 1
CPU MHz: 2800.000
CPU max MHz: 2800.0000
CPU min MHz: 1500.0000
BogoMIPS: 5600.05
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
NUMA node4 CPU(s): 32-39
NUMA node5 CPU(s): 40-47
NUMA node6 CPU(s): 48-55
NUMA node7 CPU(s): 56-63
Hi,
We used the code you provided and added MPI_Wtime() calls to obtain timings for comparing performance between the different providers. Please find the attached zip file for the code.
We compiled the code with Intel MPI 2021.10 as follows:
mpiifort -g -Wall -O3 -fp-model precise -qopenmp mpi_putget.f90 -o putget_new.out
bash run1.sh
Could you please let us know whether you followed the same steps and method to compare performance between the providers? If not, please let us know the method you used for the comparison.
Thanks and regards,
Aishwarya
I used the three scripts in the attachment (run_with_mlx.rar, run_with_verbs.rar, run_with_psm3.rar) for the performance comparison. At large process counts (>=100 or >=300), there is a huge performance difference between the three scripts.
The hosts are listed below.
Hi,
We have informed the concerned team about your issue and are working on it internally. We will get back to you soon.
Thanks and regards,
Aishwarya