Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI_Scatterv using MPI derived data types is much slower (23 times) than flattening the array and scattering it

nhm
Beginner

ifort version: (IFORT) 2021.8.0 20221119

1. I have a 1000 x 1000 x 1000 3D array to distribute among processes. In the first approach, I flatten the array and then scatter it among the processes; this takes about 7.86 s.

2. In the second approach, I use MPI derived data types to Scatterv the 3D array, and it takes about 165.34 s. However, Gatherv of the same data takes only about 14.24 s.

3. What could be the reason for this inconsistency? I expected Scatterv to take a similar amount of time as Gatherv.

Here is the code:

 

program ex_scatterv
  use mpi
  use iso_fortran_env, only : real64
  implicit none

  !allocatable arrays
  real(real64), allocatable, dimension(:,:,:) :: array, array_local
  real(real64), allocatable, dimension(:) :: array_flat, array_local_flat
  integer :: rank, num_procs, i, j, k
  integer :: nx, ny, nz, str_idx, end_idx, local_size, local_size_flat
  integer, dimension(:), allocatable :: sendcounts, displacements
  integer :: sizes(3), sub_sizes(3), starts(3), recv_starts(3), recv_sizes(3), &
             send_type, resize_send_type, recv_type, resize_recv_type
  integer(kind=MPI_ADDRESS_KIND) :: lb, extent, lb_resize
  real(real64) :: start_time
  integer :: mpierr

  call mpi_init(mpierr)
  call mpi_comm_size(mpi_comm_world, num_procs, mpierr)
  call mpi_comm_rank(mpi_comm_world, rank, mpierr)

  !size of array
  nx = 1000
  ny = 1000
  nz = 1000

  if (rank == 0) then
    if (num_procs > nx) then
      print *, "Number of procs should be less than or equal to first dimension of the array"
      call MPI_Abort(mpi_comm_world, 1, mpierr)
    endif
  endif

  start_time = MPI_Wtime()
  !allocate in the root rank
  if (rank == 0) then
    allocate(array(nx, ny, nz))
    allocate(array_flat(nx*ny*nz))
  else !for other procs allocate with zero size
    allocate(array(0, 0, 0))
  endif

  !assign values to the array
  if (rank == 0) then
    do k = 1, nz
      do j = 1, ny
        do i = 1, nx
          array(i, j, k) = (i-1) + (j-1)*nx + (k-1)*nx*ny
        end do
      end do
    end do
    !print*, "Before scattering..."
    !print*, array
    !flatten the 3D array
    forall(k=1:nz, j=1:ny, i=1:nx) array_flat(k+(j-1)*nz+(i-1)*ny*nz) = array(i, j, k)
  endif

  !distribute the 3D array among different procs
  call distribute_points(nx, rank, num_procs, str_idx, end_idx)
  local_size = end_idx - str_idx + 1
  local_size_flat = local_size*ny*nz

  !allocate local (per-rank) arrays
  allocate(array_local_flat(local_size_flat))
  allocate(array_local(local_size, ny, nz))

  !allocate sendcounts and displacements arrays for the scatter
  allocate(sendcounts(num_procs), displacements(num_procs))

  !gather displacements and sendcounts from all ranks
  call MPI_Allgather(str_idx, 1, MPI_INTEGER, displacements, 1, MPI_INTEGER, &
                     MPI_COMM_WORLD, mpierr)
  call MPI_Allgather(local_size, 1, MPI_INTEGER, sendcounts, 1, &
                     MPI_INTEGER, MPI_COMM_WORLD, mpierr)

  !total sendcounts and displacements
  sendcounts = sendcounts*ny*nz
  displacements = displacements - 1 !Array index starts with 0 in MPI (C)
  displacements = displacements*ny*nz

  !scatter the flattened array among procs
  call MPI_Scatterv(array_flat, sendcounts, displacements, MPI_DOUBLE_PRECISION, &
                    array_local_flat, local_size*ny*nz, MPI_DOUBLE_PRECISION, 0, &
                    MPI_COMM_WORLD, mpierr)

  !form 3D array from flattened local array
  forall(k=1:nz, j=1:ny, i=1:local_size) array_local(i, j, k) = &
      array_local_flat(k+(j-1)*nz+(i-1)*ny*nz)

  !print*, "Scattered array: ", rank
  !print*, array_local
  if (rank == 0) then
    print *, "Time taken by flatten and scatter: ", MPI_Wtime() - start_time
  endif

  call MPI_Barrier(mpi_comm_world, mpierr)
  !deallocate(array_flat, array_local_flat)

  start_time = MPI_Wtime()
  !Scattering using subarray type
  sizes = [nx, ny, nz]
  recv_sizes = [local_size, ny, nz]
  sub_sizes = [1, ny, nz]
  starts = [0, 0, 0]
  recv_starts = [0, 0, 0]

  !get the extent of MPI_DOUBLE_PRECISION
  call MPI_Type_get_extent(MPI_DOUBLE_PRECISION, lb, extent, mpierr)

  !create an MPI subarray data type for sending data
  call MPI_Type_create_subarray(3, sizes, sub_sizes, starts, &
                                MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, send_type, mpierr)
  lb_resize = 0
  !resize the send subarray so the next send starts at the correct location
  call MPI_Type_create_resized(send_type, lb_resize, extent, &
                               resize_send_type, mpierr)
  call MPI_Type_commit(resize_send_type, mpierr)

  !create an MPI subarray data type for receiving data
  call MPI_Type_create_subarray(3, recv_sizes, sub_sizes, recv_starts, &
                                MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, recv_type, mpierr)

  !resize the receive subarray so the next receive starts at the correct location
  call MPI_Type_create_resized(recv_type, lb_resize, extent, &
                               resize_recv_type, mpierr)
  call MPI_Type_commit(resize_recv_type, mpierr)

  !sendcounts and displacements for sending and receiving subarrays
  sendcounts = sendcounts/(ny*nz)
  displacements = displacements/(ny*nz)

  if (rank == 0) then
    print *, "Time taken for creating MPI type subarrays: ", MPI_Wtime() - start_time
  endif

  call MPI_Barrier(mpi_comm_world, mpierr)
  start_time = MPI_Wtime()
  !scatter the subarrays (recvcount is the scalar local_size, i.e. the number of
  !i-slabs this rank receives)
  call MPI_Scatterv(array, sendcounts, displacements, resize_send_type, &
                    array_local, local_size, resize_recv_type, 0, MPI_COMM_WORLD, mpierr)

  if (rank == 0) then
    print *, "Time taken for scattering using MPI type subarrays: ", MPI_Wtime() - start_time
  endif
  call MPI_Barrier(mpi_comm_world, mpierr)
  !print the scattered array
  !print*, "Scattered array with subarray: ", rank
  !print*, array_local

  !do some computations on the scattered local arrays
  array_local = array_local + 1

  call MPI_Barrier(mpi_comm_world, mpierr)
  start_time = MPI_Wtime()
  !gather the local arrays into the global array using the same subarray types
  call MPI_Gatherv(array_local, local_size, resize_recv_type, array, &
                   sendcounts, displacements, resize_send_type, 0, MPI_COMM_WORLD, mpierr)

  if (rank == 0) then
    print *, "Time taken by MPI_Type_create_subarray Gathering: ", MPI_Wtime() - start_time
  endif

  !if(rank==0) then
  !  print*, "Gathered array: ------------------"
  !  print*, array
  !endif
  call MPI_Finalize(mpierr)

contains

  subroutine distribute_points(npts, rank, size, start_idx, end_idx)
    implicit none

    integer, intent(in) :: npts, size, rank
    integer, intent(out) :: start_idx, end_idx
    integer :: pts_per_proc

    pts_per_proc = npts/size

    if (rank < mod(npts, size)) then
      pts_per_proc = pts_per_proc + 1
    end if

    if (rank < mod(npts, size)) then
      start_idx = rank*pts_per_proc + 1
      end_idx = (rank + 1)*pts_per_proc
    else
      start_idx = mod(npts, size) + rank*pts_per_proc + 1
      end_idx = mod(npts, size) + (rank + 1)*pts_per_proc
    end if

  end subroutine distribute_points

end program ex_scatterv

 

AishwaryaCV_Intel
Moderator

Hi, 

 

Thank you for posting in Intel Communities.

 

Could you please let us know the OS and hardware details of the machine you are using?


Thanks And Regards,

Aishwarya


nhm
Beginner

OS: Ubuntu 20.04 LTS (on WSL)

Hardware: Intel i7 CPU, 24 cores, 64GB RAM

AishwaryaCV_Intel
Moderator

Hi,


Thanks for providing the information. We are working on your issue and will get back to you soon.


Thanks And Regards,

Aishwarya



AishwaryaCV_Intel
Moderator

Hi,

 

Could you please let us know where you are running your code? Is it on a cluster?

 

I ask because I encountered a similar issue when running the code on the login node of a cluster (which is not recommended). However, when I assigned specific compute nodes and ran the code there, the timings were nearly identical.

 

Thanks And Regards,

Aishwarya


nhm
Beginner

Hi,

 

I am running on a laptop (Windows Subsystem for Linux).

 

Regards,

nhm

AishwaryaCV_Intel
Moderator

Hi,

 

If you intend to measure timing in an MPI application, it is essential to call MPI_Barrier before invoking any timing function. MPI programs inherently operate asynchronously, and even collective operations do not guarantee synchronization among ranks. In this specific situation, the absence of a barrier before measuring the end time can result in the root rank finishing, and stopping its timer, before the other ranks have completed.
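For example, the timing of a collective is usually bracketed like this (a standalone minimal sketch with a placeholder MPI_Bcast, not your code; the program and variable names are just illustrative):

program timing_pattern
  use mpi
  use iso_fortran_env, only : real64
  implicit none
  integer :: rank, num_procs, mpierr
  real(real64) :: t0
  real(real64), allocatable :: payload(:)

  call MPI_Init(mpierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, mpierr)
  call MPI_Comm_size(MPI_COMM_WORLD, num_procs, mpierr)

  allocate(payload(1000000))
  payload = real(rank, real64)

  !align all ranks before starting the clock
  call MPI_Barrier(MPI_COMM_WORLD, mpierr)
  t0 = MPI_Wtime()

  !the operation being timed (any collective goes here)
  call MPI_Bcast(payload, size(payload), MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, mpierr)

  !wait until the slowest rank has finished before reading the clock again
  call MPI_Barrier(MPI_COMM_WORLD, mpierr)
  if (rank == 0) print *, "elapsed (s): ", MPI_Wtime() - t0

  call MPI_Finalize(mpierr)
end program timing_pattern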

 

The performance issue with the MPI derived data type arises because the type is created with a 1000-element stride between its elements, so any operation that has to pack or unpack it in memory becomes significantly slower. When the network is involved, the impact may be somewhat mitigated, since network transfers are typically slower than memory operations on a single node.
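To see how sparse that send type is, you can create the same [1, ny, nz] subarray in isolation and compare how much data it carries with how far apart that data lies (a standalone sketch of mine using the sizes from your post; nothing is communicated):

program type_stride_demo
  !build the [1, ny, nz] send subarray of an (nx, ny, nz) Fortran array
  !with nx = ny = nz = 1000 and inspect it
  use mpi
  implicit none
  integer, parameter :: nx = 1000, ny = 1000, nz = 1000
  integer :: mpierr, subarray_type, type_size
  integer(kind=MPI_ADDRESS_KIND) :: lb, extent

  call MPI_Init(mpierr)

  call MPI_Type_create_subarray(3, [nx, ny, nz], [1, ny, nz], [0, 0, 0], &
       MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, subarray_type, mpierr)
  call MPI_Type_commit(subarray_type, mpierr)

  call MPI_Type_size(subarray_type, type_size, mpierr)        !bytes of actual data
  call MPI_Type_get_extent(subarray_type, lb, extent, mpierr) !span of the type in memory

  !the type carries ny*nz doubles (~8 MB), but consecutive doubles are
  !nx elements (8000 bytes) apart and the extent spans the whole 8 GB array,
  !so packing it is a long series of small strided copies
  print *, "data bytes:", type_size, "  extent bytes:", extent

  call MPI_Type_free(subarray_type, mpierr)
  call MPI_Finalize(mpierr)
end program type_stride_demo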

 

>>> What could be the reason for this inconsistency? I expected Scatterv to take similar time as Gatherv

Regarding the Scatterv/Gatherv timing difference: the first is a one-to-many operation and the second is a many-to-one operation. There is no guarantee that they perform equally.

 

To improve performance, consider the following approaches:

  1. Instead of the vector collectives, use plain scatter/gather operations, which tend to be better optimized.
  2. Avoid a derived data type whose first dimension holds a single element with a large stride.

I have attached modified source code (zip file) in which the first dimension is evenly divided by the available number of processes. This keeps the send count at 1 and eliminates the large stride between elements. Note that this approach puts an additional constraint on the number of processes and may require padded arrays. As a consequence, the vector versions become unnecessary, since the displacements and send counts are all equal; a minimal sketch of the idea is shown below.
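In case the attachment is not accessible, here is an illustrative standalone sketch of that idea (my own code, not the attached file); it assumes nx is evenly divisible by the number of ranks and reuses the array sizes from your post:

program ex_scatter_even
  use mpi
  use iso_fortran_env, only : real64
  implicit none
  integer, parameter :: nx = 1000, ny = 1000, nz = 1000
  real(real64), allocatable :: array(:,:,:), array_local(:,:,:)
  integer :: rank, num_procs, block_len, send_type, resized_send_type, mpierr
  integer(kind=MPI_ADDRESS_KIND) :: lb, dbl_extent

  call MPI_Init(mpierr)
  call MPI_Comm_size(MPI_COMM_WORLD, num_procs, mpierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, mpierr)

  if (mod(nx, num_procs) /= 0) then
    if (rank == 0) print *, "nx must be divisible by the number of ranks"
    call MPI_Abort(MPI_COMM_WORLD, 1, mpierr)
  end if
  block_len = nx/num_procs

  if (rank == 0) then
    allocate(array(nx, ny, nz))
    array = 1.0_real64
  else
    allocate(array(0, 0, 0))
  end if
  allocate(array_local(block_len, ny, nz))

  !one block = a [block_len, ny, nz] slab of the global array, resized so that
  !block p starts p*block_len elements into the first dimension
  call MPI_Type_get_extent(MPI_DOUBLE_PRECISION, lb, dbl_extent, mpierr)
  call MPI_Type_create_subarray(3, [nx, ny, nz], [block_len, ny, nz], [0, 0, 0], &
       MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, send_type, mpierr)
  call MPI_Type_create_resized(send_type, 0_MPI_ADDRESS_KIND, block_len*dbl_extent, &
       resized_send_type, mpierr)
  call MPI_Type_commit(resized_send_type, mpierr)

  !one resized subarray per rank, so plain MPI_Scatter replaces MPI_Scatterv;
  !the receive side is contiguous, so a plain type and count are enough
  call MPI_Scatter(array, 1, resized_send_type, &
                   array_local, block_len*ny*nz, MPI_DOUBLE_PRECISION, &
                   0, MPI_COMM_WORLD, mpierr)

  if (rank == 0) print *, "scatter done, local block shape:", shape(array_local)

  call MPI_Type_free(resized_send_type, mpierr)
  call MPI_Type_free(send_type, mpierr)
  call MPI_Finalize(mpierr)
end program ex_scatter_even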

 

Thanks And Regards,

Aishwarya

 

 

AishwaryaCV_Intel
Moderator

Hi,  


We haven't heard back from you. Could you please let us know whether your issue has been resolved? If so, please accept this reply as a solution.


Thanks And Regards,

Aishwarya


AishwaryaCV_Intel
Moderator

Hi,


We assume that your issue is resolved. If you need any additional information, please post a new question, as this thread will no longer be monitored by Intel.


Thanks and Regards,

Aishwarya


