Intel® MPI Library

Performance of MPI_Scatterv on Linux with oneAPI is much slower than with other MPI implementations

nhm

This is related to an issue I posted about earlier, but the answer there wasn't convincing. I have now compared the timings against several other MPI implementations.

https://community.intel.com/t5/Intel-HPC-Toolkit/MPI-Scatterv-using-MPI-data-types-is-much-slower-23-times/m-p/1494364#M10677

Here is the example code I used for the benchmark. After the code, I report the time the Scatterv operation took with different MPI implementations. Note that Intel MPI 2021.11 and 2021.9 on Linux take an order of magnitude longer than Intel MPI 2021.6 on Windows and the other MPI implementations. All results were obtained with 8 cores.
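
For reference, the build and run steps look roughly like this (the exact compiler wrapper and flags are assumptions about my setup; the non-Intel runs used the corresponding mpif90/mpiexec wrappers in the same way):

mpiifort -O2 test_scatterv_gatherv.f90 -o test_scatterv
mpirun -np 8 ./test_scatterv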

 

 

program test_scatterv_gatherv
  use mpi 
  implicit none 
  integer(kind=4), allocatable, dimension(:,:,:,:) :: global, local 
  integer, allocatable, dimension(:) :: scnts, displs
  integer :: m1, m2, m3, m4, m1_local, str_idx, end_idx 
  integer :: mpierr, rank, nprocs, ierr
  real(kind=8) :: str_time

  m1 = 1000; m2=100; m3=100; m4=100

  call mpi_init(mpierr)
  call mpi_comm_size(MPI_COMM_WORLD, nprocs, mpierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, mpierr)

  call domain_decompose(m1, rank, nprocs, str_idx, end_idx)
  m1_local = end_idx - str_idx + 1

  if(rank .eq. 0) then 
    allocate(global(m1, m2, m3, m4))
    global = 10
  endif

  allocate(local(0:m1_local + 1, m2, m3, m4))
  allocate(scnts(nprocs), displs(nprocs), stat=ierr)

  call MPI_Allgather(str_idx, 1, MPI_INTEGER, displs, 1, MPI_INTEGER, &
  MPI_COMM_WORLD, mpierr)
  call MPI_Allgather(m1_local, 1, MPI_INTEGER, scnts, 1, &
  MPI_INTEGER, MPI_COMM_WORLD, mpierr)
  displs = displs - 1
  ! print*, rank, m1_local, str_idx, end_idx, scnts, displs

  str_time = MPI_Wtime()
  call scatter4D_arr(m1_local, m1, m2, m3, m4, scnts, displs, global, &
  local, rank)
  print*, "Time taken by scatterv operation from rank", rank, "time: ", &
  MPI_Wtime() - str_time

  call MPI_Finalize(mpierr)

  contains 
  subroutine domain_decompose(npts, rank, size, sidx, eidx)
    implicit none

    integer, intent(in) :: npts, size, rank
    integer, intent(out) :: sidx, eidx
    integer :: pts_per_proc

    pts_per_proc = npts/size

    if(rank < mod(npts, size)) then
      pts_per_proc=pts_per_proc + 1
    end if

    if(rank < mod(npts, size)) then
      sidx = rank * pts_per_proc + 1
      eidx = (rank + 1) * pts_per_proc
    else
      sidx = mod(npts, size) + rank*pts_per_proc + 1
      eidx = mod(npts, size) + (rank + 1) * pts_per_proc
    end if
  end subroutine domain_decompose

  subroutine scatter4D_arr(n1_local, n1, n2, n3, n4, scounts, &
    displs, g_arr, l_arr, rank)
    implicit none
    integer, intent(in) :: n1_local, n1, n2, n3, n4, rank
    integer, intent(in) :: scounts(:), displs(:)
    integer(4), intent(in) :: g_arr(n1, n2, n3, n4)
    integer(4), intent(inout) :: l_arr(0:n1_local+1, n2, n3, n4)

    integer :: ssizes(4), s_ssizes(4), sstarts(4), &
    rsizes(4), r_ssizes(4), rstarts(4), stype, rtype, &
    resize_stype, resize_rtype
    integer(kind=MPI_ADDRESS_KIND) :: lb, extent

    ssizes = [n1, n2, n3, n4]
    s_ssizes = [1, n2, n3, n4]
    rsizes = [n1_local+2, n2, n3, n4]
    r_ssizes = [1, n2, n3, n4]
    sstarts = [0, 0, 0, 0]
    rstarts = [1, 0, 0, 0]

    call MPI_Type_get_extent(MPI_INTEGER4, lb, extent, mpierr)

    !create a mpi subarray data type for sending data
    call MPI_Type_create_subarray(4, ssizes, s_ssizes, sstarts, &
    MPI_ORDER_FORTRAN, MPI_INTEGER4, stype, mpierr)

    !resize the send subarray for starting at correct location for next send
    call MPI_Type_create_resized(stype, lb, extent, &
    resize_stype, mpierr)
    call MPI_Type_commit(resize_stype, mpierr)

    !create a mpi subarray data type for receiving data
    call MPI_Type_create_subarray(4, rsizes, r_ssizes, rstarts, &
    MPI_ORDER_FORTRAN, MPI_INTEGER4, rtype, mpierr)

    !resize the receive subarray for starting at correct location for next receive
    call MPI_Type_create_resized(rtype, lb, extent, &
    resize_rtype, mpierr)
    call MPI_Type_commit(resize_rtype, mpierr)

    call MPI_Scatterv(g_arr, scounts, displs, resize_stype, &
    l_arr, scounts(rank), resize_rtype, 0, MPI_COMM_WORLD, mpierr)

  end subroutine scatter4D_arr
end program test_scatterv_gatherv

 

 

Time taken by scatterv operation from rank  5   time:   141.779704116954
Time taken by scatterv operation from rank  1   time:   142.803899861436
Time taken by scatterv operation from rank  6   time:   142.937027430511
Time taken by scatterv operation from rank  7   time:   143.634875432414
Time taken by scatterv operation from rank  4   time:   144.668074302084
Time taken by scatterv operation from rank  2   time:   145.893039251067
Time taken by scatterv operation from rank  3   time:   146.873979726137
Time taken by scatterv operation from rank  0   time:   146.875087157910

Intel(R) MPI Library for Linux* OS, Version 2021.11 Build 20231005 (id: 74c4a23)

-------------------------------------------------------------------------------
Time taken by scatterv operation from rank  1   time:   143.974908194359
Time taken by scatterv operation from rank  5   time:   145.971094210050
Time taken by scatterv operation from rank  2   time:   147.082525824866
Time taken by scatterv operation from rank  7   time:   147.119157463982
Time taken by scatterv operation from rank  6   time:   147.346174288017
Time taken by scatterv operation from rank  3   time:   147.468452539848
Time taken by scatterv operation from rank  0   time:   147.473907662381
Time taken by scatterv operation from rank  4   time:   148.304317163012

Intel(R) MPI Library for Linux* OS, Version 2021.9 Build 20230307 (id: d82b3071db)

-------------------------------------------------------------------------------
Time taken by scatterv operation from rank  1   time:   7.1008950000396
Time taken by scatterv operation from rank  2   time:   9.83219929999905
Time taken by scatterv operation from rank  3   time:   11.1024133999890
Time taken by scatterv operation from rank  7   time:   12.0688114999793
Time taken by scatterv operation from rank  5   time:   12.2149914999900
Time taken by scatterv operation from rank  6   time:   12.3005766000133
Time taken by scatterv operation from rank  4   time:   13.3717440999753
Time taken by scatterv operation from rank  0   time:   13.5266269000131

Intel(R) MPI Library for Windows* OS, Version 2021.6 Build 20220227  

-------------------------------------------------------------------------------
Time taken by scatterv operation from rank  2   time:   14.251014514999952
Time taken by scatterv operation from rank  1   time:   14.723500717999968
Time taken by scatterv operation from rank  3   time:   14.723493416000110
Time taken by scatterv operation from rank  4   time:   14.726896067000098
Time taken by scatterv operation from rank  5   time:   14.726953725999920
Time taken by scatterv operation from rank  0   time:   14.728668044000187
Time taken by scatterv operation from rank  6   time:   14.728649945999905
Time taken by scatterv operation from rank  7   time:   14.728669964000119

mpich version 4.1.2

--------------------------------------------------------------------------------
Time taken by scatterv operation from rank  1   time:   2.6276495560014155
Time taken by scatterv operation from rank  2   time:   4.2920855869888328
Time taken by scatterv operation from rank  3   time:   6.6306728819909040
Time taken by scatterv operation from rank  4   time:   8.4632198559993412
Time taken by scatterv operation from rank  5   time:   9.9578672840143554
Time taken by scatterv operation from rank  6   time:   11.341454203007743
Time taken by scatterv operation from rank  7   time:   12.863722890004283
Time taken by scatterv operation from rank  0   time:   12.863704065995989

Open MPI 1.10.7

-------------------------------------------------------------------------------  
Time taken by scatterv operation from rank  2   time:   1.1417472362518311
Time taken by scatterv operation from rank  3   time:   2.3085207939147949
Time taken by scatterv operation from rank  4   time:   3.5863912105560303
Time taken by scatterv operation from rank  5   time:   4.8343582153320312
Time taken by scatterv operation from rank  6   time:   6.0975248813629150
Time taken by scatterv operation from rank  7   time:   7.3300142288208008
Time taken by scatterv operation from rank  0   time:   8.5495507717132568
Time taken by scatterv operation from rank  1   time:   8.5495731830596924

mpich version 3.0.4
TobiasK
Moderator

@nhm


I can only give you the same advice as in the old post: you are declaring your subarray type with a single element in the leading dimension, and that breaks performance. If you care about performance, create a temporary contiguous array, pack the data yourself, and unpack it after receiving it (see the sketch below). That is much faster than asking the library to figure out what you are trying to do.
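
To make that concrete, here is a minimal sketch of the pack/scatter/unpack approach. It is only an illustration, not code taken from your program: the subroutine name scatter4D_packed, the 0-based dummy bounds, and the dummy one-element send buffer on the non-root ranks are my own choices. It reuses the per-rank dim-1 counts and offsets you already gather (your scnts and displs arrays) and sends plain MPI_INTEGER4 data, so no derived datatypes are involved.

  !------------------------------------------------------------------
  ! Sketch: pack on the root, scatter contiguous data, unpack locally.
  ! cnts1/disp1 are the per-rank extents/offsets along dimension 1
  ! (the same values the original code stores in scnts and displs).
  subroutine scatter4D_packed(n1_local, n1, n2, n3, n4, cnts1, disp1, &
    g_arr, l_arr, rank, nprocs)
    use mpi
    implicit none
    integer, intent(in) :: n1_local, n1, n2, n3, n4, rank, nprocs
    integer, intent(in) :: cnts1(0:nprocs-1), disp1(0:nprocs-1)
    integer(4), intent(in) :: g_arr(:,:,:,:)        ! only read on rank 0
    integer(4), intent(inout) :: l_arr(0:n1_local+1, n2, n3, n4)

    integer(4), allocatable :: sendbuf(:), recvbuf(:)
    integer, allocatable :: scounts(:), sdispls(:)
    integer :: p, slab, ierr

    slab = n2*n3*n4                       ! elements per unit of dimension 1
    allocate(scounts(0:nprocs-1), sdispls(0:nprocs-1))
    scounts = cnts1 * slab                ! counts/displacements in whole elements
    sdispls = disp1 * slab

    if (rank == 0) then
      ! Pack: each rank's strided slab becomes one contiguous block of sendbuf.
      allocate(sendbuf(n1*slab))
      do p = 0, nprocs-1
        sendbuf(sdispls(p)+1 : sdispls(p)+scounts(p)) = &
          reshape(g_arr(disp1(p)+1 : disp1(p)+cnts1(p), :, :, :), [scounts(p)])
      end do
    else
      allocate(sendbuf(1))                ! never referenced on non-root ranks
    end if

    allocate(recvbuf(n1_local*slab))
    call MPI_Scatterv(sendbuf, scounts, sdispls, MPI_INTEGER4, &
      recvbuf, n1_local*slab, MPI_INTEGER4, 0, MPI_COMM_WORLD, ierr)

    ! Unpack into the interior of the halo-padded local array.
    l_arr(1:n1_local, :, :, :) = reshape(recvbuf, [n1_local, n2, n3, n4])

    deallocate(sendbuf, recvbuf, scounts, sdispls)
  end subroutine scatter4D_packed

Drop something like this into the contains section in place of scatter4D_arr and call it with the same scnts/displs arrays. The packing loop touches each element once, and the Scatterv then moves one large contiguous block per rank instead of millions of 4-byte strided pieces, which is the access pattern MPI handles well. On rank 0 the send buffer temporarily doubles the memory footprint; if that is a concern, the packing can be done one destination rank at a time.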

For what it's worth, the other implementations are also utterly slow: what you are doing is essentially a memory copy of roughly 3.7 GB (1000 x 100 x 100 x 100 elements x 4 bytes each), and even the fastest run above takes about 8 s to complete.


Additionally, there are bugs in your code. I spotted two, but -check_mpi is still reporting errors:

1)

integer, intent(in) :: scounts(:)

but you index scounts with rank, which starts at 0, while the lower bound of scounts is 1 in your case. A possible fix is shown right below.
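
Either of these changes would do (shown here only as a sketch):

    ! Option A: declare the dummies with a 0-based lower bound, so that
    ! scounts(rank) and displs(rank) refer to the calling rank's entries
    integer, intent(in) :: scounts(0:), displs(0:)

    ! Option B: keep the 1-based assumed-shape declaration and shift the index
    call MPI_Scatterv(g_arr, scounts, displs, resize_stype, &
    l_arr, scounts(rank+1), resize_rtype, 0, MPI_COMM_WORLD, mpierr)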

2)

integer(4), intent(in) :: g_arr(n1, n2, n3, n4)

g_arr is only allocated on rank 0; in Fortran you are not allowed to pass an unallocated allocatable array to a non-allocatable dummy argument, which is what happens here on every rank other than 0. One workaround is sketched below.
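
One workaround (just a sketch of what I mean, not the only option) is to allocate a zero-sized array on the non-root ranks and take the send buffer as an assumed-shape dummy, so the actual argument is always allocated and always conforms:

  ! In the main program: every rank now passes an allocated array
  if(rank .eq. 0) then
    allocate(global(m1, m2, m3, m4))
    global = 10
  else
    allocate(global(0, 0, 0, 0))   ! zero-sized; never read, only keeps the call legal
  endif

  ! In scatter4D_arr: assumed shape instead of explicit shape, so the small
  ! actual argument on the non-root ranks is allowed
  integer(4), intent(in) :: g_arr(:,:,:,:)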



Best

Tobias

