Intel® MPI Library

Performance of MPI_Scatterv on Linux with oneAPI is much slower than with other MPI implementations

nhm

This is related to an issue I posted about earlier, but the answer there wasn't convincing. I have now compared the timings against several other MPI implementations.

https://community.intel.com/t5/Intel-HPC-Toolkit/MPI-Scatterv-using-MPI-data-types-is-much-slower-23-times/m-p/1494364#M10677

Here is the example code I used for the benchmark. After the code, I report the time the Scatterv operation took with different MPI implementations. Note that Intel MPI 2021.11 and 2021.9 on Linux take an order of magnitude longer than Intel MPI 2021.6 on Windows and the other MPI implementations. All results were obtained with 8 cores.
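
For reference, the build and run steps look roughly like this (the exact compiler wrapper and flags are assumptions about my setup; the non-Intel runs used the corresponding mpif90/mpiexec wrappers in the same way):

mpiifort -O2 test_scatterv_gatherv.f90 -o test_scatterv
mpirun -np 8 ./test_scatterv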

 

 

program test_scatterv_gatherv
  use mpi 
  implicit none 
  integer(kind=4), allocatable, dimension(:,:,:,:) :: global, local 
  integer, allocatable, dimension(:) :: scnts, displs
  integer :: m1, m2, m3, m4, m1_local, str_idx, end_idx 
  integer :: mpierr, rank, nprocs, ierr
  real(kind=8) :: str_time

  m1 = 1000; m2=100; m3=100; m4=100

  call mpi_init(mpierr)
  call mpi_comm_size(MPI_COMM_WORLD, nprocs, mpierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, mpierr)

  call domain_decompose(m1, rank, nprocs, str_idx, end_idx)
  m1_local = end_idx - str_idx + 1

  if(rank .eq. 0) then 
    allocate(global(m1, m2, m3, m4))
    global = 10
  endif

  allocate(local(0:m1_local + 1, m2, m3, m4))
  allocate(scnts(nprocs), displs(nprocs), stat=ierr)

  call MPI_Allgather(str_idx, 1, MPI_INTEGER, displs, 1, MPI_INTEGER, &
  MPI_COMM_WORLD, mpierr)
  call MPI_Allgather(m1_local, 1, MPI_INTEGER, scnts, 1, &
  MPI_INTEGER, MPI_COMM_WORLD, mpierr)
  displs = displs - 1
  ! print*, rank, m1_local, str_idx, end_idx, scnts, displs

  str_time = MPI_Wtime()
  call scatter4D_arr(m1_local, m1, m2, m3, m4, scnts, displs, global, &
  local, rank)
  print*, "Time taken by scatterv operation from rank", rank, "time: ", &
  MPI_Wtime() - str_time

  call MPI_Finalize(mpierr)

  contains 
  subroutine domain_decompose(npts, rank, size, sidx, eidx)
    implicit none

    integer, intent(in) :: npts, size, rank
    integer, intent(out) :: sidx, eidx
    integer :: pts_per_proc

    pts_per_proc = npts/size

    if(rank < mod(npts, size)) then
      pts_per_proc=pts_per_proc + 1
    end if

    if(rank < mod(npts, size)) then
      sidx = rank * pts_per_proc + 1
      eidx = (rank + 1) * pts_per_proc
    else
      sidx = mod(npts, size) + rank*pts_per_proc + 1
      eidx = mod(npts, size) + (rank + 1) * pts_per_proc
    end if
  end subroutine domain_decompose

  subroutine scatter4D_arr(n1_local, n1, n2, n3, n4, scounts, &
    displs, g_arr, l_arr, rank)
    implicit none
    integer, intent(in) :: n1_local, n1, n2, n3, n4, rank
    integer, intent(in) :: scounts(:), displs(:)
    integer(4), intent(in) :: g_arr(n1, n2, n3, n4)
    integer(4), intent(inout) :: l_arr(0:n1_local+1, n2, n3, n4)

    integer :: ssizes(4), s_ssizes(4), sstarts(4), &
    rsizes(4), r_ssizes(4), rstarts(4), stype, rtype, &
    resize_stype, resize_rtype
    integer(kind=MPI_ADDRESS_KIND) :: lb, extent

    ssizes = [n1, n2, n3, n4]
    s_ssizes = [1, n2, n3, n4]
    rsizes = [n1_local+2, n2, n3, n4]
    r_ssizes = [1, n2, n3, n4]
    sstarts = [0, 0, 0, 0]
    rstarts = [1, 0, 0, 0]

    call MPI_Type_get_extent(MPI_INTEGER4, lb, extent, mpierr)

    !create a mpi subarray data type for sending data
    call MPI_Type_create_subarray(4, ssizes, s_ssizes, sstarts, &
    MPI_ORDER_FORTRAN, MPI_INTEGER4, stype, mpierr)

    !resize the send subarray for starting at correct location for next send
    call MPI_Type_create_resized(stype, lb, extent, &
    resize_stype, mpierr)
    call MPI_Type_commit(resize_stype, mpierr)

    !create a mpi subarray data type for receiving data
    call MPI_Type_create_subarray(4, rsizes, r_ssizes, rstarts, &
    MPI_ORDER_FORTRAN, MPI_INTEGER4, rtype, mpierr)

    !resize the receive subarray for starting at correct location for next receive
    call MPI_Type_create_resized(rtype, lb, extent, &
    resize_rtype, mpierr)
    call MPI_Type_commit(resize_rtype, mpierr)

    call MPI_Scatterv(g_arr, scounts, displs, resize_stype, &
    l_arr, scounts(rank), resize_rtype, 0, MPI_COMM_WORLD, mpierr)

  end subroutine scatter4D_arr
end program test_scatterv_gatherv

 

 

Time taken by scatterv operation from rank  5   time:   141.779704116954
Time taken by scatterv operation from rank  1   time:   142.803899861436
Time taken by scatterv operation from rank  6   time:   142.937027430511
Time taken by scatterv operation from rank  7   time:   143.634875432414
Time taken by scatterv operation from rank  4   time:   144.668074302084
Time taken by scatterv operation from rank  2   time:   145.893039251067
Time taken by scatterv operation from rank  3   time:   146.873979726137
Time taken by scatterv operation from rank  0   time:   146.875087157910

Intel(R) MPI Library for Linux* OS, Version 2021.11 Build 20231005 (id: 74c4a23)

-------------------------------------------------------------------------------
Time taken by scatterv operation from rank  1   time:   143.974908194359
Time taken by scatterv operation from rank  5   time:   145.971094210050
Time taken by scatterv operation from rank  2   time:   147.082525824866
Time taken by scatterv operation from rank  7   time:   147.119157463982
Time taken by scatterv operation from rank  6   time:   147.346174288017
Time taken by scatterv operation from rank  3   time:   147.468452539848
Time taken by scatterv operation from rank  0   time:   147.473907662381
Time taken by scatterv operation from rank  4   time:   148.304317163012

Intel(R) MPI Library for Linux* OS, Version 2021.9 Build 20230307 (id: d82b3071db)

-------------------------------------------------------------------------------
Time taken by scatterv operation from rank  1   time:   7.1008950000396
Time taken by scatterv operation from rank  2   time:   9.83219929999905
Time taken by scatterv operation from rank  3   time:   11.1024133999890
Time taken by scatterv operation from rank  7   time:   12.0688114999793
Time taken by scatterv operation from rank  5   time:   12.2149914999900
Time taken by scatterv operation from rank  6   time:   12.3005766000133
Time taken by scatterv operation from rank  4   time:   13.3717440999753
Time taken by scatterv operation from rank  0   time:   13.5266269000131

Intel(R) MPI Library for Windows* OS, Version 2021.6 Build 20220227  

-------------------------------------------------------------------------------
Time taken by scatterv operation from rank  2   time:   14.251014514999952
Time taken by scatterv operation from rank  1   time:   14.723500717999968
Time taken by scatterv operation from rank  3   time:   14.723493416000110
Time taken by scatterv operation from rank  4   time:   14.726896067000098
Time taken by scatterv operation from rank  5   time:   14.726953725999920
Time taken by scatterv operation from rank  0   time:   14.728668044000187
Time taken by scatterv operation from rank  6   time:   14.728649945999905
Time taken by scatterv operation from rank  7   time:   14.728669964000119

mpich version 4.1.2

--------------------------------------------------------------------------------
Time taken by scatterv operation from rank  1   time:   2.6276495560014155
Time taken by scatterv operation from rank  2   time:   4.2920855869888328
Time taken by scatterv operation from rank  3   time:   6.6306728819909040
Time taken by scatterv operation from rank  4   time:   8.4632198559993412
Time taken by scatterv operation from rank  5   time:   9.9578672840143554
Time taken by scatterv operation from rank  6   time:   11.341454203007743
Time taken by scatterv operation from rank  7   time:   12.863722890004283
Time taken by scatterv operation from rank  0   time:   12.863704065995989

Open MPI 1.10.7

-------------------------------------------------------------------------------  
Time taken by scatterv operation from rank  2   time:   1.1417472362518311
Time taken by scatterv operation from rank  3   time:   2.3085207939147949
Time taken by scatterv operation from rank  4   time:   3.5863912105560303
Time taken by scatterv operation from rank  5   time:   4.8343582153320312
Time taken by scatterv operation from rank  6   time:   6.0975248813629150
Time taken by scatterv operation from rank  7   time:   7.3300142288208008
Time taken by scatterv operation from rank  0   time:   8.5495507717132568
Time taken by scatterv operation from rank  1   time:   8.5495731830596924

mpich version 3.0.4
TobiasK
Moderator

@nhm


I can only give you the same advice as in the old post: you are declaring your subarray type with a single element in the leading dimension, and that breaks performance. If you care about performance, create a temporary contiguous array, pack the data yourself, and unpack it after receiving it (see the sketch below). That is much faster than asking the library to figure out what you are trying to do.
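
To make that concrete, here is a minimal sketch of the pack/scatter/unpack approach. It is only an illustration, not code taken from your program: the subroutine name scatter4D_packed, the 0-based dummy bounds, and the dummy one-element send buffer on the non-root ranks are my own choices. It reuses the per-rank dim-1 counts and offsets you already gather (your scnts and displs arrays) and sends plain MPI_INTEGER4 data, so no derived datatypes are involved.

  !------------------------------------------------------------------
  ! Sketch: pack on the root, scatter contiguous data, unpack locally.
  ! cnts1/disp1 are the per-rank extents/offsets along dimension 1
  ! (the same values the original code stores in scnts and displs).
  subroutine scatter4D_packed(n1_local, n1, n2, n3, n4, cnts1, disp1, &
    g_arr, l_arr, rank, nprocs)
    use mpi
    implicit none
    integer, intent(in) :: n1_local, n1, n2, n3, n4, rank, nprocs
    integer, intent(in) :: cnts1(0:nprocs-1), disp1(0:nprocs-1)
    integer(4), intent(in) :: g_arr(:,:,:,:)        ! only read on rank 0
    integer(4), intent(inout) :: l_arr(0:n1_local+1, n2, n3, n4)

    integer(4), allocatable :: sendbuf(:), recvbuf(:)
    integer, allocatable :: scounts(:), sdispls(:)
    integer :: p, slab, ierr

    slab = n2*n3*n4                       ! elements per unit of dimension 1
    allocate(scounts(0:nprocs-1), sdispls(0:nprocs-1))
    scounts = cnts1 * slab                ! counts/displacements in whole elements
    sdispls = disp1 * slab

    if (rank == 0) then
      ! Pack: each rank's strided slab becomes one contiguous block of sendbuf.
      allocate(sendbuf(n1*slab))
      do p = 0, nprocs-1
        sendbuf(sdispls(p)+1 : sdispls(p)+scounts(p)) = &
          reshape(g_arr(disp1(p)+1 : disp1(p)+cnts1(p), :, :, :), [scounts(p)])
      end do
    else
      allocate(sendbuf(1))                ! never referenced on non-root ranks
    end if

    allocate(recvbuf(n1_local*slab))
    call MPI_Scatterv(sendbuf, scounts, sdispls, MPI_INTEGER4, &
      recvbuf, n1_local*slab, MPI_INTEGER4, 0, MPI_COMM_WORLD, ierr)

    ! Unpack into the interior of the halo-padded local array.
    l_arr(1:n1_local, :, :, :) = reshape(recvbuf, [n1_local, n2, n3, n4])

    deallocate(sendbuf, recvbuf, scounts, sdispls)
  end subroutine scatter4D_packed

Drop something like this into the contains section in place of scatter4D_arr and call it with the same scnts/displs arrays. The packing loop touches each element once, and the Scatterv then moves one large contiguous block per rank instead of millions of 4-byte strided pieces, which is the access pattern MPI handles well. On rank 0 the send buffer temporarily doubles the memory footprint; if that is a concern, the packing can be done one destination rank at a time.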

For what it's worth, the other implementations are also utterly slow: what you are doing is essentially a memory copy of roughly 3.7 GB (1000 x 100 x 100 x 100 elements x 4 bytes each), and even the fastest run above takes about 8 s to complete.


Additionally, there are bugs in your code. I spotted two, but -check_mpi is still reporting errors:

1)

integer, intent(in) :: scounts(:)

but you index scounts with rank, which starts at 0, while the lower bound of scounts is 1 in your case. A possible fix is shown right below.
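
Either of these changes would do (shown here only as a sketch):

    ! Option A: declare the dummies with a 0-based lower bound, so that
    ! scounts(rank) and displs(rank) refer to the calling rank's entries
    integer, intent(in) :: scounts(0:), displs(0:)

    ! Option B: keep the 1-based assumed-shape declaration and shift the index
    call MPI_Scatterv(g_arr, scounts, displs, resize_stype, &
    l_arr, scounts(rank+1), resize_rtype, 0, MPI_COMM_WORLD, mpierr)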

2)

integer(4), intent(in) :: g_arr(n1, n2, n3, n4)

g_arr is only allocated on rank 0; in Fortran you are not allowed to pass an unallocated allocatable array to a non-allocatable dummy argument, which is what happens here on every rank other than 0. One workaround is sketched below.
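
One workaround (just a sketch of what I mean, not the only option) is to allocate a zero-sized array on the non-root ranks and take the send buffer as an assumed-shape dummy, so the actual argument is always allocated and always conforms:

  ! In the main program: every rank now passes an allocated array
  if(rank .eq. 0) then
    allocate(global(m1, m2, m3, m4))
    global = 10
  else
    allocate(global(0, 0, 0, 0))   ! zero-sized; never read, only keeps the call legal
  endif

  ! In scatter4D_arr: assumed shape instead of explicit shape, so the small
  ! actual argument on the non-root ranks is allowed
  integer(4), intent(in) :: g_arr(:,:,:,:)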



Best

Tobias

