Poor coarray performance for halo exchange

NCarlson · ‎04-13-2022

I have a package (index-map) that handles the parallel halo exchange operation associated with domain decomposition methods for PDEs. I've recently written an alternative implementation that uses coarrays instead of MPI, and I am finding that it performs very poorly compared to MPI.

I'm using an example program from that package to test performance. The example solves the heat equation on the unit disk using a finite volume (FV) discretization and explicit forward Euler time stepping. A time step consists of a parallel halo exchange to update boundary unknowns with the values from other processes that own the unknowns, followed by a process-local computation to advance the local unknowns. I'm timing the time step loop over thousands of time steps to get an average time per time step. Here are some sample times (usec) using 4 processes (MPI ranks or coarray images) on a shared-memory workstation (which has 12 cores), using the NAG 7.1 compiler and the Intel 2021.5.0 compiler:

Compiler	MPI (usec)	Coarrays (usec)
NAG	15	20
Intel	24	43000

Both compilers are using the same code. On a problem 4 times larger, the NAG times are 37 (MPI) and 34 (coarrays), but the Intel coarray executable eventually segfaults after a very long time. Watching "top" shows that the hydra_pmi_proxy process is using a significant amount of CPU time, and its memory usage grows throughout the run, from a small initial value of about 5 MB to over 5 GB. Memory usage of the individual program images remains small and constant. By comparison, gfortran with OpenCoarrays also uses MPICH (I understand Intel MPI is derived from MPICH), and in its runs (which complete successfully) the memory usage of the hydra_pmi_proxy process remains small and constant throughout.

I'd be interested to know if anyone can reproduce my findings. Full details are in the link above; the test I'm running is disk-fv-parallel.

NCarlson · ‎04-13-2022

Here are some further details for those that are interested but don't want to hunt through the code. The loop being timed is this:

  call system_clock(t1)  
  do step = 1, nstep
    call cell_map%gather_offp(u_local(1:)) ! THE HALO EXCHANGE
    u_prev = u_local
    do j = 1, cell_map%onp_size
      !u_local(j) = u_prev(j) + c*(sum(u_prev(cnhbr_local(:,j))) - 4*u_prev(j))
      block
        integer :: k
        real(r8) :: tmp
        tmp = -4*u_prev(j)
        do k = 1, size(cnhbr,1)
          tmp = tmp + u_prev(cnhbr_local(k,j))
        end do
        u_local(j) = u_prev(j) + c*tmp
      end block
    end do
  end do
  call system_clock(t2, rate)
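
For anyone repeating the measurement, the conversion from clock ticks to an average time per step can be sketched as below. This is a standalone illustration, not code from index-map: the program name, the step count, and the dummy loop body standing in for the halo exchange and update are all placeholders.

```fortran
program time_per_step_sketch
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: nstep = 1000       ! placeholder step count
  integer(int64) :: t1, t2, rate
  integer :: step
  real(real64) :: x, usec_per_step

  x = 0
  call system_clock(t1)
  do step = 1, nstep
    ! stand-in for the halo exchange plus the local update
    x = x + sqrt(real(step, real64))
  end do
  call system_clock(t2, rate)

  ! ticks -> seconds -> microseconds, averaged over the steps
  usec_per_step = 1.0e6_real64 * real(t2 - t1, real64) / real(rate, real64) / nstep

  print '(a,es12.4)', 'avg usec/step = ', usec_per_step
  print '(a,es12.4)', 'checksum      = ', x   ! keeps the loop from being optimized away
end program
```

Using 64-bit integers for the clock arguments generally gives a finer tick resolution than default integers, which matters when a step takes only tens of microseconds.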

The call to gather_offp is the halo exchange. The procedure is effectively this (see this and this):

  module subroutine gather_offp(this, local_data)
    class(index_map), intent(in) :: this
    real(r8), intent(inout) :: local_data(:)
    call gath2_r8_1(this, local_data(:this%onp_size), local_data(this%onp_size+1:))
  end subroutine

  module subroutine gath2_r8_1(this, onp_data, offp_data)
    class(index_map), intent(in) :: this
    real(r8), intent(in) :: onp_data(:)
    real(r8), intent(inout), target :: offp_data(:)

    integer :: j, k, n

    type box
      real(r8), pointer :: data(:)
    end type
    type(box), allocatable :: offp[:]

    ASSERT(size(onp_data) >= this%onp_size)
    ASSERT(size(offp_data) >= this%offp_size)

    if (.not.allocated(this%offp_index)) return

    allocate(offp[*])
    offp%data => offp_data
    sync all

    n = 0
    do j = 1, size(this%onp_count)
      associate (i => this%onp_image(j), offset => this%offp_offset(j))
        do k = 1, this%onp_count(j)
          n = n + 1
          offp[i]%data(offset+k) = onp_data(this%onp_index(n))
        end do
      end associate
    end do

  end subroutine
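
A stripped-down, standalone version of this pattern might look like the following, for anyone who wants to experiment without building index-map. It is only a sketch under stated assumptions: the program name, the ring-neighbor choice, and the halo size are made up for illustration and are not taken from the package. Unlike the procedure above, it adds an explicit sync all after the puts; in the procedure that role is played by the implicit deallocation of the coarray on return.

```fortran
program box_put_sketch
  implicit none
  integer, parameter :: nhalo = 4          ! made-up halo size
  type box
    real, pointer :: data(:) => null()
  end type
  type(box), allocatable :: offp[:]
  real, target :: halo(nhalo)
  integer :: me, np, left, right, k

  me = this_image()
  np = num_images()
  right = merge(1, me + 1, me == np)       ! ring neighbors, for illustration only
  left  = merge(np, me - 1, me == 1)
  halo = -1.0

  allocate(offp[*])                        ! coarray allocation synchronizes all images
  offp%data => halo
  sync all                                 ! every pointer is set before any remote write

  do k = 1, nhalo
    offp[right]%data(k) = real(me)         ! element-by-element put through the remote pointer
  end do

  sync all                                 ! all puts are complete before the data is read
  if (any(halo /= real(left))) error stop 'halo mismatch'
  if (me == 1) print *, 'halo exchange check passed'
  deallocate(offp)                         ! coarray deallocation also synchronizes
end program
```

Each image writes its own image index into its right neighbor's array, so after the second sync all every image should see its left neighbor's index. Wrapping the allocate-through-deallocate portion in a timed loop would mimic one gather_offp call per step, including the two collective synchronizations that come with allocating and deallocating the coarray.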

Ron_Green · ‎04-13-2022

Our work on coarrays to date has focused on functionality, not performance. Performance is on our list of things to work on in the future, after we get IFX to a competitive state by the end of this year.

We hope you will see coarrays coming to IFX soon, but in the same state as in IFORT.

Now, if you think you have a good MPICH, you can bind your Intel CAF program with your MPICH instead of Intel MPI:

READ THIS for how to use another MPICH with Intel CAF

NCarlson · ‎04-15-2022

@Ron_Green wrote:

[...] you can bind your Intel CAF program with your MPICH instead of Intel MPI

I tried this, but no luck. It runs for a while but ultimately segfaults too, though seemingly at a different place:

    type box
      real(r8), pointer :: data(:)
    end type
    type(box), allocatable :: offp[:]
    allocate(offp[*])
    [...]
    deallocate(offp) ! SEGFAULTS HERE, OR AT NEXT STATEMENT IF OMITTED
  end subroutine

However I did observe that memory use of the hydra_pmi_proxy process remained low and constant, unlike when I used Intel MPI originally. But thanks for the suggestion; it was worth a try.

Michael_S_17 · ‎04-15-2022

Both ifort and gfortran/OpenCoarrays show the same pattern of low coarray performance with (Intel) MPICH. As far as I currently understand, this is related to the way MPICH has to be configured to work with mismatched arguments in the Fortran compilers (an MPICH requirement for its functions with void?).

Starting with gfortran release 10.0.0, the compiler detects argument mismatches differently, and MPICH must be configured accordingly. As far as I currently understand, this new way of configuring MPICH for the Fortran compilers leads to very low coarray performance with gfortran. (I observe exactly the same low-performance pattern with ifort and Intel MPICH.) I tried recent gfortran (with OpenCoarrays) with different MPICH versions, and all gave the same poor coarray performance.

A simple trick to achieve high coarray performance with MPICH is to use older Fortran compiler releases: gfortran releases prior to 10.0.0. Earlier ifort releases also gave much higher coarray performance (I just don't recall which ifort version was the cutoff).

NCarlson · ‎04-13-2022

Thanks Ron for the quick reply. That sets my mind at ease.

I was not aware I could use a non-Intel MPI with coarrays. I'll give that a shot and see what happens.

Thanks!
