I have a package (index-map) that handles the parallel halo exchange operation associated with domain decomposition methods for PDEs. I've recently written an alternative implementation that uses coarrays instead of MPI, and I am finding that it performs very poorly compared to MPI.
I'm using an example program from that package to test performance. The example solves the heat equation on the unit disk using a finite volume (FV) discretization and explicit forward Euler time stepping. A time step consists of a parallel halo exchange to update boundary unknowns with the values from other processes that own the unknowns, followed by a process-local computation to advance the local unknowns. I'm timing the time step loop over thousands of time steps to get an average time per time step. Here are some sample times (usec) using 4 processes (MPI ranks or coarray images) on a shared-memory workstation (which has 12 cores), using the NAG 7.1 compiler and the Intel 2021.5.0 compiler:
Both compilers are using the same code. On a problem 4 times larger, the NAG times are 37 and 34, but the Intel executable eventually segfaults after a very long time. Watching "top" shows that the hydra_pmi_proxy process uses a significant amount of CPU time and that its memory usage keeps increasing throughout the run, starting from a small value like 5 MB and growing to over 5 GB. Memory usage of the individual program images remains small and constant. By comparison, gfortran with OpenCoarrays also uses MPICH (I understand that Intel MPI is derived from MPICH), and in its runs (which complete successfully) the memory usage of the hydra_pmi_proxy process remains small and constant throughout.
I'd be interested to know if anyone can reproduce my findings. Full details are in the link above; the test I'm running is disk-fv-parallel.
Here are some further details for those that are interested but don't want to hunt through the code. The loop being timed is this:
    call system_clock(t1)
    do step = 1, nstep
      call cell_map%gather_offp(u_local(1:)) ! THE HALO EXCHANGE
      u_prev = u_local
      do j = 1, cell_map%onp_size
        !u_local(j) = u_prev(j) + c*(sum(u_prev(cnhbr_local(:,j))) - 4*u_prev(j))
        block
          integer :: k
          real(r8) :: tmp
          tmp = -4*u_prev(j)
          do k = 1, size(cnhbr,1)
            tmp = tmp + u_prev(cnhbr_local(k,j))
          end do
          u_local(j) = u_prev(j) + c*tmp
        end block
      end do
    end do
    call system_clock(t2, rate)
    module subroutine gather_offp(this, local_data)
      class(index_map), intent(in) :: this
      real(r8), intent(inout) :: local_data(:)
      call gath2_r8_1(this, local_data(:this%onp_size), local_data(this%onp_size+1:))
    end subroutine

    module subroutine gath2_r8_1(this, onp_data, offp_data)
      class(index_map), intent(in) :: this
      real(r8), intent(in) :: onp_data(:)
      real(r8), intent(inout), target :: offp_data(:)
      integer :: j, k, n
      type box
        real(r8), pointer :: data(:)
      end type
      type(box), allocatable :: offp[:]
      ASSERT(size(onp_data) >= this%onp_size)
      ASSERT(size(offp_data) >= this%offp_size)
      if (.not.allocated(this%offp_index)) return
      allocate(offp[*])
      offp%data => offp_data
      sync all
      n = 0
      do j = 1, size(this%onp_count)
        associate (i => this%onp_image(j), offset => this%offp_offset(j))
          do k = 1, this%onp_count(j)
            n = n + 1
            offp[i]%data(offset+k) = onp_data(this%onp_index(n))
          end do
        end associate
      end do
    end subroutine
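For anyone who would rather not build the whole package, a stripped-down sketch of the same put-through-pointer pattern might look like the following. It is not part of index-map: the program name, the ring-neighbor choice, and the sizes n and nstep are assumptions made purely for illustration, and it has not been shown to reproduce the slowdown or the segfault.

```fortran
! Hypothetical standalone sketch of the pattern in gath2_r8_1:
! a coarray of a derived type with a pointer component, used to
! put data directly into an array owned by another image.
program halo_put_sketch
  use, intrinsic :: iso_fortran_env, only: r8 => real64, int64
  implicit none
  integer, parameter :: n = 1000, nstep = 1000  ! assumed sizes
  real(r8), target :: onp(n), offp(n)
  integer :: step, left, right
  integer(int64) :: t1, t2, rate

  ! Ring topology (an assumption): put to the right, receive from the left.
  right = merge(1, this_image() + 1, this_image() == num_images())
  left  = merge(num_images(), this_image() - 1, this_image() == 1)

  onp  = real(this_image(), r8)
  offp = -1

  call system_clock(t1)
  do step = 1, nstep
    call put_to_neighbor(onp, offp, right)
  end do
  call system_clock(t2, rate)

  ! Each image should now hold its left neighbor's image number.
  if (any(offp /= real(left, r8))) error stop 'halo mismatch'
  if (this_image() == 1) print '(a,f0.2)', 'usec per exchange: ', &
      1.0e6_r8*real(t2 - t1, r8)/real(rate, r8)/nstep

contains

  subroutine put_to_neighbor(src, dest, image)
    real(r8), intent(in) :: src(:)
    real(r8), intent(inout), target :: dest(:)
    integer, intent(in) :: image
    type box
      real(r8), pointer :: data(:)
    end type
    type(box), allocatable :: buf[:]
    integer :: k
    allocate(buf[*])     ! collective allocation on every call, as in gath2_r8_1
    buf%data => dest
    sync all             ! all pointers are set before any remote put
    do k = 1, size(src)
      buf[image]%data(k) = src(k)
    end do
    ! buf is deallocated automatically on return, which synchronizes all images
  end subroutine

end program
```

With gfortran/OpenCoarrays this would typically be built with the caf wrapper and launched with cafrun; with ifort the -coarray option would be needed.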
Our work on Coarrays to date has focused on functionality, not performance. Performance is on our list of things to work on in the future, after we get IFX to a competitive state by the end of this year.
We hope you will see Coarrays coming to IFX soon, but in the same state as in IFORT.
Now, if you think you have a good MPICH, you can bind your Intel CAF program with your MPICH instead of Intel MPI:
[...] you can bind your Intel CAF program with your MPICH instead of Intel MPI
I tried this, but no luck. It runs for a while but ultimately segfaults too, though apparently at a different place:
    type box
      real(r8), pointer :: data(:)
    end type
    type(box), allocatable :: offp[:]
    allocate(offp[*])
    [...]
    deallocate(offp) ! SEGFAULTS HERE, OR AT NEXT STATEMENT IF OMITTED
    end subroutine
However I did observe that memory use of the hydra_pmi_proxy process remained low and constant, unlike when I used Intel MPI originally. But thanks for the suggestion; it was worth a try.
ifort and gfortran/OpenCoarrays both show the same low coarray performance (pattern) with (Intel) MPICH. To my current understanding, this is related to the way MPICH has to be configured to allow mismatched arguments with the Fortran compilers (an MPICH requirement for its functions with void?).
Starting with gfortran release 10.0.0, argument mismatches are detected differently by the compiler, and MPICH must be configured accordingly. To my current understanding, this new way of configuring MPICH for the Fortran compilers leads to a very low coarray performance pattern with gfortran. (I observe exactly the same low coarray performance pattern with ifort and Intel MPICH.) I tried recent gfortran (with OpenCoarrays) with different MPICH versions, all resulting in the same poor coarray performance.
A simple trick to achieve high coarray performance with MPICH is to use older Fortran compiler releases: gfortran releases prior to 10.0.0. Earlier ifort releases also gave much higher coarray performance (I just don't recall before which ifort version that was).