I have a package (index-map) that handles the parallel halo exchange operation that is associated with domain decomposition methods for PDE. I've recently written an alternative implementation that uses coarrays instead of MPI, and I am finding that it performs very poorly compared to MPI.
I'm using an example program from that package to test performance. The example solves the heat equation on the unit disk using a finite volume (FV) discretization and explicit forward Euler time stepping. A time step consists of a parallel halo exchange to update boundary unknowns with the values from other processes that own the unknowns, followed by a process-local computation to advance the local unknowns. I'm timing the time step loop over thousands of time steps to get an average time per time step. Here are some sample times (usec) using 4 processes (MPI ranks or coarray images) on a shared-memory workstation (which has 12 cores), using the NAG 7.1 compiler and the Intel 2021.5.0 compiler:
|       | MPI | Coarrays |
|-------|-----|----------|
| NAG   | 15  | 20       |
| Intel | 24  | 43000    |
Both compilers are using the same code. On a problem 4 times larger, the NAG times are 37 and 34, but the Intel executable eventually segfaults after a very long time. Watching "top" shows that the hydra_pmi_proxy process uses a significant share of CPU cycles, and its memory usage grows throughout the run, from a small initial value of roughly 5 MB to over 5 GB. Memory usage of the individual program images remains small and constant. By comparison, gfortran with OpenCoarrays also uses MPICH (I understand Intel MPI is derived from MPICH), and in its runs, which complete successfully, the memory usage of the hydra_pmi_proxy process remains small and constant throughout.
I'd be interested to know if anyone can reproduce my findings. Full details are in the link above; the test I'm running is disk-fv-parallel.
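For reference, the process-local computation is just the explicit per-cell update (this is exactly what the timed loop in my follow-up below computes), with $c$ the time-step coefficient and the sum running over the four cell neighbours $N(j)$ of cell $j$:

$$u_j^{n+1} = u_j^n + c\Big(\sum_{k \in N(j)} u_k^n - 4\,u_j^n\Big)$$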
Here are some further details for those that are interested but don't want to hunt through the code. The loop being timed is this:
call system_clock(t1)
do step = 1, nstep
  call cell_map%gather_offp(u_local(1:)) ! THE HALO EXCHANGE
  u_prev = u_local
  do j = 1, cell_map%onp_size
    !u_local(j) = u_prev(j) + c*(sum(u_prev(cnhbr_local(:,j))) - 4*u_prev(j))
    block
      integer :: k
      real(r8) :: tmp
      tmp = -4*u_prev(j)
      do k = 1, size(cnhbr,1)
        tmp = tmp + u_prev(cnhbr_local(k,j))
      end do
      u_local(j) = u_prev(j) + c*tmp
    end block
  end do
end do
call system_clock(t2, rate)
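The average time per step reported above then comes from t1, t2, rate and nstep in the obvious way; roughly this (just a sketch, and avg_usec is only an illustrative name, not a variable from the package):

real(r8) :: avg_usec
! elapsed clock ticks converted to seconds, then microseconds, averaged over the steps
avg_usec = 1.0e6_r8 * real(t2 - t1, r8) / (real(rate, r8) * nstep)
print *, 'average time per step (usec):', avg_usec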
The gather_offp call is the halo exchange procedure; it is effectively this (see this and this):
module subroutine gather_offp(this, local_data)
  class(index_map), intent(in) :: this
  real(r8), intent(inout) :: local_data(:)
  call gath2_r8_1(this, local_data(:this%onp_size), local_data(this%onp_size+1:))
end subroutine
module subroutine gath2_r8_1(this, onp_data, offp_data)
  class(index_map), intent(in) :: this
  real(r8), intent(in) :: onp_data(:)
  real(r8), intent(inout), target :: offp_data(:)
  integer :: j, k, n
  type box
    real(r8), pointer :: data(:)
  end type
  type(box), allocatable :: offp[:]
  ASSERT(size(onp_data) >= this%onp_size)
  ASSERT(size(offp_data) >= this%offp_size)
  if (.not.allocated(this%offp_index)) return
  allocate(offp[*])
  offp%data => offp_data  ! each image publishes its own off-process buffer
  sync all                ! ensure all images have published before any puts
  n = 0
  do j = 1, size(this%onp_count)
    associate (i => this%onp_image(j), offset => this%offp_offset(j))
      do k = 1, this%onp_count(j)
        n = n + 1
        ! one-sided put of a local on-process value into image i's off-process buffer
        offp[i]%data(offset+k) = onp_data(this%onp_index(n))
      end do
    end associate
  end do
end subroutine
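In case anyone wants to experiment with just the coarray mechanism outside of the package, here is a minimal standalone sketch of the same idiom (a derived-type coarray whose pointer component lets each image publish its own local buffer, with a one-sided put through the coindexed pointer component). This is not code from index-map, just an illustration of the pattern:

program box_put_demo
  implicit none
  type box
    real, pointer :: data(:)
  end type
  type(box), allocatable :: offp[:]
  real, target :: halo(4)  ! stand-in for the off-process buffer
  integer :: left
  halo = 0
  allocate(offp[*])
  offp%data => halo        ! each image publishes its own buffer
  sync all
  left = this_image() - 1  ! put into the left neighbor's buffer (wrap at image 1)
  if (left == 0) left = num_images()
  offp[left]%data(1) = real(this_image())
  sync all
  print *, 'image', this_image(), 'received', halo(1)
  deallocate(offp)
end program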
Our work on Coarrays to date has focused on functionality, not performance. That is on our list of things to work on in the future, after we get IFX to a competitive state by the end of this year.
We hope you will see Coarrays coming to IFX soon, but in the same state as in IFORT.
Now, if you think you have a good MPICH, you can bind your Intel CAF program with your MPICH instead of Intel MPI:
READ THIS for how to use another MPICH with Intel CAF
@Ron_Green wrote:
[...] you can bind your Intel CAF program with your MPICH instead of Intel MPI
I tried this, but no luck. It runs for a while but ultimately segfaults too, though seemingly at a different place:
type box
  real(r8), pointer :: data(:)
end type
type(box), allocatable :: offp[:]
allocate(offp[*])
[...]
deallocate(offp) ! SEGFAULTS HERE, OR AT NEXT STATEMENT IF OMITTED
end subroutine
However I did observe that memory use of the hydra_pmi_proxy process remained low and constant, unlike when I used Intel MPI originally. But thanks for the suggestion; it was worth a try.
Ifort and gfortran/OpenCoarrays both show the same low coarray performance pattern with (Intel) MPICH. To my current understanding, this is related to the way MPICH must be configured to cope with argument mismatches with the Fortran compilers (an MPICH requirement for its functions with void?).
Starting with gfortran release 10.0.0, argument mismatches are treated as errors by the compiler by default, and MPICH must be configured accordingly. To my current understanding, this new way of configuring MPICH with the Fortran compilers leads to a very low coarray performance pattern with gfortran. (I observe exactly the same low coarray performance pattern with ifort and Intel MPICH.) I tried recent gfortran (with OpenCoarrays) against different MPICH versions, with the same poor coarray performance each time.
A simple trick to get high coarray performance with MPICH is to use older Fortran compiler releases: gfortran releases prior to 10.0.0; earlier ifort releases also gave much higher coarray performance (I just don't recall before which ifort version that was).
Thanks Ron for the quick reply. That sets my mind at ease.
I was not aware I could use a non-Intel MPI with coarrays. I'll give that a shot and see what happens.
Thanks!
