Solved: ifx/ifort: Coarray Teams are still not properly implemented with the Intel compilers

Michael_S_17 · ‎03-28-2023

The recent Intel compilers ifort (2021.8.0) and ifx (2023.0.0) still have the same severe problem with coarray teams: teams are not implemented as separate SPMD environments, i.e. coarrays can’t be allocated inside a team (within a CHANGE TEAM construct), only across all teams while the initial team is current.

Two code examples that have worked perfectly for years with OpenCoarrays were already given here:

https://community.intel.com/t5/Intel-Fortran-Compiler/coarray-coarray-this-image-for-coarrays-allocated-inside-teams/td-p/1367474

and here:

https://community.intel.com/t5/Intel-Fortran-Compiler/ifort-2021-4-0-runtime-failure-using-SYNC-MEMORY-in-a-coarray/td-p/1340691

The relevant section in Modern Fortran Explained is 20.5, Coarrays allocated in teams.

Only because of this issue, the Intel compilers still can’t be used for serious coarray programming:

I maintain two branches of my codes: the first for use with ifort/ifx (it also works with OpenCoarrays), and the second for use with OpenCoarrays only. The code in the ifort/ifx branch is rather a mess, whereas the code in the OpenCoarrays branch is already of good quality.
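For illustration only, here is a minimal sketch of the "allocate in the initial team" style that the description above says the Intel compilers accept. It is not taken from either branch; the names team_workaround, buffer, two_teams and my_team are hypothetical.

```fortran
! Hypothetical sketch: allocate the coarray while the initial team is
! current, then use it from inside a CHANGE TEAM construct.
program team_workaround
  use, intrinsic :: iso_fortran_env, only: team_type
  implicit none
  integer, allocatable :: buffer(:)[:]
  type(team_type) :: two_teams
  integer :: my_team

  ! Split the images into two halves (a single image ends up in team 1).
  my_team = merge(1, 2, this_image() <= max(1, num_images() / 2))

  ! Allocation in the initial team: every image takes part.
  allocate (buffer(4)[*])
  buffer = this_image()

  form team (my_team, two_teams)
  change team (two_teams)
    ! Inside the construct, image indices are relative to the current team.
    sync all   ! synchronizes the images of this team only
    if (this_image() == 1 .and. num_images() > 1) then
      write (*, *) 'team', team_number(), 'reads', buffer(1)[2]
    end if
  end team

  deallocate (buffer)
end program team_workaround
```

The cost of this style is that every coarray has to be sized and allocated up front by all images, even when only one team needs it, which is exactly what allocation inside a team is meant to avoid.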

Michael_S_17 · ‎08-06-2023

As a reminder, here is another short example program to reproduce the issue when allocating coarrays inside coarray teams:

! bug: ifort/ifx still can't allocate inside CHANGE TEAM constructs
module a
use, intrinsic :: ISO_FORTRAN_ENV, only: team_type
implicit none
private
type, public :: a_type
  private
  integer, codimension[:], allocatable :: coarray_component_1
  integer, codimension[:], allocatable :: coarray_component_2
contains
  private
  procedure, public :: routine => a_routine
end type a_type
type (team_type), public :: a_team

contains

subroutine a_routine (this)
  class (a_type), intent (inout) :: this
  integer :: NumberOfTeams
  integer :: NumberOfImagesPerTeam
  integer :: TeamNumber
  NumberOfTeams = 2
  NumberOfImagesPerTeam = num_images() / NumberOfTeams
  if (this_image() <= NumberOfImagesPerTeam) then
    TeamNumber = 1
  else
    TeamNumber = 2
  end if
  ! no problem across teams:
  allocate (this % coarray_component_1 [*])
  sync memory
  write(*,*) 'everything is fine in team number', team_number()
  !
  form team (TeamNumber, a_team)
  change team (a_team)
    select case (team_number())
    case (1)
      ! no problem in team number 1:
      allocate (this % coarray_component_2 [*])
      sync memory
      write(*,*)' everything is fine in team number', team_number()
    case (2)
      ! ALLOCATE followed by SYNC MEMORY in team number 2 causes a run-time failure
      ! (the program hangs). For this reproducer I had to call SYNC MEMORY explicitly
      ! to trigger the failure; in my real-world codes an explicit SYNC MEMORY is not
      ! needed to trigger such failures.
      ! SYNC MEMORY itself should not be the source of the failure, because I already
      ! use ifx/ifort (as well as OpenCoarrays/gfortran) successfully for advanced
      ! Fortran coreRMA programming without any problems.
      allocate (this % coarray_component_2 [*])
      sync memory
      write(*,*) '   everything is fine in team number', team_number()
    end select
  end team
end subroutine a_routine
end module a


program main
  use a
  implicit none
  type(a_type) :: test_object
  call test_object % routine
end program main

TobiasK · ‎08-07-2023

Hi Michael,

thanks for the reminder, it seems it has been indeed overlooked.

I escalated it to the developers.

Just out of personal curiosity:

What is the benefit from using Coarrays over 'manual' MPI, did you really save time in programming or see a performance gain?

Michael_S_17 · ‎08-10-2023

@TobiasK wrote:
Hi Michael,
thanks for the reminder, it seems it has been indeed overlooked.
I escalated it to the developers.

Just out of personal curiosity:
What is the benefit from using Coarrays over 'manual' MPI, did you really save time in programming or see a performance gain?

Hi Tobias,

MPI: two-sided communication

RMA: one-sided communication

Coarrays are RMA.

For the massive increase in communication-computation overlap that exascale hardware requires, asynchronous and task-based programming models are an attractive approach. This leads to fine-grained communication, which requires non-blocking synchronization techniques.

To keep MPI viable for this kind of programming, hardware-accelerated message matching is one currently proposed solution: https://www.queensu.ca/academia/afsahi/pprl/papers/CCGrid-2019.pdf

RMA/coarrays, on the other hand, do not need any kind of message matching. Coarrays provide a simple but powerful higher-level interface to coreRMA programming:

http://w.unixer.de/publications/img/dan-rma-model.pdf

https://www.sri.inf.ethz.ch/publications/dan2016modeling

Especially with Fortran’s SYNC MEMORY statement (RMA flush), we have a powerful means to implement higher-level non-blocking synchronization primitives easily and in pure Fortran. Then we can use Fortran’s BLOCK construct to easily implement single-image asynchronous task execution as well (i.e. multiple tasks simultaneously on each coarray image), to allow for (unlimited) communication-computation overlaps. I am using these techniques currently to implement data-flow programming models for spatial architectures in pure Fortran, which leads to a simple programming style that should allow for new types of algorithms on exascale hardware as well: https://github.com/MichaelSiehl/Spatial_Fortran_1
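As a minimal sketch of the basic idea (a one-sided put, a SYNC MEMORY flush, and an atomic flag instead of a blocking SYNC IMAGES), something like the following could serve as a starting point. It follows the well-known spin-wait pattern and is not taken from the linked repository; the names sync_memory_flag, ready, seen and payload are hypothetical, and it assumes at least two images.

```fortran
! Hypothetical sketch: user-defined ordering between two images
! with SYNC MEMORY and atomic subroutines.
program sync_memory_flag
  use, intrinsic :: iso_fortran_env, only: atomic_int_kind
  implicit none
  integer(atomic_int_kind) :: ready[*] = 0   ! flag, one per image
  integer(atomic_int_kind) :: seen
  integer :: payload[*]

  if (num_images() < 2) error stop 'run with at least 2 images'

  select case (this_image())
  case (1)
    payload[2] = 42                                  ! one-sided put to image 2
    sync memory                                      ! end the segment (flush)
    call atomic_define (ready[2], 1_atomic_int_kind) ! signal image 2
  case (2)
    seen = 0
    do while (seen == 0)                             ! spin on the local flag
      call atomic_ref (seen, ready)
    end do
    sync memory                                      ! start a new segment
    write (*, *) 'image 2 received', payload
  end select
end program sync_memory_flag
```

Image 1 never waits for image 2 here, so it is free to continue with other work after setting the flag. Whether the spin loop makes progress efficiently depends on the hardware and the runtime, and real codes built on this idea will look quite different.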

cheers

TobiasK · ‎08-17-2023

Hi Michael,

Well, MPI 2.0 and especially MPI 3.0 include one-sided RMA in addition to two-sided communication. To my knowledge, all of the PGAS languages have at least an MPI backend.

That's why I asked if you have some experience with Coarrays that gives significantly better performance than if you would do it on your own with MPI 3.

For sure it replaces the MPI RMA semantics with coarray semantics; however, I am not so sure how much programming time one really saves. At the same time one loses access to all the libraries relying on MPI or OpenMP.

Best

Tobias

Michael_S_17 · ‎08-18-2023

@TobiasK wrote:
MPI 2.0 and especially MPI 3.0 include one-sided RMA in addition to two-sided communication. To my knowledge, all of the PGAS languages have at least an MPI backend.

That is implementation vs. functionality, my viewpoint is functionality. Fortran (since the 2008 standard) does offer PGAS with additional coreRMA programming capabilities through coarrays. The same is not necessarily true for other PGAS languages: “UPC++ does not provide any calls that implicitly “fence” or “flush” outstanding asynchronous operations; all operations are synchronized explicitly via completions.”, https://upcxx.lbl.gov/docs/html/guide.html#interaction-of-collectives-and-operation-completion .

I can only share my personal Fortran-based viewpoint: I can’t see how the implementation of data-flow programming models could possibly work without Fortran’s additional coreRMA functionality through coarrays. PGAS alone for data flow? I can’t see that. CoreRMA alone, on the other hand, would not work for me either. It’s rather the combined use of both, PGAS and coreRMA through coarrays, that already opens up a new world of Fortran programming (so far only on CPUs with today’s Fortran compilers).

@TobiasK wrote:
That's why I asked if you have some experience with Coarrays that gives significantly better performance than if you would do it on your own with MPI 3.

I don’t think anybody can do a serious performance comparison or test yet, because performance will certainly stem mainly from the underlying programming model, which programmers/designers must adapt to (new kinds of) algorithms. It could easily take years to figure out the new possibilities for algorithm development on new types of hardware when applying one-sided communication techniques.

Nevertheless, to get some notion of the performance possible with one-sided communication through coarrays: avoiding the argument-mismatch flags in

./configure FFLAGS=-fallow-argument-mismatch FCFLAGS=-fallow-argument-mismatch

when building MPICH (on my computer) is a simple way to achieve very high coarray performance with gfortran/OpenCoarrays (and possibly with the Intel compilers as well).

@TobiasK wrote:
For sure it replaces the MPI RMA semantics with coarray semantics; however, I am not so sure how much programming time one really saves. At the same time one loses access to all the libraries relying on MPI or OpenMP.

It is not only PGAS and coreRMA functionality but also Fortran’s additional functionality, e.g. combined coarray with array syntax, BLOCK construct, as well as symmetric memory through coarrays.
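A small sketch of how these features combine: a statically declared coarray (symmetric memory), a whole-array one-sided put written in array syntax, and a BLOCK construct for scoped temporaries. The names array_syntax_put, halo, work and right are hypothetical and chosen only for this illustration.

```fortran
! Hypothetical sketch: array syntax + coarray put + BLOCK construct.
program array_syntax_put
  implicit none
  integer, parameter :: n = 4
  real :: halo(n)[*]        ! symmetric memory: same shape on every image
  integer :: me, right

  me = this_image()
  right = merge(1, me + 1, me == num_images())   ! periodic right neighbour

  halo = 0.0
  sync all                  ! all images have initialized halo

  block
    real :: work(n)         ! temporary that exists only inside this block
    integer :: i
    work = [(real(me * 10 + i), i = 1, n)]
    halo(:)[right] = work   ! one-sided put of a whole array section
  end block

  sync all                  ! puts are complete before anyone reads
  write (*, *) 'image', me, 'received', halo
end program array_syntax_put
```

The put is a single assignment statement; no buffers, datatypes or matching receive call have to be set up on the target image.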

Using existing library codes, usually optimized through (two-sided) MPI, together with (one-sided) PGAS/coreRMA programming is an important topic with Fortran as well: https://www.mcs.anl.gov/papers/P5050-1213.pdf .

(In section 5 the authors request a non-blocking flush operation in MPI. As far as I understand it so far, the paper focuses on PGAS functionality alone, from the implementers’ point of view. Using direct coreRMA programming (SYNC MEMORY) together with a suitable programming model, I don’t think such a request for a non-blocking flush is still necessary (I could be wrong); we can already overlap communication with computation efficiently.)

regards

TobiasK · ‎11-10-2023

Hi @Michael_S_17

The developers informed me that the fix will be included in the 2024.1 release; sorry that it did not make it into 2024.0.

PS: still got no time to read through the docs you provided:(

Michael_S_17 · ‎11-12-2023

Thanks very much for that feedback. I will check the ifx/ifort releases, when they are available, also with my other coarray codes that already work with OpenCoarrays.

The above links to documents regarding RMA are only for reference purposes. A more convenient way into the topic is chapter 19 of the DPC++ book (even though DPC++/SYCL is not based on RMA/PGAS, many topics there apply to Coarray Fortran as well): https://link.springer.com/chapter/10.1007/978-1-4842-5574-2_19

To enter the world of RMA programming in Fortran, the ‘Modern Fortran (2018) explained’ book gives a SYNC MEMORY/atomics syntax example in appendix A.9.1 (deprecated features), figure A.2 on page 453. The warnings there are valid: all this depends heavily on hardware support and implementation and the programmer is required to ensure a sequentially consistent memory ordering / happens-before orderings. In practice, coarray RMA-codes are very different from that example code.

Regards