MPI runtime error on repeated "form team" statements

th_fort · 03-03-2022

I am trying to distribute coarray images across teams according to a condition that changes at runtime. To do this, I invoke "form team" several times. However, this results in an MPI runtime error after about 200-2000 invocations, depending on the exact configuration.

My best guess is that there is a communicator leak involved here. I have included a minimal working example, which obviously requires coarray support to be enabled. Maybe this is a coding error and there is a better way to switch images across teams?

I have also attached the output from running the below example with and without the "change team" block.

program fortran_team_debug
    
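use iso_fortran_env, only: input_unit, output_unit, error_unit, team_type

implicit none

integer, parameter :: team_1 = 1
integer, parameter :: team_2 = 2
type(team_type) :: main_team
integer :: team_num
integer :: step = 0
integer :: nstep = 100000000
integer :: seed_size
integer, allocatable :: seed(:)

! Ensure differing random seeds across images. The seed array must have
! the processor-dependent size reported by RANDOM_SEED(size = ...)
call RANDOM_SEED(size = seed_size)
allocate(seed(seed_size))
seed = 2390598 + this_image()
call RANDOM_SEED(put = seed)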

! Main loop
do step = 1, nstep
    ! Team number is assigned according to condition evaluated at runtime
    team_num = assign_team()
    form team(team_num, main_team)
    change team(main_team)
        ! Output number of images in each team
        if(this_image() == 1) then
            write (*, '(a)', advance = 'no') 'Team '
            write (*, '(i0)', advance = 'no') team_number()
            write (*, '(a)', advance = 'no') ' containing '
            write (*, '(i0)', advance = 'no') num_images()
            write (*, '(a)', advance = 'no') ' images at step '
            write (*, '(i0)', advance = 'yes') step
        end if
    end team
end do

contains

! Randomly assign images to team 1 or 2
integer function assign_team()
    real :: rand
    call RANDOM_NUMBER(rand)
    if(rand > 0.5) then
        assign_team = team_1
    else
        assign_team = team_2
    end if
end function
end program fortran_team_debug

Barbara_P_Intel · 03-07-2022

I need some more info.

What version of the compiler are you using?

What is the OS?

th_fort · 03-07-2022

Thanks for your reply. I've tried it on two different machines, both using ifort version 2021.5.0 Build 20211109_000000:

Windows 10 (10.0.19044) on an Intel Core i9-9900k
CentOS Linux (kernel-3.10.0-1127.8.2.el7.x86_64) on 2 Intel Xeon Gold 6240

The output above is from the Windows machine, but it's practically identical for the Linux machine. Please let me know if you need any more information.

Barbara_P_Intel · 03-08-2022

Thank you. Just to be sure I am duplicating your problem correctly I have some more questions.

What compiler options are you using?

How many images?

What did you set to get that MPI output?

th_fort · 03-08-2022

Here are the compiler flags, using 8 images for the Windows machine (pretty much the default x64 Debug profile for Visual Studio 2019 with coarray options added):

/nologo /debug:full /Od /Qcoarray:shared /Qcoarray-config-file:"mpi_config.txt" /Qcoarray-num-images:8 /warn:interfaces /module:"x64\Debug\\" /object:"x64\Debug\\" /Fd"x64\Debug\vc160.pdb" /traceback /check:bounds /check:stack /libs:dll /threads /dbglibs /c

And here is the MPI configuration file, mpi_config.txt:

-genvall -genv I_MPI_DEBUG=5 -genv I_MPI_FABRICS=shm -genv I_MPI_SILENT_ABORT=0 -genv I_MPI_FAULT_CONTINUE=0 C:\path\to\executable.exe

Thank you for looking into it.

Barbara_P_Intel · 03-10-2022

Thanks for the additional information. I get the same failure. I filed a bug report, CMPLRLIBS-33803. I'll let you know when it's fixed.

th_fort · 03-10-2022

Thank you very much! I'll make sure to accept it as a solution as soon as it's fixed.

Michael_S_17 · 03-11-2022

I don’t think it’s a bug with the underlying MPI or the compilers here. I am getting similar runtime failures with ifort, with OpenCoarrays/gfortran using Intel OneAPI MPI, as well as OpenCoarrays/gfortran using MPICH. The error messages from using MPICH are more descriptive: ‘Too many communicators’.

Your code example is probably not how coarray teams should be applied, especially the loop. I will try to explain this briefly.

It is important to understand the underlying APGAS model. Coarray Fortran implements the PGAS model at two levels, SPMD and APGAS. With FORM/CHANGE TEAM we control the execution flow and data allocations at the APGAS level. The APGAS model is an extension of the PGAS model to allow for parallel programming on heterogeneous hardware with different types of accelerators. It was originally developed at IBM and elsewhere, with Coarray Fortran among the languages considered:

https://www.cs.rochester.edu/u/cding/amp/papers/full/The%20Asynchronous%20Partitioned%20Global%20Address%20Space%20Model.pdf

In PGAS programming it is the programmer’s job to minimize the PGAS cost function. With respect to execution flow as well as allocations at the APGAS level, the programmer should minimize usage of FORM/CHANGE TEAM in favor of an execution flow (especially loops) at the SPMD level as much as possible:

program Main
  use, intrinsic :: ISO_FORTRAN_ENV, only: team_type
  implicit none

  ! enum for coarray team handling:
  type :: TeamNumbers_EnumDef
    ! with 4 heterogeneous accelerators:
    integer :: Nvidia_GPU = 1
    integer :: Intel_GPU = 2
    integer :: AMD_FPGA = 3
    integer :: Intel_CSA = 4
    integer :: RemainingImages = 5
  end type TeamNumbers_EnumDef
  ! enum type:
  type (TeamNumbers_EnumDef), parameter :: enum_TeamNumber &
     = TeamNumbers_EnumDef ()

  integer :: i_NumberOfTeams, i_NumberOfImagesPerTeam
  integer :: i_UnusedImages, i_TeamNumber
  type (team_type) :: BaseTeam

  i_NumberOfTeams = 4
  if (num_images() < 4) error stop
  i_NumberOfImagesPerTeam = num_images() / i_NumberOfTeams
  i_UnusedImages = mod(num_images(), i_NumberOfTeams) ! these images are not used

  ! split the available images into child teams:
  if (this_image() <= i_NumberOfImagesPerTeam) then
    i_TeamNumber = enum_TeamNumber % Nvidia_GPU
  else if ((this_image() > i_NumberOfImagesPerTeam) .and. &
           (this_image() <= (i_NumberOfImagesPerTeam * 2))) then
    i_TeamNumber = enum_TeamNumber % Intel_GPU
  else if ((this_image() > i_NumberOfImagesPerTeam * 2) .and. &
           (this_image() <= (i_NumberOfImagesPerTeam * 3))) then
    i_TeamNumber = enum_TeamNumber % AMD_FPGA
  else if ((this_image() > i_NumberOfImagesPerTeam * 3) .and. &
           (this_image() <= (i_NumberOfImagesPerTeam * 4))) then
    i_TeamNumber = enum_TeamNumber % Intel_CSA
  else
    i_TeamNumber = enum_TeamNumber % RemainingImages
  end if

  form team (i_TeamNumber, BaseTeam)
  change team (BaseTeam)
  ! APGAS level execution control:
  BaseTeam_select: select case (team_number())
  case (enum_TeamNumber % Nvidia_GPU)
    ! SPMD level workload here
    if (this_image() == 1) write(*,*)'in team',team_number()
  case (enum_TeamNumber % Intel_GPU)
    ! SPMD level workload here
    if (this_image() == 1) write(*,*)'in team',team_number()
  case (enum_TeamNumber % AMD_FPGA)
    ! SPMD level workload here
    if (this_image() == 1) write(*,*)'in team',team_number()
  case (enum_TeamNumber % Intel_CSA)
    ! SPMD level workload here
    if (this_image() == 1) write(*,*)'in team',team_number()
  case default
    ! unused images
    if (this_image() == 1) write(*,*)'in team',team_number()
  end select BaseTeam_select

  end team

end program Main

th_fort · 03-13-2022

Thank you for your explanation. My code example was meant as a quick and simple way to reproduce the error. As you say, I would also consider it bad practice to re-form the teams in every iteration of the loop. In my actual application, a part of my calculations gets simpler as the program proceeds, thus requiring fewer images and leaving the remaining images to do other work (I use the teams mostly to restrict participation in collective subroutines). Think of it as trying to re-form the teams every hour or so of a multi-day calculation.

However, I think you would agree that an analogous example that repeatedly allocates and deallocates an allocatable variable in a loop like this should work, even though it might not be good coding practice. I know little about MPI programming (which is why coarrays and teams are convenient for me), but as far as I know there is a way to free communicators no longer in use. Is it not possible to free/replace the communicators used by the teams?

Although this might not be a bug, I would prefer the compiler to throw a warning (if this is indeed a restriction of the current standard) or produce a more easily understandable error message at runtime.
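For reference, the MPI-level mechanism alluded to above can be sketched in plain MPI. This is not a statement about what any coarray runtime does internally; whether the team implementation leaks communicators is only the guess made earlier in the thread. The program name, the iteration count, and the grouping rule are hypothetical, and the sketch assumes an MPI library that provides the `mpi_f08` module.

```fortran
program comm_split_free_sketch
  use mpi_f08
  implicit none

  type(MPI_Comm) :: sub_comm
  integer :: rank, color, step, sub_size
  integer, parameter :: nstep = 10000   ! hypothetical iteration count

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, rank)

  do step = 1, nstep
    ! Hypothetical runtime condition: the grouping changes every step
    color = mod(rank + step, 2)

    ! Create a new sub-communicator for this step's grouping
    call MPI_Comm_split(MPI_COMM_WORLD, color, rank, sub_comm)
    call MPI_Comm_size(sub_comm, sub_size)

    ! ... collective work restricted to sub_comm would go here ...

    ! Release the communicator so its resources can be reused;
    ! omitting this call accumulates one live communicator per step
    call MPI_Comm_free(sub_comm)
  end do

  if (rank == 0) write (*, '(a,i0,a)') 'Completed ', nstep, ' split/free cycles'
  call MPI_Finalize()
end program comm_split_free_sketch
```

With the `MPI_Comm_free` call in place, the split can be repeated indefinitely; removing it is one way an MPI program can eventually hit a communicator limit such as the 'Too many communicators' message reported for MPICH.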

Michael_S_17 · 03-14-2022

My mistake: ‘data allocations’ should be ‘coarray allocations’. More precisely, anything that involves collective blocking synchronization at the APGAS execution level (i.e. CHANGE TEAM, END TEAM, reallocating or newly allocating coarrays across teams) should be avoided as much as possible because it pauses or delays further execution on all involved accelerators. In practice, that means reducing fork-join at the APGAS level (see the APGAS paper) as much as possible.

The (simple) solution is to reuse the already allocated coarrays (across teams but also within teams) for different tasks at the SPMD level and also to change tasks at the SPMD level without changing teams at the APGAS level.
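A minimal sketch of that idea, under stated assumptions: the teams are formed once, the loop runs inside a single CHANGE TEAM construct, and the work done by each team is switched at the SPMD level. The program name, the `select_task` helper, the step count, and the placeholder workloads are all hypothetical. Note that team membership stays fixed here, so this does not by itself cover the case where the number of images per task has to change.

```fortran
program reuse_teams_sketch
  use, intrinsic :: iso_fortran_env, only: team_type
  implicit none

  type(team_type) :: work_team
  integer :: team_num, step, task
  integer, parameter :: nstep = 1000   ! hypothetical step count
  real :: partial, total

  ! Form the teams once: first half of the images in team 1, the rest in team 2
  team_num = merge(1, 2, this_image() <= (num_images() + 1) / 2)
  form team (team_num, work_team)

  change team (work_team)
    do step = 1, nstep
      ! Hypothetical runtime condition; it depends only on the team number
      ! and the step, so all images of a team pick the same task
      task = select_task(team_number(), step)

      select case (task)
      case (1)
        partial = real(this_image())   ! placeholder workload A
      case default
        partial = 1.0                  ! placeholder workload B
      end select

      ! Collective restricted to the images of the current team
      total = partial
      call co_sum(total)
    end do

    if (this_image() == 1) then
      write (*, '(a,i0,a,f0.1)') 'Team ', team_number(), ' last total ', total
    end if
  end team

contains

  ! Hypothetical task selector: switches the task every 100 steps
  pure integer function select_task(team, step)
    integer, intent(in) :: team, step
    select_task = merge(1, 2, mod(step / 100 + team, 2) == 0)
  end function select_task

end program reuse_teams_sketch
```

Because FORM TEAM and CHANGE TEAM each execute once, the number of team-related synchronizations no longer grows with the number of loop iterations.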

I am working on this with some success already, but it requires some effort for high performance: a new type of parallel programming model, a new type of non-blocking synchronization method, and a new type of channel-coroutine system that integrates both the SPMD and the APGAS levels, to allow for reuse of the allocated coarrays and to further reduce the PGAS cost function for the data transfers across teams (accelerators). Of course, this is still at an early stage, and so far I am only preparing for heterogeneous CAF programming.
