Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI_Finalize errors unless MPI_Win_Free is called

csubich
Novice

It appears that Intel MPI (version 2021.5) has an un(der?)documented requirement that one-sided communication windows (such as those created by `MPI_Win_Create`) must be cleaned up with MPI_Win_Free before MPI_Finalize is called.  Otherwise, the program aborts inside the MPI_Finalize call.

 

A minimal program:

program test_window
   !! Test whether MPI dies when a window is created but not freed before MPI_Finalize
   use mpi_f08
   use, intrinsic :: iso_fortran_env
   use, intrinsic :: iso_c_binding
   implicit none
   integer, dimension(10) :: window_array
   integer :: myrank, numproc
   type(MPI_Win) :: created_window

   call MPI_Init()
   call MPI_Comm_size(MPI_COMM_WORLD,numproc)
   call MPI_Comm_Rank(MPI_COMM_WORLD,myrank)

   write(0,'("Rank ",I0,"/",I0," initialized")') myrank+1, numproc

   call MPI_Win_Create(window_array, int(10,kind=MPI_ADDRESS_KIND), &
                       1, MPI_INFO_NULL, MPI_COMM_WORLD, created_window)

   write(0,'("Rank ",I0," created window")') myrank+1

   call MPI_Finalize()

   write(0,'(" Rank ",I0," finalized")') myrank+1

end program

 

with execution:

 

$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2021.5 Build 20211102 (id: 9279b7d62)
Copyright 2003-2021, Intel Corporation.
$ mpiifort test_window.F90 -o test_window # compilation
$ mpirun -np 2 ./test_window # execution
Rank 2/2 initialized
Rank 1/2 initialized
Rank 1 created window
Rank 2 created window
Abort(806968335) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
PMPI_Finalize(218)...............: MPI_Finalize failed
PMPI_Finalize(160)...............: 
MPID_Finalize(1476)..............: 
MPIDI_OFI_mpi_finalize_hook(2291): OFI domain close failed (ofi_init.c:2291:MPIDI_OFI_mpi_finalize_hook:Device or resource busy)

 

This causes a small amount of trouble.  I intend to use the one-sided communication windows inside a much larger, older program that doesn't bother with pesky things like deallocation, allowing the operating system to do it on exit.

 

I can't find any part of the MPI 3.1 specification that marks this program as having erroneous behaviour, but I could be wrong.  This code also works as intended (without the abort) on a workstation with OpenMPI.

 

If this program is indeed spec-compliant, are there any environmental or configuration flags to avoid the program abort?

11 Replies
AishwaryaCV_Intel
Moderator

Hi,


Thank you for posting in intel communities. 


We were able to reproduce the issue. We are working on it and will get back to you soon.


Thanks And Regards,

Aishwarya


AishwaryaCV_Intel
Moderator

Hi,

 

The RMA window must be explicitly freed after it is created with MPI_Win_Create; we also suggest calling MPI_Win_fence before MPI_Win_Free for proper synchronization.
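
Applied to the reproducer above, a minimal sketch of that cleanup (assuming the window handle created_window is still in scope) would be:

   ! Sketch: clean up the window before finalizing. No RMA epoch was ever opened
   ! in the reproducer, so the fence here is purely precautionary synchronization.
   call MPI_Win_fence(0, created_window)
   call MPI_Win_free(created_window)   ! the handle is reset to MPI_WIN_NULL
   call MPI_Finalize()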

 

>>>This causes a small amount of trouble. I intend to use the one-sided communication windows inside a much larger, older program that doesn't bother with pesky things like deallocation, allowing the operating system to do it on exit.

I'm not sure that this makes the code more complicated; conceptually it is similar to other allocate/deallocate pairs in programming.

 

I understand that OpenMPI does not produce an error without the MPI_Win_free call, but that can be considered an implementation deficiency rather than an advantage.

 

Finally, Intel MPI ships with the Intel MPI Benchmarks (IMB), which you can compile from source, e.g., from /opt/intel/oneapi/mpi/2021.9.0/benchmarks/imb/src_c

 

You can use the file IMB_window.c located in the above directory (or a similar path on your system, depending on the Intel MPI version) as an example of RMA window creation and destruction.

 

Thanks And Regards,

Aishwarya

 

csubich
Novice

> The RMA window must be explicitly freed after it is created with MPI_Win_Create; we also suggest calling MPI_Win_fence before MPI_Win_Free for proper synchronization.

 

> I understand that OpenMPI does not produce an error without the MPI_Win_free call, but that can be considered an implementation deficiency rather than an advantage.

 

That's a perfectly cromulent requirement, but I ask that it be documented somewhere a little more explicitly.  As of this writing, a Google search for '"MPI_FINALIZE" "MPI_WIN_FREE"' returns this thread as its top result, with little else of relevance on the first page.

 

I've come around to accepting that this behaviour is standards-permissible, but it's not immediately obvious that it's standards-required.  In the documentation of MPI_FINALIZE, the standard (3.1) says:

 

>> Before an MPI process invokes MPI_FINALIZE, the process must perform all MPI calls needed to complete its involvement in MPI communications: It must locally complete all MPI operations that it initiated and must execute matching calls needed to complete MPI communications initiated by other processes. For example, if the process executed a non-blocking send, it must eventually call MPI_WAIT, MPI_TEST, MPI_REQUEST_FREE, or any derived function; if the process is the target of a send, then it must post the matching receive; if it is part of a group executing a collective operation, then it must have completed its participation in the operation.

 

This description notably does not deal with one-sided communication.  It admits an interpretation where windows have to be freed, but it also allows for an interpretation where MPI_FINALIZE is valid if there are no active access or exposure epochs.   There may be a further hint towards this requirement in the MPI specification section referring to dynamic process management (section 10.5.4):

 

>> To disconnect two processes you may need to call MPI_COMM_DISCONNECT, MPI_WIN_FREE, and MPI_FILE_CLOSE to remove all communication paths between the two processes. Note that it may be necessary to disconnect several communicators (or to free several windows or files) before two processes are completely independent. (End of advice to users.)

 

… however, this section also seems to apply only to dynamic applications that attach and remove processes from each other's communicators.  In particular, a strict reading would suggest that an MPI application that creates a communicator must call MPI_COMM_DISCONNECT (and not just MPI_COMM_FREE, per the subsequent note) for each created communicator before finalizing, and no MPI library that I'm aware of tries to impose this requirement.

 

> I'm not sure that this makes the code more complicated; conceptually it is similar to other allocate/deallocate pairs in programming.

 

That's the problem.  The legacy application I'm working on does not have allocate/deallocate memory pairs.  It's designed to execute in a batch environment, where workspace arrays are allocated once but are implicitly freed by program termination.

 

In normal operation, this is not a major limitation.  MPI cleanup code can be added before the one spot where the program terminates normally, after completing its batch.  However, the problem arises with abnormal termination; the program makes the implicit assumption that a detected error (such as invalid input data) can always promptly, simply, and correctly end the program by:

 

! Perform any necessary synchronization of currently outstanding communications (MPI_Wait, etc.)
call MPI_Finalize()
stop 'Error detected'

 

In particular, a call to MPI_Abort or any other programmatically-abnormal termination generates misleading execution traces, burying whatever diagnostic messages are produced about the true error.  (That is, if I'm asking the program to read a file that doesn't exist, then it shouldn't be yelling at me about OFI domain closings per the thread-opening post).

 

MPI Windows in the Intel MPI framework, however, are evidently treated as long-running communications that are not visible within the local scope.  That poses a problem, particularly since some detectable errors may occur before the windows are even created, so an unconditional MPI_WIN_FREE would clearly be wrong.
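
One defensive option for such error paths is a guarded free; the following is only a sketch, and it assumes that every window handle (here created_window) is assigned MPI_WIN_NULL before any window is created and is only ever overwritten by MPI_Win_Create:

   ! Error-path shutdown sketch: free only windows that were actually created.
   ! Assumes created_window was set to MPI_WIN_NULL up front (e.g. right after
   ! MPI_Init) and that no RMA epoch is left open on it.
   if (created_window /= MPI_WIN_NULL) call MPI_Win_free(created_window)
   call MPI_Finalize()
   stop 'Error detected'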

csubich
Novice

As a note for posterity, one workaround for this issue is to attach a delete callback to an attribute on MPI_COMM_SELF, to be called when that communicator is freed as the first step of MPI_Finalize (MPI spec 3.1, §8.7.1).  The callback can call MPI_Win_free on previously created windows, cleaning everything up before MPI_Finalize continues.

 

This solution is imperfect.  It requires separate bookkeeping and lifecycle tracking of MPI_Win objects to prevent double frees (an error condition), but it suffices for my use case, where only a handful of mostly static windows are created.
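
For completeness, a minimal sketch of that workaround follows, assuming mpi_f08 and a single tracked window; the module and procedure names (win_cleanup, register_cleanup, free_window, tracked_window) are purely illustrative:

module win_cleanup
   !! Illustrative helper: free a still-open RMA window when MPI_COMM_SELF is
   !! freed (and its attributes deleted) as the first step of MPI_Finalize
   !! (MPI 3.1, section 8.7.1).
   use mpi_f08
   implicit none
   type(MPI_Win), save :: tracked_window
   logical, save :: tracking = .false.
contains
   subroutine register_cleanup(win)
      type(MPI_Win), intent(in) :: win
      integer :: keyval
      tracked_window = win
      tracking = .true.
      ! Create a keyval whose delete callback frees the tracked window
      call MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, free_window, &
                                  keyval, int(0, kind=MPI_ADDRESS_KIND))
      ! Attributes on MPI_COMM_SELF are deleted at the start of MPI_Finalize,
      ! so the delete callback runs before the rest of finalization proceeds
      call MPI_Comm_set_attr(MPI_COMM_SELF, keyval, int(0, kind=MPI_ADDRESS_KIND))
   end subroutine register_cleanup

   subroutine free_window(comm, comm_keyval, attribute_val, extra_state, ierror)
      type(MPI_Comm) :: comm
      integer :: comm_keyval, ierror
      integer(kind=MPI_ADDRESS_KIND) :: attribute_val, extra_state
      ! Free only a window that is still open, avoiding a double free
      if (tracking) then
         if (tracked_window /= MPI_WIN_NULL) call MPI_Win_free(tracked_window)
         tracking = .false.
      end if
      ierror = MPI_SUCCESS
   end subroutine free_window
end module win_cleanup

In the reproducer above, one would call register_cleanup(created_window) right after MPI_Win_Create.  Note that this sketch only tracks its own copy of the handle: if the program frees the window itself through another handle, the module is not informed, which is exactly the separate bookkeeping problem mentioned above.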

AishwaryaCV_Intel
Moderator

Hi,


We are working on it and will get back to you.


Thanks And Regards,

Aishwarya


AishwaryaCV_Intel
Moderator

Hi,


There is no implicit resource deallocation. You may refer to this:


https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf


Section 8.7, "Startup"

Page 357, lines 42-43 say:

"The call to MPI_FINALIZE does not free objects created by MPI calls; these objects are freed using MPI_XXX_FREE calls."


Thanks And Regards,

Aishwarya






csubich
Novice
> There is no implicit resource deallocation. You may refer to this:

Indeed, but I would not expect the internal data associated with a window (or the memory allocated by an MPI_Alloc_mem or MPI_Win_allocate call) to be freed at the time of MPI_Finalize.

 

It is not an error, for example, to fail to free MPI datatypes created by MPI_Type_create_* before calling MPI_Finalize, nor is it an error to fail to free communicators via MPI_Comm_free.  The objects so created have undefined state after the MPI_Finalize call and should not be used, but the Finalize call itself is not expected to error.  (In fact, freeing the objects after finalization would be an error, since it would involve calls to routines that are forbidden after finalization.)

 

The implicit assumption being made by Intel MPI seems to be that an MPI_Win object is by itself a communication, relying on the following part of the specification (also p357, lines 34-37):

 

>> Before an MPI process invokes MPI_FINALIZE, the process must perform all MPI calls needed to complete its involvement in MPI communications: It must locally complete all MPI operations that it initiated and must execute matching calls needed to complete MPI communications initiated by other processes.

However, it's not obvious that an MPI Window object is a long-lived communication.  The MPI specification seems to draw a distinction between the window itself and RMA communications.  Section 11.3 lists the "RMA communication calls" (Put/Get/etc) in a separate section from 11.2 (initialization/windows), and section 11.5 defines "active target communication" and "passive target communication," which involve access or exposure epochs created by the RMA synchronization routines.

 

It doesn't help that the MPI standard seems to lack an explicit definition for 'communication', instead leaving it to intuitive understanding.  Note also that the specification specifically clears up a similar potential ambiguity regarding MPI I/O (p494, lines 3-4):

 

>> Before calling MPI_FINALIZE, the user is required to close (via MPI_FILE_CLOSE) all files that were opened with MPI_FILE_OPEN.

Again, I don't necessarily think that the Intel MPI behaviour is erroneous, but it was surprising.  It does not appear to be required by the specification, so I think it would be developer-friendly if Intel officially documented this requirement somewhere.

 

AishwaryaCV_Intel
Moderator

Hi,


The allocation/deallocation requirement is not exclusive to Intel MPI but applies to any MPI implementation. Therefore we can refer to the standard MPI 3.1 specification linked in my previous response. However, in the Intel MPI implementation, we ensure that errors are reported.


Thanks And Regards,

Aishwarya


AishwaryaCV_Intel
Moderator

Hi,


We haven't heard back from you. Could you please let us know whether your issue has been resolved, and confirm whether we can close this case?


Thanks And Regards,

Aishwarya


csubich
Novice

I'm sorry, I had missed the previous response to this thread.

At this point, I don't think there's much more I can say to convince you that the documentation is lacking.  My case is the following:

  1. The MPI specification does not explicitly state that MPI Windows must be freed before the call to MPI_Finalize.  It does state this for outstanding nonblocking communication operations and for open MPI files (which must be closed), but not for windows; it is equally silent about MPI communicators, which in practice do not have to be freed.
  2. Not all MPI implementations (see OpenMPI's behaviour above) raise an error when MPI_Finalize is called with an open window.
  3. Therefore, it is reasonable for a programmer to think that MPI Window objects are more like communicators than like MPI_Request objects.  It is surprising when Intel MPI raises an error with the MPI_Finalize call, particularly since the error bubbles up from deeper communication layers.

Thus, I think it would be very helpful and developer-friendly if the Intel MPI documentation made an explicit note of this requirement somewhere.  The existence of such documentation would have saved me a day of debugging, particularly if it had been discoverable via Google.

If Intel thinks this is an unreasonable request, then I have no right to insist.

With that said, I think this thread can be closed now.  'Wontfix' or 'notabug' are both answers to my problem, even if they aren't the answers I'd prefer.

AishwaryaCV_Intel
Moderator

We understand your concern and will forward your feedback to our developer team. We are going to close this thread. If you have any other issues, please raise a new thread.

 

Thanks And Regards,

Aishwarya

 
