scoping rules/lifetime of allocatable array used in contained procedure under multithread conditions

Elias_Sabbagh · ‎03-30-2016

Hi folks!

Using Parallel Studio XE 2016 Composer integrated into VS 2013 running on Windows 7.

I'm confused about the lifetime of allocatable arrays declared in a procedure, allocated there, and then read from inside a contained procedure.

I have a design that has worked fine for me for a number of years when trying to calculate integrals. An enclosing procedure that will eventually call a generic integration library routine sets up some parameters required by a contained procedure that serves as the function being integrated. This procedure is only allowed one variable, L, the variable of integration, so any other data needed to evaluate it must be grabbed from the enclosing procedure. A reference to the contained procedure is then passed as an actual argument to the generic integration library. Down inside the integration routines, the contained procedure is eventually called, and the contained procedure can then use the variables from the enclosing procedure to successfully compute whatever it needs to. Interfaces are properly declared in the integration library, and the contained procedure conforms to the interface.

The code I describe is built into a DLL. The DLL is often called simultaneously by multiple threads, so I have made sure not to use and global data, and I've declared routines to be recursive, so I think everything should be reentrant -- all data should be passed on the stack. I build and link using "multithreaded DLL" libraries for both Debug and Release configurations.

Up until a month or two ago, everything worked fine. At some point, either due to a compiler upgrade to XE 2016, or due to refactoring of my code, things have stopped working for the multithreaded case. If I execute the code in the DLL serially, there are no problems! Believe me, I've been up and down my code, and I can't see anything that I've done that seems related to the errors I'm seeing.

The errors that occur when I execute the code in parallel are access violations when the kernel contained procedure attempts to read allocatable arrays. Specifically, the arrays are allocatable members of a scalar user-defined type -- a "structure of arrays." If I refactor and create allocatable arrays of a user-defined type -- an "array of structures" -- I still get access violations. It appears that the problem is with the lifetime/scoping of allocatable arrays in general.

Oftentimes, the code's stack gets creamed, and my exception handlers stop working, bringing down the entire EXE. So there's a bit more to the story than just allocatable arrays being garbage collected too early -- something else must be going on, but I'd like confirmation that the allocatable arrays have not in fact been pulled out from under me before I can use them!

Is there some obscure antipattern regarding the lifetime of allocatable arrays that I've stumbled onto?

Thanks,

Elias Sabbagh

Steven_L_Intel1 · ‎03-31-2016

Without a complete example of what you're doing it's hard to know for sure what the issue is. So the best I can do is tell you that an allocatable array that is local to a procedure, and that does not have the SAVE attribute, is automatically deallocated when that procedure exits. As far a contained procedures go, they have access to any variables declared in their host in that "invocation".

There's one other thing that may be important and that's making sure that the Fortran run-time library is prepared for thread safety. Is the main program that calls this DLL written in Fortran? What compiler options are used to compile it (and the DLL)?

jimdempseyatthecove · ‎03-31-2016

Is the following a proper sketch of your code?

subroutine ContainingProcedure
  real, allocatable :: array(:)
  ...
  call ContainedProceedureInit
  do
    call ExternalIntegrator(ContainedProcedureProcess, arg)
    if(converged) exit
  end do
  ...
contains
subroutine ContainedProceedureInit
  ...
  allocate(array(someSize))
  ...
end subroutine ContainedProceedureInit

subroutine ContainedProcedureProcess(arg)
   sometype :: arg
  ...
  x = ...expression with... array(i)
end subroutine ContainedProcedureProcess
end subroutine ContainingProcedure

In reading the Fortran documentation:

Description

Internal procedures are the same as external procedures, except for the following:

Only the host program unit can use an internal procedure.
An internal procedure has access to host entities by host association; that is, names declared in the host program unit are useable within the internal procedure.
In Standard Fortran, the name of an internal procedure must not be passed as an argument to another procedure. However, Intel® Fortran allows an internal procedure name to be passed as an actual argument to another procedure.
An internal procedure must not contain an ENTRY statement.

An internal procedure can reference itself (directly or indirectly); it can be referenced in the execution part of its host and in the execution part of any internal procedure contained in the same host (including itself).

Your use violates first point and third point.

My assumption is the implementation issues of contained procedure is not defined and therefor left to the compiler vendor to implement. In order for a contained procedure to reference host procedure variables, on of (at least) two things has to occur:

a) the equivalent of a stack frame pointer to the host procedure is passed as a hidden argument (similar to string length), meaning your DLL callback is missing this argument.
b) The implementation is made using an analog of the OpenMP parallel region, whereby all the referenced host associated variables are passed as hidden dummy arguments. This too would mean you are missing the arguments.

It is hard to know why your program works single threaded and not multi-threaded, but if I were to guess is both scenarios (single thread, multi-thread) are using uninitialized (stale) stack data, and the multi-threaded case is corrupting the (uninitialized) stack data, where the single threaded is not.

I suggest that you code within the confines of the formal language. In this case, consider using thread private data with the caveat that you must assure that the external DLL (and that which it calls) does nothing to corrupt the association of the access to thread private data. If the called DLL (and that which it calls), which calls your contained routine does .NOT. use its (their) own thread private data, then you will most likely be OK. However, if it/they do, then you might not be OK.

The other thing you will need to do is to remove the callback routine from the contains section, and placed in the outer most scope. You can declare the callback routine PRIVATE or use a prefix/suffix that will not conflict with external procedure names:

subroutine HostProcedureName_Integrate_Callback(...) ! Former contains name Integrate

Jim Dempsey

Elias_Sabbagh · ‎03-31-2016

Jim and Steve-

I think we're getting somewhere. Jim's quote of the spec is encouraging, since I believe I'm only violating point 1, not 3 -- but how to push the code around and still avoid global data is worrying. I think maybe a more detailed sketch of what I'm doing is needed:

Delphi EXE acts as user interface, and performs I/O to a database. User sets up desired calculation, which is broken into a directed acyclic graph of sub-tasks that might depend on one another, and which are computed as Delphi threads. Tasks are issued in the requisite order, as parallel as possible. The actual code that performs a task can be written in any language, as long as it conforms to cdecl calling convention, is delivered in a DLL, and has a DLL entry point named something like "RunTask1" or "RunTask2." Each Delphi thread loads up the DLL dynamically and transfers control to the task entry point, passing in a raft of Delphi callbacks the task implementation can use to grab stuff out of the backend database as it needs it.

Here is pseudo-Fortran for my implementation of RunTask1, which needs to integrate a function as part of computing elements of a big matrix:

-- we have 3 important .f90 files: RunTask1.f90, CompTask1.f90, and Integ.f90. The first two files are compiled as part of RunTask1.dll. The third file is compiled as part of integration.dll, a stand-alone integration facility that many different tasks can use.

-- RunTask1.f90:

integer function RunTask1(obj,pisterminated,plinefromfortran,pputproductlogversion,pputproductlogusage,psetprogressmin,psetprogressmax,psetprogressposition,pgetomega,pgetworkpiecelayercount,...) bind(C, name='RUNTASK1')
!DEC$ ATTRIBUTES DLLEXPORT::RunTask1
use, intrinsic:: ISO_C_BINDING
use kinds
use types
use importsmod
use comptask1
use c_f_interop
...
implicit none

type(C_PTR), value:: obj   ! for use with Delphi callbacks
type(C_FUNPTR), value:: pisterminated     ! These are various Delphi callbacks
type(C_FUNPTR), value:: plinefromfortran
type(C_FUNPTR), value:: pgetproductlog
type(C_FUNPTR), value:: pputproductlogversion
type(C_FUNPTR), value:: pputproductlogusage
type(C_FUNPTR), value:: psetprogressmin
type(C_FUNPTR), value:: psetprogressmax
type(C_FUNPTR), value:: psetprogressposition
type(C_FUNPTR), value:: pgetomega
type(C_FUNPTR), value:: pgetworkpiecelayercount
...
<other declarations>

! Bind the external subroutines to our Fortran function pointers, declared in importsmod
call C_F_PROCPOINTER(pisterminated, IsTerminated)
call C_F_PROCPOINTER(plinefromfortran, LineFromFortran)
call C_F_PROCPOINTER(pgetproductlog, GetProductLog)
call C_F_PROCPOINTER(pputproductlogversion, PutProductLogVersion)
call C_F_PROCPOINTER(pputproductlogusage, PutProductLogUsage)
call C_F_PROCPOINTER(psetprogressmin, SetProgressMin)
call C_F_PROCPOINTER(psetprogressmax, SetProgressMax)
call C_F_PROCPOINTER(psetprogressposition, SetProgressPosition)
call C_F_PROCPOINTER(pgetomega, GetOmega)
call C_F_PROCPOINTER(pgetworkpiecelayercount, GetWorkpieceLayerCount)
...
RunTask1 = SOME_ERROR_VALUE

! Now we use the callbacks to initialize our local variables, and get ready to call some routines in the comptask1 module
...
RunTask1 = CompTask1Section1(obj,param1, param2, param3, etc)
if( RunTask1 is OK ) then
  RunTask1 = CompTask1Section2(obj,param4, param5)
else
  <jump to cleanup code>
endif

...
RunTask1 = SUCCESS_CODE
<cleanupcode>
end function RunTask1

-- CompTask1.f90:

module comptask1
use, intrinsic:: ISO_C_BINDING
use kinds
use types
use importsmod
use constnts
use integration
use integrationinfo
...
implicit none


contains


integer(ik) function CompTask1Section1(obj,param1,param2,param3,etc)
! Does a bunch of integrating in the course of computing a bunch of matrix elements.
! Will eventually be parallelized via OpenMP if I ever get this bug figured out :)
<declare variables here>
! Loop over a 3D grid of cells and compute something for each
do m = 1, Nz
  do l = 1, Ny
    do k = 1, Nx
      ! We can call various Delphi callbacks if need be.  For example:
      if( IsTerminated(obj).eq..TRUE. ) then
        CompTask1Section1 = ERR_TERMINATED
        <goto cleanup code at end of function>
      endif
      ! Now we call our integration routine, which is unfortunately placed in a different file
      integ_ierr = integ(Fcn,current_integration_dimension,current_integration_bounds,...,sums)
      if( integ_err.ne.0 ) then
        CompTask1Section1 = integ_err
        <goto cleanup code at end of function>
      endif
      ! Here's another Delphi callback:
      SetProgressBar(obj, ...)
     enddo
  enddo
enddo
...
contains

  <lots of little helper routines that aid Fcn, and that refer to variables in CompTask1Section1 via host association>

    recursive integer function Fcn(currdim,bounds,sums) result(ierr)
    ! Compute integrands for different cases simultaneously and returns them in sums
    implicit none
    integer, intent(inout) :: currdim
    type(TIntegrationBounds), intent(inout) :: bounds(:)
    complex, intent(inout) :: sums(:,0:)
    ...
    ierr = 0
    ....
    if((<one case>) then
	    sums(1,0) = <some expression involving other functions and variables>
	    sums(2,0) = <some expression involving other functions and variables>
	    sums(3,0) = sums(2,0)
            ... etc
    else
	    sums(1,0) = <another expression involving other functions and variables>
	    sums(2,0) = <another expression involving other functions and variables>
            ... etc
    endif
    return
    end function Fcn
  
end function CompTask1Section1

-- Integ.f90:

module integration
use kinds
use integrationinfo
implicit none

contains

recursive integer function integ(subr,currdim,bounds,...sums) result(ierr)
!DEC$ ATTRIBUTES DLLEXPORT :: integ
! Returns in sums the values of the integrals described by the other dummy arguments.
! Works recursively to handle multidimensional integrals, and tracks the current dimension via currdim
! subr may perform integrand evaluations in parallel.  sums is a two-dimensional array, where the first
! dimension stores the results of the functions, and the second dimension (which must match
! size(bounds)+1) stores the results of the integrations, for all of the dimensions, and
! supplies one dimension of work space for subr.
implicit none
interface
	integer function subr(currdim,bounds,sums) result(ierr)
        use kinds
        use integrationinfo
		implicit none
		integer, intent(inout) :: currdim
		type(TIntegrationBounds), intent(inout) :: bounds(:)
		complex, intent(inout) :: sums(:,0:)
	end function subr
end interface
integer, intent(inout) :: currdim
type(TIntegrationBounds), intent(inout) :: bounds(:)
complex, intent(inout) :: sums(:,0:) ! values of integrals: first dimension is number of functions, second is number of integrals + 1 (zero-based)
...
! based on the incoming integration info, do one or another style of integration.
! Subroutines will eventually set the variable(s) of integration, decrement currdim,
! and call integ recursively, until finally currdim bottoms out, and subr is called with
! the appropriate multidimensional variable(s) of integration set in the bounds parameter.
...
end function integ

end module integration

-- and that's more or less the complete design. It seems to me that the problem is that I've split off the integ() subroutine into Integ.f90. Control flow starts in CompTask1.90 with CompTask1Section1(), jumps over to Integ() in Integ.f90 (crossing a DLL boundary, if it makes any difference), and then returns to CompTask1.90 with a call of the contained procedure Fcn(). Along the way, the stack is creamed by some other thread doing much the same thing, using Integ() in the integration DLL. Wires are crossed, and task1's Fcn() can't see the host associated variables from CompTask1 -- instead, it is presented with garbage.

Is it going to be possible to maintain a separate integration DLL? If not, is it going to be possible to even maintain a separate Integ.f90? I'd hate to have to cut-n-paste the same copy of Integ() into each CompTask1.f90, CompTask2.f90, etc. I suppose I could always include the text in a preprocessing step, but that solution just seems kludgey.

Is there a way to flip the design inside out? Instead of "driver calling recursive integ calling driver child," could it be something like "driver passes sub-driver to recursive integ calling sub-driver calling sub-driver child?"

Sorry for the titanic post, and thanks for your help.

Elias Sabbagh

jimdempseyatthecove · ‎03-31-2016

What you have outlined above is not what you described in your post #1. Post #1 (to my interpretation) stated you were using Host procedure contained routines. In the above, you have a module contained routine. In a module contained routine, the data in the data portion of the module is effectively global to those using the module, and contained routines themselves are also global to those using the module. Therefore you have a conflict when different threads call a procedure that uses the above modules and reference the module declared and defined data (they are sharing the data).

The technique I like to follow is to declare a user defined type containing the required data, then declare an allocatable pointer to thread private instance of the type. Your run once init function would allocate the type, then initialize to whatever you need. You optionally could instantiate a thread private instance of the data, but my preference is to use the pointer. This cuts down on the ~2GB limit for static data (when on Windows).

The constants and parameters can be used in the global context.

Jim Dempsey

Elias_Sabbagh · ‎03-31-2016

Hi Jim!

Thanks for responding, and I'm sorry if I caused confusion by not simply dumping out the whole bag of code in the first place.

But now I'm confused -- more than usual. I've been very careful not to declare any module data, for the very reason that you describe -- multiple threads trying to access it, or calling routines that accessed it, would very quickly cause some sort of a race condition. I've been hoping to rely on host association, as defined in the IVF help under the "Scope" topic, where we also see the sentence "Named entities within the host scoping unit are accessible to the nested scoping unit by host association." I thought I was home-free! I guess it doesn't say anything about the reentrancy I could expect. Are local variables in a module subprogram really not saved on the stack, but in a static data area instead?

In earlier designs I've used your idea of a pointer to extra data being passed down through a chain of integration routines, until the function being integrated can unpack it when called -- but that approach causes a separate copy the integration routines to be written for every single data pointer type that might be useful. If I was coding in C/C++, I'd just declare a void *, but Fortran is very typesafe. I've used Cray pointers before, but they were non-standard. I've also used the transfer() trick to allocate some memory, paint over it with the data I want, and pass the pointer to the "blank" memory down to the function, where it's transfer()'ed back into a user-defined type. That, too, seemed kludgey, and just as with the Cray pointer hack, it's kind of a maintenance hassle. The contained procedure route really seemed like the perfect solution.

So I guess my question has boiled down to -- "Isn't there any way to write a generic integration library that 1) doesn't use void pointers or memory buffers of some sort and 2) doesn't have special knowledge of the function it's trying to integrate?"

Thanks,

Elias Sabbagh

IanH · ‎03-31-2016

Elias Sabbagh wrote:
So I guess my question has boiled down to -- "Isn't there any way to write a generic integration library that 1) doesn't use void pointers or memory buffers of some sort and 2) doesn't have special knowledge of the function it's trying to integrate?"

An approach I have used is for the integration procedure to take the thing to be integrated as a polymorphic object of abstract type, with the function to be evaluated being a specified binding of the type. All the extra data that you want the procedure being integrated to have access to is stashed away in components of a concrete extension of the abstract type. You may find an example using a Runge-Kutta integrator if you search comp.lang.fortran for "where is the green sheep".

Elias_Sabbagh · ‎04-01-2016

Hi Ian-

Damn, that's nice. I keep forgetting that Fortran can do this stuff now :). At first blush it looks like you define an integration pipeline that works on any descendant of the abstract type. The abstract type includes a "virtual" method pointer -- I'm guessing that's what the DEFERRED attribute does. Then, you make a concrete type with the particular state AND the particular "overridden" function pointer that you need, but due to polymorphism, can still be called by the integration routines. The driver code instantiates precisely the derived integration object and derived "function" object desired. I need to read up on these F2003 features!

Regarding thread-safety, and the confusion earlier in this discussion: I'm right to assume that variables declared in module procedures are placed on the stack, are unique to the thread, and are not singleton copies living in the static data area, right? 'Cause if so, I think we've finally found a clean way to do this that's even better than the "contained procedure" attempt.

Elias Sabbagh

jimdempseyatthecove · ‎04-01-2016

>>(am) I'm right to assume that variables declared in module procedures are placed on the stack, are unique to the thread, and are not singleton copies living in the static data area, right?

This depends on compiler options.

No options, default is scalar native types on stack, array native type implicitly SAVE, user defined type not sure.

Various options (including /Qopenmp) make these implicitly on stack. /auto also can be used.

Note, the module with the contained procedures would have to be compiled with /Qopenmp even though it has no !$omp... directives.

Jim Dempsey

Elias_Sabbagh · ‎04-01-2016

Jim, Ian, and Steve-

Thanks for all the great help. I think it's time to rewrite everything as F2003!

Elias Sabbagh

IanH · ‎04-01-2016

Note that your existing approach, where an internal procedure is used as an actual argument (line 30 of CompTask1.f90 in #4), requires Fortran 2008.

The statement in the compiler's documentation that "Only the host program unit can use an internal procedure", as quoted in Jim's post in #3 above, needs clarification - the identifier for an internal procedure is only accessible in the inclusive scope of its host. The language as of Fortran 2008 (and ifort, as an extension long before that) permits the procedure to be invoked from outside the host procedure (via argument association, as in #4, or via procedure pointer association) as long as the relevant instance of the host procedure is still active (the comment about use of an internal procedure name as an actual argument not being standard conforming, quoted in point three of post #3, is written from the perspective of Fortran 2003, which is the language revision that the ifort documentation notionally regards as current).

I don't like relying on inadvertent consequences of implementation detail, instead I try and stay well within the bounds of the behaviour that the standard guarantees. As soon as you start talking about threads and stacks and heaps and data segments (the latter is what Jim meant to say when he said "implicitly SAVE" in #9 - because SAVE is a language concept, not implementation) and whatnot, you are in the world of implementation detail. That aside, because the compiler uses threads in its implementation for some language concepts, such as DO CONCURRENT, other aspects of the implementation need to be able to handle the situation where you have threads calling procedures which then have their internal procedures being referenced via dummy arguments and procedure pointers, while the host procedure instances are still active.

So... in terms of the original problem, without a complete, compilable example, it is difficult to say for sure, but from the description so far I think things should work as the OP expects, though any other sort of error in the code could also be behind the observed problems. In the absence of any other sort of error in the code I suspect something is going astray with the thread safety of the compilers implementation of the thunk necessary to support instances of internal procedures being referenced from outside the host.

I prefer the use of a derived type with binding for this sort of problem, as in my green sheep example, because I consider the nature of the information flow to be more explicit.

Elias_Sabbagh · ‎04-02-2016

Ian-

It really did work, at least for a while. The technique was easy for myself -- and more importantly, for the other F77 speakers I work with! -- to understand. But simplicity come at a price -- incorrectness :). I've already stripped the integration library apart and put it back together in the new OO style, and it's really satisfying, at least for me. Using the new library design will now require me to subclass the new integrand class everyplace there was a contained subprogram, so it's a lot of refactoring, but nothing requiring a lot of thought. One more day of work. My coworkers are just going to have to get used to how things are done in the new century!

Elias Sabbagh

Thomas_K_3 · ‎04-23-2016

This looks quite similar to the issue noted in https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/543705. ; Although Ian's point (that reliance on implementation details is dangerous) is well-taken, it would be very helpful if internal procedures were treated in a thread-safe manner (the post I referenced above shows that they were not a year ago, and my recent experience suggests that this is still true with IVF 16.0).

Steve, in the post I referenced above you mentioned that you were going to file this as a feature request. Has there been any progress in making sure that the "bound procedure value" is thread-local? It sounds as though this would have been useful to Elias, and I would also appreciate it.

In the absence of progress on this feature request, would it be possible to make it clear in the documentation that internal procedures are not treated by IVF in a thread-safe manner?

Elias_Sabbagh · ‎04-23-2016

Thomas K.-

I concur, because in many ways my original design was simpler than my current design. But boy, I love my current design! It's the way I would've done things if I had an OO language in the first place -- which, as it turns out, is what I had all along. Modern Fortran allows the usual OO idioms to be directly expressed. The design is also threadsafe, which is kind of a big deal. I got a lot of help from chapter 11 of http://www.futa.edu.ng/materials/fortran-novella-Holcomb.pdf, and IanH's "sheep" post really took things to another level. The modern Fortran language really makes OO programming straightforward -- I'm as excited by it as when I first learned Delphi.

Elias Sabbagh

Steven_L_Intel1 · ‎04-23-2016

I am not aware that anything has changed here yet.