OpenMP problem with contained routines

Andrew_Smith · ‎04-20-2009

A contained routine should see variables in the scope of the parent. But when a project is compiled with OpenMP and the parent is called inside a parallel region, the contained routine sees random parent variables!

I have filed Premier Support issue 545221.

Try the attached code.

jimdempseyatthecove · ‎04-20-2009

The problem may not be fixable (other than with "you can't do that").

Your variable threadID is a stack local variable inside subroutine integrator.

Subroutine integrator contains function f(x), which references threadID, a variable in the scope of the host that holds the CONTAINS.

Subroutine integrator calls integrateAnyFunction (nesting deeper on the stack).
integrateAnyFunction calls f(x), and the stack position of the variable changes.
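The attachment is not reproduced in this thread, so the following is only a minimal sketch of the call chain described above, not the original test case. The names integrator, integrateAnyFunction, f and threadID follow the description; the module name, the midpoint rule and the mismatch counter are assumptions added for illustration. It relies on passing an internal procedure as an actual argument (standard since Fortran 2008, and an extension in older compilers). On a compiler that handles host association correctly it should report zero mismatches.

```fortran
module quadrature_sketch
  implicit none
contains

  ! Hypothetical stand-in for a library routine: midpoint rule on [a, b].
  function integrateAnyFunction(g, a, b, n) result(total)
    interface
      function g(x) result(y)
        real, intent(in) :: x
        real :: y
      end function g
    end interface
    real, intent(in) :: a, b
    integer, intent(in) :: n
    real :: total, h
    integer :: i
    h = (b - a) / real(n)
    total = 0.0
    do i = 1, n
      total = total + h * g(a + (real(i) - 0.5) * h)
    end do
  end function integrateAnyFunction

  ! Returns 0 if f saw this thread's own threadID, 1 otherwise.
  function integrator() result(mismatch)
    !$ use omp_lib
    integer :: mismatch
    integer :: threadID            ! stack-local variable of the host
    real :: area
    threadID = 0                   ! value used when built without OpenMP
    !$ threadID = omp_get_thread_num()
    ! f is constant, so its integral over [0, 1] equals threadID.
    area = integrateAnyFunction(f, 0.0, 1.0, 10)
    mismatch = merge(0, 1, abs(area - real(threadID)) < 1.0e-3)
  contains
    function f(x) result(y)
      real, intent(in) :: x
      real :: y
      y = real(threadID)           ! host-associated reference
    end function f
  end function integrator

end module quadrature_sketch

program contained_in_parallel
  use quadrature_sketch
  implicit none
  integer :: mismatches
  mismatches = 0
  !$omp parallel reduction(+:mismatches)
  mismatches = mismatches + integrator()
  !$omp end parallel
  print '(a,i0)', 'threads that saw a wrong threadID: ', mismatches
end program contained_in_parallel
```

The lines starting with `!$` are OpenMP conditional compilation, so the same source also builds serially for comparison.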

The non-OpenMP "fix" to this would have been (or is) to give variables that are referenced in a contained routine (but not passed) an implicit SAVE attribute (someFairlyOrdinaryCode :: integrator :: threadID). For single-threaded code this would work. For multi-threaded code it will not, because OpenMP implicitly makes threadID stack local (automatic). One potential fix is to make the variable THREADPRIVATE, with decorations to specify the scope, e.g. threadprivate(someFairlyOrdinaryCode :: integrator :: threadID), i.e. it becomes a thread-private saved variable.

Also, keep in mind

threadID = OMP_Get_Thread_NUM()

does not return a thread ID (enumeration). It returns a thread team member number (enumeration). When nested parallel regions are in effect, multiple threads will have the same OMP_Get_Thread_NUM() numbers.
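A small sketch of that point, assuming an OpenMP 3.0 or later runtime (for omp_set_max_active_levels) and that the runtime actually grants two threads at each level. The program name and the "combined" number are illustrative assumptions, not something from the original code. The inner team member numbers repeat across the two inner teams, while combining them with the outer number gives a value that is unique across this particular two-level nest.

```fortran
program nested_team_numbers
  use omp_lib
  implicit none
  integer :: outer, inner

  call omp_set_max_active_levels(2)      ! allow one level of nesting

  !$omp parallel num_threads(2) private(outer)
  outer = omp_get_thread_num()           ! member number in the outer team

  !$omp parallel num_threads(2) private(inner)
  inner = omp_get_thread_num()           ! member number in the inner team
  !$omp critical
  print '(a,i0,a,i0,a,i0)', 'outer=', outer, ' inner=', inner, &
        ' combined=', outer * omp_get_num_threads() + inner
  !$omp end critical
  !$omp end parallel

  !$omp end parallel
end program nested_team_numbers
```

With both levels active, inner=0 and inner=1 each appear twice, once per inner team, which is why the raw value is not safe to use as a global thread identifier.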

Comment your code appropriately, else you will fail should you add nested parallel regions later. The next person maintaining your code won't have a clue as to the problem, since the code appears to be doing what it says it is doing (when in fact it is not).

Jim Dempsey

Andrew_Smith · ‎04-20-2009

Thanks Jim. I will give the threadPrivate thing a go tomorrow.

The call to OMP_Get_Thread_Num was just to illustrate the disconnection problem in a simple way. I could have stored the loc of the K variable instead. I may change the name of the threadID variable in my test to make it clear that it is actually a team ID number.

I did find a workaround that works for us but is no good for use with third party maths libraries. That is to create an array of integers to store the loc of the variables and pass this array down through all the calls through the maths code. Then in the contained function I receive the integer array and connect local variables to each of the integers using the IVF extension pointer(i,var). It works quite fast but is not ideal as it requires special coding in the maths library to pass the integer array.

jimdempseyatthecove · ‎04-21-2009

Thread local storage (TLS) is quite easy to use (once you get the hang of it). There is a very small performance penalty in referencing variables in thread local storage. This overhead is generally smaller than passing a context array about. Note, TLS is faster than making a call to OMP_GET_THREAD_NUM(), and you can place into TLS:

integer :: ThreadID ! a 0 or 1 based unique identifier
! Note, this is not the same as OMP_GET_THREAD_NUM()
! which returns the current nest level team member number.

The technique I tend to use is to create a user defined type to hold the thread private data and then place a single pointer to this type into the thread private area. You may elect to place the type itself into TLS, but my preference is to use a single pointer.

Another technique I like is to use TLS to maintain a thread private context of temporary arrays for use by individual subroutines. This reduces the number of allocate/deallocate calls and/or reduces stack size requirements. Inside the thread context object are per-subroutine thread private contexts for scratch arrays. On entry to the subroutine, a function is called with an argument indicating the minimum size for the scratch arrays, and it returns the pointer to the context. When the available size (0 on first use) is smaller than the requested minimum, a deallocate and an allocate to the new size are performed. The subroutine's thread private context thus grows to the size required. Reducing the allocate/deallocate frequency improves performance and reduces memory fragmentation.
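The grow-on-demand idea can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code described above: it keeps a single scratch array per thread rather than one context per subroutine, the names scratch_context and get_scratch are hypothetical, and it assumes a compiler that accepts THREADPRIVATE on an allocatable module variable rather than only on common blocks.

```fortran
module scratch_context
  implicit none
  private
  public :: get_scratch

  ! One scratch buffer per thread; unallocated until first use.
  real, allocatable, target, save :: scratch(:)
  !$omp threadprivate(scratch)

contains

  ! Return a pointer to at least minSize elements of this thread's buffer,
  ! reallocating only when the current buffer is too small.
  function get_scratch(minSize) result(p)
    integer, intent(in) :: minSize
    real, pointer :: p(:)
    if (allocated(scratch)) then
      if (size(scratch) < minSize) deallocate(scratch)
    end if
    if (.not. allocated(scratch)) allocate(scratch(minSize))
    p => scratch(1:minSize)
  end function get_scratch

end module scratch_context

program scratch_demo
  use scratch_context
  implicit none
  real, pointer :: work(:)
  integer :: i

  !$omp parallel private(work, i)
  do i = 1, 3
    work => get_scratch(100 * i)   ! buffer grows on each pass
    work = real(i)
  end do
  work => get_scratch(50)          ! smaller request reuses the buffer
  work = 0.0
  !$omp end parallel

  print '(a)', 'done'
end program scratch_demo
```

Because the buffer is never shrunk, a routine called repeatedly with similar sizes settles into zero allocations after the first few calls.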

And TLS can then hold a quasi structured exception handler to pass back error codes (unique to the thread).

Jim Dempsey

Andrew_Smith · ‎04-22-2009

According to the Intel help file, IVF only supports threadPrivate for common blocks. Can you point me to some help for use with variables?

jimdempseyatthecove · ‎04-22-2009

Here is one way:

[cpp]module mod_common
type    TypeThreadContext
  SEQUENCE
  type(TypeObject), pointer :: pObject
  type(TypeTether), pointer :: pTether
  type(TypeFSInput), pointer :: pFSInput
  integer :: LastObjectLoaded
end type TypeThreadContext

type(TypeThreadContext) :: ThreadContext
COMMON /CONTEXT/ ThreadContext
!$OMP THREADPRIVATE(/CONTEXT/)
end module mod_common

[/cpp]

The above gives each thread a private copy of a "global" pointer.

! inside parallel region
ThreadContext%pObject => ObjectTable(iObj)
...
call foo(other, args, here)

subroutine foo(other, args, here)
use mod_common
real :: other, args, here
! Using thread private copy of "global" pointer
ThreadContext%pObject%value = expression(other, args, here)
end subroutine foo

The pointer to the object is long-lived, and you do not wish to pass it from subroutine/function to subroutine/function. Typically you had an old sequential program that held a copy of an object:

call LoadObject(iObj) ! performs a copy operation of contents of object(iObj) to common Object
! a bunch of calls follow
! using/modifying contents of copy of object
...
call StoreObject(Obj)

When converting to multi-threaded, the above won't work (also, when the object is large, the copy operation is a drag on performance).

The first level of optimization is to convert the single-threaded version to use a pointer to the object. The call LoadObject(iObj) then becomes pObject => Object(iObj), and there is no need to call StoreObject.

When you thread the code, simply include the pointer to object in the thread private data structure.
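As a rough end-to-end sketch of that conversion: the example below replaces the load/copy/store pattern with a thread-private pointer into the object table. It is an illustration under assumptions, not the code from the application being described. TypeObject is reduced to a single hypothetical value component, the table size and the arithmetic in foo are made up, and the pointer is declared as a THREADPRIVATE module variable, which assumes a compiler that supports this outside of common blocks.

```fortran
module object_context
  implicit none

  type :: TypeObject
    real :: value = 0.0
  end type TypeObject

  ! Shared table of objects; each thread points at the one it is working on.
  type(TypeObject), allocatable, target, save :: ObjectTable(:)
  type(TypeObject), pointer, save :: pObject
  !$omp threadprivate(pObject)

contains

  ! Works on the calling thread's current object without receiving it
  ! as an argument.
  subroutine foo(x)
    real, intent(in) :: x
    pObject%value = 2.0 * x
  end subroutine foo

end module object_context

program pointer_instead_of_copy
  use object_context
  implicit none
  integer :: iObj

  allocate(ObjectTable(8))

  !$omp parallel do
  do iObj = 1, size(ObjectTable)
    pObject => ObjectTable(iObj)   ! replaces call LoadObject(iObj)
    call foo(real(iObj))           ! modifies the object in place
  end do                           ! no StoreObject needed
  !$omp end parallel do

  ! Each entry should hold twice its index: 2.0, 4.0, ..., 16.0
  print '(8f6.1)', ObjectTable%value
end program pointer_instead_of_copy
```

Each thread only ever writes through its own pointer to a distinct table entry, so no synchronization is needed inside the loop.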

Note, you can use individual variables in place of a user type holding the thread private data.
Or you can use the FPP to clean up the source code:

#define Object ThreadContext%pObject
subroutine foo(other, args, here)
use mod_common
real :: other, args, here
Object%value = expression(other, args, here)
end subroutine foo

Note, a #define is active from the point of the #define through the end of the source file (or an #undef), and it is case sensitive.
Also, IntelliSense will not expand the macro, so you will have to specify "ThreadContext%pObject" in the debugger.
On the upside, use of FPP #define makes for easy upgrade of legacy code.

Jim Dempsey