Can threadprivate custom type variables be changed from within a parallel region?

Elarion245 · ‎03-23-2020

Hello world,

the title already states what I would like to find out. I have a global variable which is used by OpenMp and declared threadprivate.

Within the parallel loop, the code may find the need to dynamically reload data from the disk. In this case the code currently

extends the data in in allocatable array of the custom type.

This seems to have been working for some time, however now after the parallel region I added some code for extra analysis and

suddenly I get access violations from time to time (not every run) and I think that this might be due to growing the real array within the custom type.

Even if this is non-standard (and I should probably code something around it), I am still wondering why this would be a violation, as the results from the parallel loop look as excepted and once the omp parallel do is finished, I would have thought that memory that belongs to a hibernating thread should be freed?

Thanks for any help you can provide!

jimdempseyatthecove · ‎03-24-2020

Can you show:

What is thread private
What you declare as shared or private (or default) when you enter the parallel region
What you code after the parallel region is doing.

Taking a flying leap of a guess, assuming no glaring coding problems, an un-glaring problem may be the the code preceding a loop sets an initial size for the thread private allocatable objectes (to be allocated in the subsiquent parallel region). After the parallel region, the main thread expands the size to use (and possibly expands the main thread's thread private allocation), the loop cycles and the main thread has the new size, the other team member do not.

Jim Dempsey

Elarion245 · ‎03-24-2020

Hi Jim,

thanks for taking the time to respond. I was trying to avoid the detailed explanation since the overall code is very large and in essence the question in the title should allow for a clear answer according to the specifications (at least I was thinking so).

The behavior you mention would fit my observations. Nevertheless I have not been able to find any documentation about this. Neither in the specifications or the intel pages. I know that typically one would not want to do this and the way the code is now, the variables aren't changed from within a parallel context, but still I was hoping for some explanation.

I can provide some further details though:

In the initialization phase the tool has to read in some files (astrodynamics tables) of data. The routines were coded by others but essentially since the data is linked to days, arrays of a custom type are created for each line. To access the data via the integer day they belong to (modified julian date), the array is initialized using boundaries such as 58000:58900.

These allocatable arrays are part of a larger custom type, which is global in scope and declared threadprivate.

The initialization of the variable is handled via the standard copyin clause in case of a parallelized loop (classical omp parallel do).

Now what happened and led to the bug mentioned here: https://stackoverflow.com/questions/60799041/fortran-openmp-ghost-threads-interrupt-application-upon-allocate

During the loop (which operates on other data), it was found that not enough input lines were read in in the beginning of the tool. In this case the data is automatically reloaded from file. For my specific application I was the first to call a routine that has this functionality in parallel. Hence the threadprivate global variable was modified from within the threads.

This worked perfectly (at least I was thinking so and in fact I did not even notice that data was dynamically reloaded).

Now I added some further data processing after the parallel loop and surprisingly my code started to behave weird. I received this very strange access violation that I was asking about on stack overflow and debugged it for days. In the end I found out that everything was back to normal simply by avoiding any change of the threadprivate variable in the first parallel loop.

-> Hence the hopefully generic question:

Can a global, threadprivate modle variable that has a shape like this:

Type tGlobal
   ... other stuff

    Type(tEOP), allocatable, dimension(:)             :: EOP                  
   ... other stuff
End Type tGlobal

where EOP contains only scalar integers and double-precision reals, be modified, i.e. the array dimensions be changed, from a function called in parallel context?

jimdempseyatthecove · ‎03-24-2020

>>the variables aren't changed from within a parallel context,

Then they need not be threadprivate.

>>initialization of the variable is handled via the standard copyin clause

*** copyin is not used with declared threadprivate. Doing so makes a thread local stack copy of its own threadprivate data. IOW inside the parallel region the local stack copy of its own threadprivate data would be updated but not the threadprivate data itself.On exit from the parallel region, the updated local stack copy of its own threadprivate data would be lost.

>>For my specific application I was the first to call a routine that has this functionality in parallel. Hence the threadprivate global variable was modified from within the threads.

SOP is to read in the data from the file into a shared array. Either from outside the parallel region .OR. from the main thread (or single thread) with a !$OMP BARRIER following the read-in.

>>Can a global, threadprivate modle variable that has a shape like this:

An instance of a user type can be threadprivate. The contained allocatable array descriptors would be threadprivate as well.

*** if your master thread (or single thread) is reading the input data into its own thread private array, the other threads will not see it.

Jim Dempsey

jimdempseyatthecove · ‎03-24-2020

Additional information.

Likely the main thread's threadprivate data would be copied in to the other team members stack copy of their own threadprivate array.

This data would be used in the visible scope of the parallel region **** but not to the same named threadprivate arrays of called functions/subroutines. They would reference the declared threadprivate data and not the stack copy of their own threadprivate array of the call level of the !$OMP PARALLEL...

Jim Dempsey

Elarion245 · ‎03-24-2020

I mayby did too much of abstraction. The overall global type contains parts that are modified in parallel regions. Therefore they need to be threadprivate. I am now just avoiding to alter the internal EOP array. If I detect the need to reload data, I simply do not write it back to the threadprivate global variable but to a local one and continue the iteration with this value. Also I increases the timeframe for which data is loaded initially. Hence for this sub-problem I am not modifying the variables anymore, yet they need to be threadprivate.

The copyin is only used in another module which has the use-clause to load the module varible and then uses the copyin to initialize the sub-threads. The loading itself is handeled using a critical section.

As mentioned earlier, the program did its job up to the point I coded it. Then I added further subroutines for data analyzation and then it started to act strange. Sometimes it worked, sometimes it did not and crashed at a random allocate statement. So I had memory corruption going on - despite the fact that the module variable was threadprivate.

As you note though, once we go into details, things are easily either misunderstood or poorly desribed. I only noticed that whenever the threadprivate copy of the module variable is modified, some hinky things happen after the parallel loop -> also if this part is not within a parallel context anymore!

jimdempseyatthecove · ‎03-24-2020

I ran a test program, and interestingly, the behavior has changed, and may be a bug fix or feature enhancement. This is what I learned with tp as a threadprivate variable in a module:

!$omp parallel private(tp)
... A SHARABLE or THREADPRIVATE entity is not permitted in a PRIVATE, FIRSTPRIVATE, LASTPRIVATE, SHARED or REDUCTION clause. [TP]

!$omp parallel copyin(tp)
This copies into the threadprivate copy of the module variable tp

!$omp parallel private(i) ! i is not threadprivate, IOW local or dummy
{ here i is a private variable to each thread

Jim Dempsey

Elarion245 · ‎03-25-2020

Hi Him,

in my case the variable, call it tp, is defined in module Mod_TP.

The module containing the parallel loop is e.g. Mod_Work, which then has a

use Mod_TP, only: tp

The parallel loop then has a Default None clause and only the copyin(tp). Your first message is known to me, as a threadprivate variable, as the name states, has to be private to the thread. Anything else would not make sense.

This way the program compiled perfectly and it also worked until the loop was finished. But: Any allocate after the loop may have caused a sudden stop due to memory corruption due to changing the array dimensions (growing in my case) within the custom type tp.

Long story short: there is no compiler warning, and any test program should not stop after the loop but do further allocation/deallocation afterwards.

Since my program sometimes worked and sometimes not, I guess that one needs some "look" for the machine to run into the corrupted area.

I guess that the following might be happening:

In my situation some thread may change the memory dimension of the threadprivate variable. Probably this works fine for any thread except the master thread. If however the memory growing takes place for the master thread, then this has an impact on the memory layout after the parallel loop and causes memory corruption. Could this be the case?

Otherwise I do not see why this, even if it is bad style, should not be supported, cause if the value of the threadprivate variable after the parallel loop is guaranteed to me equal to the one before entering the parallel loop, then there should not be any memory corruption afterwards.

jimdempseyatthecove · ‎03-25-2020

>>Your first message is known to me, as a threadprivate variable, as the name states, has to be private to the thread. Anything else would not make sense.

The distinction I was trying to make is the distinction between:

integer :: tp
!$omp threadprivate(tp)

and

integer :: iThread
...
!$omp parallel private(iThread)

Within the parallel region, both tp and iThread are private with respect to each thread. IMHO only one is declared as threadprivate. The declared threadprivate variables have persistence between parallel regions whereas private(...) do not.

Maybe I am too picky about semantics?

Can you show the affected user defined type? (you can change names to obscure the purpose if you wish)

>>In my situation some thread may change the memory dimension of the threadprivate variable. Probably this works fine for any thread except the master thread. If however the memory growing takes place for the master thread, then this has an impact on the memory layout after the parallel loop and causes memory corruption. Could this be the case?

Supposition:

1) initial size determined for UDT enclosed array
2) parallel region to perform threadprivate allocations | end parallel region
3) loop
4) parallel region with copyin(yourTypeVar)
5) ... | master thread modifies size of its yourTypeVar%array | others do not
6) end parallel region
7) end loop

In the above case, the size of the master threads threadprivate copy of yourTypeVar%array was enlarged whereas the other threads threadprivate copy of yourTypeVar%array were not enlarged (or not to the same extent). On the next iteration of the loop, the source and target arrays did not have the same size. However the copyin(yourTypeVar%array) assumed (required) the arrays to be of the same size (or at least target >=)

It is not clear as to if Reallocate Left Hand Side rules apply to copyin.

The above is supposition.

In the OpenMP 5.0 spec

24 If the list item is a polymorphic variable with the ALLOCATABLE attribute, the behavior is
25 unspecified.
Fortran

I could not locate a reference to user defined type containing allocatable

Jim Dempsey

Elarion245 · ‎03-25-2020

Hi Jim,

I like your 7-step example. In fact it matches my initial question, except for the case that the original allocation (step 2) takes place in sequential context.

But you mention that on the next iteration the sizes are different, which puts up the question:

I thought that openmp uses a threadpool, so once the work of one thread is finished and a new iteration is assigned to the thread, there is no additional copy-in step. If it is, then the question is indeed of the Reallocate left hand rule then applies.

Also I have only been growing the array. Since the data is indexed by date, the growing does not shift any indices, even if the threadprivate globals persisted between thread iterations.

I would like to stress again, that I ran massive simulations on 36 threads over a day using this setup (without knowing that the array was sometimes enlarged in some threads). It always worked, however the code after the massive parallel loop contained no or maybe only 1-2 allocate statements. Now that I added another post-processing step I started to observe the memory corruptions. So for me it seems that simply by the observations, that it works while the parallel area is active, however once it exits, then there is corruption.

Regarding the overall type, this is it with different names and without comments:

Type tGlobal
    integer(i4)                                       :: a            
    Type(tEOP), allocatable, dimension(:)             :: EOP  !-> this is the allocatable array which got changed                
    Character(len=48)                                 :: b              
    Character(len=48)                                 :: c
    Character(len=48)                                 :: d
    Character(len=48)                                 :: e   
    integer(i4)                                       :: f            
    logical                                           :: g  
    logical                                           :: h 
    integer(i4)                                       :: i       
    logical                                           :: j = .False.
    logical                                           :: k
    logical                                           :: l
    logical                                           :: m
    logical                                           :: n
    logical                                           :: o
    logical                                           :: p
    real(dp)                                          :: q
    real(dp)                                          :: r
    integer(i4)                                       :: s
    integer(i4)                                       :: t
    real(dp)                                          :: u
    real(dp)                                          :: v
    integer(i4)                                       :: w
    integer(i4)                                       :: x
End Type tGlobal

jimdempseyatthecove · ‎03-25-2020

I wrote a test program:

!  ThreadPrivateGoodBad.f90 
module sample
type tp_t
    integer, allocatable :: a(:)
end type tp_t

type(tp_t) :: tp
!$omp threadprivate(tp)
end module sample
    
program ThreadPrivateGoodBad
    use sample
    use omp_lib
    
    implicit none

    integer :: initialSize, i
    
    call omp_set_num_threads(2)
    
    initialSize = 10
    
    ! preallocate workspace
    !$omp parallel
    allocate(tp%a(initialSize))
    !$omp end parallel
    
    do i=1,size(tp%a)
        tp%a(i) = i
    end do
    
    print *,"region without expansion"
    !$omp parallel copyin(tp)
    print *, omp_get_thread_num(), tp%a
    !$omp end parallel
    
    do i=1,size(tp%a)
        tp%a(i) = i*10
    end do
    
    print *,"region with expansion by master"
    !$omp parallel copyin(tp)
    print *, omp_get_thread_num(), tp%a
    !$omp master
    call bigger
    !$omp end master
    !$omp end parallel

    print *,"region after expansion by master"
    !$omp parallel copyin(tp)
    print *, omp_get_thread_num(), tp%a
    !$omp end parallel

end program ThreadPrivateGoodBad
    
subroutine bigger
    use sample
    integer, allocatable :: a(:)
    allocate(a(size(tp%a) + 10)) ! local a is larger
    a(1:size(tp%a)) = tp%a ! copy
    a(size(tp%a)+1:size(a)) = -1 ! mark remainder as unused
    ! local a should be automatically deallocated here
end subroutine bigger

Output:
 region without expansion
           0           1           2           3           4           5
           6           7           8           9          10
           1           1           2           3           4           5
           6           7           8           9          10
 region with expansion by master
           1          10          20          30          40          50
          60          70          80          90         100
           0          10          20          30          40          50
          60          70          80          90         100
 region after expansion by master
           1          10          20          30          40          50
          60          70          80          90         100
           0          10          20          30          40          50
          60          70          80          90         100

This illustrates that for user defined types containing allocatables that the allocatabel got enlarged as part of the copyin

Jim Demspey

Elarion245 · ‎03-25-2020

Your edit is interesting, but I am not sure how to relate it exactly. Let's assume that nothing is changed within a parallel region. Then I have no problem at all using the the tGlobal-type variable in a copyin clause. Everything works and the data within the threads is initialized using the master-thread's data.

From the way I read the 5.0 spec (I am using openmp 4.5 btw), this sounds like we cannot have any type containing an allocatable sub-component in a copyin clause, which would be shocking.

It looks though as if changing the array is undefined behavior. Technically speaking however, is there anything that prohobits this from being a feature request?

jimdempseyatthecove · ‎03-25-2020

In your case, you have a nested UDT.

what is your Type(tEOP)

Jim Dempsey

jimdempseyatthecove · ‎03-25-2020

>>this sounds like we cannot have any type containing an allocatable sub-component in a copyin clause

Spec stated polymorphic, UDT alone is not polymorphic

Jim

Elarion245 · ‎03-25-2020

Now we are typing at the same time. Great test program!

I see that you pre-allocated the space for each array copy. I did not do that (I thought it is only required if the main object is allocatable and not one of its sub-components. But maybe this was already the issue?

Your code demonstrates that the master-threads changes persist also after the parallel region. Interestingly your code does not result in an access violation. So I wonder if this is because I did not do the allocation of the array component in a parallel section before, or if there is another reason (maybe your code and the number of threads is too small to result in the issue to show up? or maybe one should define the array in tp to be of a custom type as well?)

You are right about the UDT (=user defined type) and the polymorphism obviously!

-> So your example works because it is not a UDT, but an integer array?

tEOP looks like this and contains the astrodynamics data relating to a certain day:

Type tEOP
    integer(i4)                               :: a               
    integer(i4)                               :: b                 
    integer(i4)                               :: c                
    integer(i4)                               :: d                  
    integer(i4)                               :: e                  
    real(dp)                                  :: f                   
    real(dp)                                  :: g                   
    real(dp)                                  :: h                 
    real(dp)                                  :: i                  
    real(dp)                                  :: j                 
    real(dp)                                  :: k             
    real(dp)                                  :: l                  
    real(dp)                                  :: m                  
    integer(i4)                               :: n                  
    real(dp)                                  :: o            
    real(dp)                                  :: p            
    real(dp)                                  :: q          
    real(dp)                                  :: r           
    real(dp)                                  :: s          
    real(dp)                                  :: t      
    real(dp)                                  :: u           
    real(dp)                                  :: v           
End Type tEOP

jimdempseyatthecove · ‎03-25-2020

This seems to work:

!  ThreadPrivateGoodBad.f90 
module sample
integer, parameter :: i4 = 4
integer, parameter :: dp = 8

Type tEOP
    integer(i4)                               :: a               
    integer(i4)                               :: b                 
    integer(i4)                               :: c                
    integer(i4)                               :: d                  
    integer(i4)                               :: e                  
    real(dp)                                  :: f                   
    real(dp)                                  :: g                   
    real(dp)                                  :: h                 
    real(dp)                                  :: i                  
    real(dp)                                  :: j                 
    real(dp)                                  :: k             
    real(dp)                                  :: l                  
    real(dp)                                  :: m                  
    integer(i4)                               :: n                  
    real(dp)                                  :: o            
    real(dp)                                  :: p            
    real(dp)                                  :: q          
    real(dp)                                  :: r           
    real(dp)                                  :: s          
    real(dp)                                  :: t      
    real(dp)                                  :: u           
    real(dp)                                  :: v           
End Type tEOP

Type tGlobal
    integer(i4)                                       :: a            
    Type(tEOP), allocatable, dimension(:)             :: EOP  !-> this is the allocatable array which got changed                
    Character(len=48)                                 :: b              
    Character(len=48)                                 :: c
    Character(len=48)                                 :: d
    Character(len=48)                                 :: e   
    integer(i4)                                       :: f            
    logical                                           :: g  
    logical                                           :: h 
    integer(i4)                                       :: i       
    logical                                           :: j = .False.
    logical                                           :: k
    logical                                           :: l
    logical                                           :: m
    logical                                           :: n
    logical                                           :: o
    logical                                           :: p
    real(dp)                                          :: q
    real(dp)                                          :: r
    integer(i4)                                       :: s
    integer(i4)                                       :: t
    real(dp)                                          :: u
    real(dp)                                          :: v
    integer(i4)                                       :: w
    integer(i4)                                       :: x
End Type tGlobal

type(tGlobal) :: tp
!$omp threadprivate(tp)
end module sample
    
program ThreadPrivateGoodBad
    use sample
    use omp_lib
    
    implicit none

    integer :: initialSize, i
    
    call omp_set_num_threads(2)
    
    initialSize = 10
    
    ! preallocate workspace
    !$omp parallel
    allocate(tp%EOP(initialSize))
    !$omp end parallel
    
    do i=1,size(tp%EOP)
        tp%EOP(i)%a = i
    end do
    
    print *,"region without expansion"
    !$omp parallel copyin(tp)
    print *, omp_get_thread_num(), tp%EOP(1)%a, size(tp%EOP)
    !$omp end parallel
    
    do i=1,size(tp%EOP)
        tp%EOP(i)%a = i*10
    end do
    
    print *,"region with expansion by master"
    !$omp parallel copyin(tp)
    print *, omp_get_thread_num(), tp%EOP(1)%a, size(tp%EOP)
    !$omp master
    call bigger
    !$omp end master
    !$omp end parallel

    print *,"region after expansion by master"
    !$omp parallel copyin(tp)
    print *, omp_get_thread_num(), tp%EOP(1)%a, size(tp%EOP)
    !$omp end parallel

end program ThreadPrivateGoodBad
    
subroutine bigger
    use sample
    Type(tEOP), allocatable :: EOP(:)
    allocate(EOP(size(tp%EOP) + 10)) ! local EOP is larger
    EOP(1:size(tp%EOP)) = tp%EOP
    EOP(size(tp%EOP)+1:size(EOP))%a = -1 ! mark remainder as unused
    tp%EOP = EOP ! copy with rlhs
    ! local EOP should be automatically deallocated here
end subroutine bigger

 region without expansion
           0           1          10
           1           1          10
 region with expansion by master
           1          10          10
           0          10          10
 region after expansion by master
           1          10          20
           0          10          20

Jim Dempsey

jimdempseyatthecove · ‎03-25-2020

If your array expansion is due to an array expression that creates a stack temporary array, the crash my be due to stack overflow. You can mitigate this with explicit programming (create your own temp on heap).

Jim Dempsey

Elarion245 · ‎03-25-2020

Jim, thanks a lot for your effort!

So, I think we can agree on:

- Your second test shows that it apparently works, yet it is undefined behavior, right? So I guess one cannot rely on this to work?

In my case this also worked. But when I put lots of data-postprocessing computations in, then any

allocatable could suddenly give me an access violation.

If you open up the stackoverflow-thread, you see that the memory corruption resulted in an arbitrary allocate-statement of a sequential (non-parallel) code to attempt to load OMP-based allocation code. The final crashes were always within frontend.cpp of OMP, which apparently uses TBB somehow, as this file belongs to TBB.

The EOP reloading code used to look like this (executed in a critical section):

call read_EOP(EOPfile=global%EOPfile, start_time=start_time, end_time=end_time,&
                      EOP_out=global%EOP)

So it would directly overwrite the array-component. Now I use a local copy of the EOP array for this any only write it back to the global variable if OMP_IN_PARALLEL() does return true.

jimdempseyatthecove · ‎03-25-2020

>>yet it is undefined behavior, right?

UDT behavior is defined, polymorphic UDT is undefined.

IOW because COPYIN does not have the optional argument of ",MOLD=xxxx" it knows not how to perform the copy. Should be such an option, then it could select the appropriate (MOLD) type from the polymorphic type. Your tGlobal type is not polymorphic, it should be fine to copyin.

RE: EOP reloading

I was not aware that your reloading came from file I/O.

Is your OPEN, READ(s), CLOSE within read_EOP protected by an !$omp critical section?

If not, then the file position pointer could get bunged up, and/or your OPEN may be using the same unit number for all threads, ...

Jim Dempsey

Elarion245 · ‎03-25-2020

Hi Jim,

as mentioned above, in parenthesis there is a named omp critical block around the read_EOP routine, so this should be fine.

I am a bit confused now. What do we know?:

1) The copyin of the type is fine

2) Your example shows that growing from within is possible and that if the master thread changes the dimensions of the array then they persist also after the loop

We do not know, if:

3) Does the change persist if any other thread changes the array dimensions?

4) Even though the loop execution looks fine, there could be memory corruption that has not been revealed?

jimdempseyatthecove · ‎03-25-2020

>>3) Does the change persist if any other thread changes the array dimensions?

No. The copyin will use the master threads copy and values.

you may be able to use associate( tp0 => tp) outside the parallel region such that all threads have access to master threads tp inside the parallel region.
But be careful about how you modify the access. Change could be done in a barrier within the parallel region.

>>4) Even though the loop execution looks fine, there could be memory corruption that has not been revealed?

Full runtime checks in Debug build indicate no corruption.

Jim Dempsey