Intel® Fortran Compiler

OpenMP and private fields which have been dynamically allocated

Lars_Pastewka
Beginner
Hello Everybody!

When trying to run the following code

[plain]subroutine test_allocate(n1)
  implicit none
  integer, intent(in)   :: n1
  ! ---
  real(8), allocatable  :: data(:)
  integer               :: i
  ! ---
  allocate(data(n1))
  data(:)  = 0.0d0
  !$omp  parallel do default(none) &
  !$omp& reduction(+:data)
  do i = 1, 1000
     !$omp critical
     write (*, *)  i, allocated(data)
     !$omp end critical
     data(10)  = data(10) + 1.0d0
  enddo
  deallocate(data)
endsubroutine test_allocate
[/plain]

with e.g. call test_allocate(1000000) the code bails out with the error message

forrtl: severe (408): fort: (8): Attempt to fetch from allocatable variable DATA when it is not allocated

Also, allocated(data) returns .false. inside the parallel region. In this particular case I compiled with "-g -O0 -traceback -check bounds -check pointers -openmp" using ifort 11.0 on Linux. Without "-check pointers" I get a segfault instead, and changing the optimization flags does not change the problem. What helps is using automatic arrays instead of allocate; however, these require a lot of stack, which I want to avoid.
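For reference, the automatic-array variant that does work looks roughly like this (just a sketch; the name test_automatic is only for illustration):

[plain]subroutine test_automatic(n1)
  implicit none
  integer, intent(in)   :: n1
  ! ---
  real(8)               :: data(n1)   ! automatic array: lives on the stack
  integer               :: i
  ! ---
  data(:)  = 0.0d0
  !$omp  parallel do default(none) &
  !$omp& reduction(+:data)
  do i = 1, 1000
     data(10)  = data(10) + 1.0d0
  enddo
endsubroutine test_automatic
[/plain]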

In my opinion this is valid code under the OpenMP 3.0 specification. The same error also occurs if I replace the reduction clause with a plain private clause.

Has anyone experienced something like this? I would very much appreciate any help.

Thanks in advance,
Lars
TimP
Honored Contributor III
You would want error checking on the allocate (e.g. a stat= specifier) to see whether it succeeded.
Beyond that, it doesn't make much sense to use an array element as a reduction target. It is not a common, well-tested idiom, and may not even have been considered by the developers of the OpenMP standard or libraries. In addition to the transformations the compiler must make to implement it, it is also likely to try to optimize the access within the thread-local code.
Intel OpenMP is advertised as implementing OpenMP 3.0, but the OpenMP 2.5 implementation isn't finished to a useful extent. Not that this has a direct bearing on the present question.
Lars_Pastewka
Beginner
This is just a small demonstration of the error. In the production implementation, the array index into data changes from iteration to iteration, so the same index may be accessed concurrently by different threads. I used a reduction on an array variable to avoid wrapping every update in an !$omp atomic construct, which leads to code that is slower than the serial version. I would appreciate an alternative to this approach, since it uses a lot of memory.

Lars

TimP
Honored Contributor III
If applicable, I would try a scalar for the reduction variable, and use the atomic construct after the loop to add it to the array element.
!$omp parallel private(i, sum) shared(data)
sum = 0.0d0
!$omp do
do i = 1, n
   sum = sum + ...            ! each thread accumulates its own partial sum
enddo
!$omp end do nowait
!$omp atomic
data(10) = data(10) + sum     ! each thread adds its partial to the shared element once
!$omp end parallel

I'm not sure how your description matches the example, as to how the data array is shared, or whether sum needs more than the per-thread treatment it gets here. As Jim pointed out, a reduction variable is zeroed on entry, so you could not accumulate results across multiple reductions by using the shared array element directly.
A reduction loop has sizeable overhead and often needs a larger trip count to be worthwhile, compared with a simple OpenMP parallel loop.
IDZ_A_Intel
Employee

Lars,

Variables listed in the reduction clause are implicitly private. In your case data is an array, well actually, data is an array descriptor.

Variables in a reduction(+:...) clause are implicitly initialized to 0.

You have some problems:

1) there is no copyin for copying in the contents of the array descriptor
2) even with copyin(data) would you expect the reduction clause to zero the array descriptor or zero the contents of the array?
3) if you expect the reduction clause to zero the contents of the array, then you must also expect each thread to have a local copy of the array data (for purposes of the reduction), and each thread then zeroes each element of its private copy. (Where is the allocation for this copy of the array?) Do you really want number-of-threads temporary arrays (each 8,000,000 bytes or so)?

(things get computationally ugly here if the code does what you expect)

I suggest you rewrite the code so that arrays are not used for reduction. Use atomic, or explicitly create your own separate copies of the arrays for loop-internal reductions (separate accumulators), as sketched below.
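As an illustration, here is a minimal sketch of the separate-accumulator idea (the routine accumulate_phi and the idx/contrib arrays are placeholders, not your code): each thread allocates and zeroes its own copy of the array, accumulates into it, and the private copies are merged into the shared array inside a critical section.

[cpp]subroutine accumulate_phi(n, m, idx, contrib, phi)
    implicit none
    integer, intent(in)    :: n, m
    integer, intent(in)    :: idx(m)         ! target element of each contribution
    real(8), intent(in)    :: contrib(m)
    real(8), intent(inout) :: phi(n)
    ! ---
    real(8), pointer       :: phi_local(:)   ! per-thread accumulator
    integer                :: i
!$omp parallel private(phi_local, i) shared(n, m, idx, contrib, phi)
    allocate(phi_local(n))
    phi_local = 0.0d0
!$omp do
    do i = 1, m
        phi_local(idx(i)) = phi_local(idx(i)) + contrib(i)
    enddo
!$omp end do
!$omp critical
    phi = phi + phi_local                    ! merge this thread's copy into the shared array
!$omp end critical
    deallocate(phi_local)
!$omp end parallel
endsubroutine accumulate_phi[/cpp]

With one critical section per thread (rather than one atomic per update) the synchronization cost stays small.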

Could you post a more practical code example showing what you want to do?

Jim Dempsey
Lars_Pastewka
Beginner
Jim,

the code under consideration is a molecular dynamics code, and the subroutine computes some kind of interaction potential. The simplest example would be direct summation of Coulomb interactions, i.e. something like

[plain]subroutine coulomb(nat, r, q, phi, E, epot)
  implicit none
  integer, intent(in)    :: nat
  real(8), intent(in)    :: r(3, nat)
  real(8), intent(in)    :: q(nat)
  real(8), intent(inout) :: phi(nat)
  real(8), intent(inout) :: E(3, nat)
  real(8), intent(inout) :: epot
  ! --
  integer :: i, j
  real(8) :: abs_dr, dr(3), df(3), q_i
  ! ---
  !$omp  parallel do default(none) &
  !$omp& shared(nat, q, r) &
  !$omp& private(abs_dr, df, dr, j, q_i) &
  !$omp& reduction(+:epot) reduction(+:phi) reduction(+:E)
  do i = 1, nat-1
     q_i = q(i)
     do j = i+1, nat
        dr(:)   = r(:, i) - r(:, j)
        abs_dr  = sqrt(dot_product(dr(:), dr(:)))
        df(:)   = dr(:)/(abs_dr**3)
        E(:, i) = E(:, i) + q(j)*df(:)
        E(:, j) = E(:, j) - q_i*df(:)
        phi(i)  = phi(i) + q(j)/abs_dr
        phi(j)  = phi(j) + q_i/abs_dr
        epot    = epot + q_i*q(j)/abs_dr
     enddo
  enddo
endsubroutine coulomb[/plain]

where r are the particle positions, q the charges, phi the electrostatic potential, E the electric field and epot the total potential energy.

*phi*, for example, is accessed as phi(i) and phi(j), which leads to concurrent writes to the same memory location if the array were simply shared. In a case I just checked, using atomic constructs (i.e. !$omp atomic before phi(i) = phi(i) + ...) gives an execution time of 60 s versus 40 s for the version using the reduction clause. This is with 4 threads and nat = 13480, and the gap becomes much worse as the number of threads increases.
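For reference, the atomic variant protects every update individually; the inner-loop body then looks roughly like this (sketch only; k is an extra loop index, and phi and E are shared instead of being reductions):

[plain]do k = 1, 3
   !$omp atomic
   E(k, i) = E(k, i) + q(j)*df(k)
   !$omp atomic
   E(k, j) = E(k, j) - q_i*df(k)
enddo
!$omp atomic
phi(i) = phi(i) + q(j)/abs_dr
!$omp atomic
phi(j) = phi(j) + q_i/abs_dr
[/plain]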

Up to now I was happy with the code, but as *nat* increases I run into stack problems. That is why I tried to dynamically allocate the arrays in question. If you look at the OpenMP 3.0 specs, this is valid: a copy of the data object should be allocated for each thread and deallocated when the thread exits the parallel region.

Lars

TimP
Honored Contributor III
I don't think phi is correctly designated as a reduction, nor should it require atomic.
Where you do need atomic or a reduction, in analogous situations with nested loops inside a parallel region, I accumulate into a private variable in the inner loop and put the atomic or reduction operation in the outer loop.
You may need schedule(guided) for better performance, but that could expose race conditions more often.

! schedule(guided) could be added here to balance the triangular workload
!$omp parallel do private(j, epot_inner) reduction(+:epot)
do i = 1, nat-1
   ..
   epot_inner = 0.0d0
   do j = i+1, nat
      epot_inner = epot_inner + ....   ! accumulate into the private scalar
   enddo
   epot = epot + epot_inner            ! one update of the reduction variable per outer iteration
enddo
!$omp end parallel do
jimdempseyatthecove
Honored Contributor III

Lars,

Something like this might work, and work better than your intended reduction (although it requires more coding effort):
[cpp]module mod_tls
type    TypeThreadContext
    SEQUENCE
    real(8), pointer :: phi(:),E(:,:)
end type TypeThreadContext

type(TypeThreadContext) :: tls
COMMON /CONTEXT/ tls
!$OMP THREADPRIVATE(/CONTEXT/)

contains
subroutine init_phi_E(nat)
    integer, intent(in) :: nat
    if(associated(tls%phi)) then
        if(size(tls%phi) .lt. nat) then
            deallocate(tls%phi,tls%E)
            allocate(tls%phi(nat),tls%E(3,nat))
        endif
    else
        allocate(tls%phi(nat),tls%E(3,nat))
    endif
    tls%phi(1:nat) = 0.0_8
    tls%E(:,1:nat) = 0.0_8
end subroutine init_phi_E

subroutine init_tls
    nullify(tls%phi, tls%E)
end subroutine init_tls

subroutine fini_tls
    if(associated(tls%phi)) deallocate(tls%phi)
    if(associated(tls%E)) deallocate(tls%E)
    nullify(tls%phi, tls%E)
end subroutine fini_tls
end module mod_tls

program YourProgram
    use mod_tls
! ... near top 
!$omp parallel
    call init_tls
!$omp end parallel

! ... your application code here

! ... near bottom 
!$omp parallel
    call fini_tls
!$omp end parallel
end program YourProgram

subroutine coulomb(nat, r, q, phi, E, epot)
    use mod_tls
    use omp_lib
    implicit none
    integer, intent(in) :: nat
    real(8), intent(in) :: r(3, nat)
    real(8), intent(in) :: q(nat)
    real(8), intent(inout) :: phi(nat)
    real(8), intent(inout) :: E(3, nat)
    real(8), intent(inout) :: epot
    ! --
    integer  :: i, j, i_stride
    real(8) :: abs_dr, dr(3), df(3), q_i
    ! ---
    phi = 0.0_8
    E = 0.0_8
!$omp parallel default(none) &
!$omp& shared(nat, q, r, phi, E, epot) &
!$omp& private(abs_dr, df, dr, i, j, q_i)
    ! make sure this thread's tls%phi and tls%E exist and are zeroed
    call init_phi_E(nat)
!$omp do reduction(+:epot)
    do i = 1, nat-1
        q_i = q(i)
        do j = i+1, nat
            dr(:) = r(:, i) - r(:, j)
            abs_dr = sqrt(dot_product(dr(:), dr(:)))
            df(:) = dr(:)/(abs_dr**3)
            tls%E(:, i) = tls%E(:, i) + q(j)*df(:)
            tls%E(:, j) = tls%E(:, j) - q_i*df(:)
            tls%phi(i) = tls%phi(i) + q(j)/abs_dr
            tls%phi(j) = tls%phi(j) + q_i/abs_dr
            epot = epot + q_i*q(j)/abs_dr
        enddo
    enddo
    ! non-parallel do: every thread merges its private copy into the
    ! shared arrays, staggered by thread number so no two threads
    ! write the same element at the same time
    do i = 0, omp_get_num_threads()-1
!$omp barrier
        do j = 1 + mod(i + omp_get_thread_num(), &
             &         omp_get_num_threads()), &
             &         nat, omp_get_num_threads()
            E(:,j) = E(:,j) + tls%E(:,j)
            phi(j) = phi(j) + tls%phi(j)
        enddo
    enddo
!$omp end parallel
endsubroutine coulomb
[/cpp]

The additional benefit is that the thread-local storage is allocated only when necessary and persists across calls (reducing allocations to a minimum).

If you need to reclaim the memory during the run, you can call fini_tls.

Jim Dempsey



Lars_Pastewka
Beginner
Jim,

thanks for the advice, I'll try your code. Actually, I would have expected the compiler to generate something like this automatically for a reduction clause on an *allocatable* array (except that the arrays would presumably be allocated and deallocated at each entry to and exit from the parallel region). I still believe the error I reported above is a compiler bug.

Thanks,
Lars
jimdempseyatthecove
Honored Contributor III

Depending on the execution time of the parallel region, the allocation/deallocation time may be insignificant.

What is significant, though, is that high-frequency allocation/deallocation of large memory blocks tends to fragment memory. This makes your application's virtual memory footprint grow, consuming more memory resources and potentially incurring additional paging. It is one of those situations where the program runs fine during test/debug but performs poorly during production runs.

Jim Dempsey
Lars_Pastewka
Beginner
Thank you very much, it works flawlessly now.

Lars
jimdempseyatthecove
Honored Contributor III
Lars,

For safety reasons I suggest you change

real(8) :: abs_dr, dr(3), df(3), q_i

to

real(8) :: abs_dr, q_i
real(8), automatic::dr(3), df(3)

Without AUTOMATIC, depending on the compiler options, dr(3) and df(3) may be given the SAVE attribute (and thus be shared between threads).
Some here prefer marking the subroutine as RECURSIVE instead, as sketched below.
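For example, a minimal sketch of the RECURSIVE variant (example_kernel is just a placeholder name, not your routine):

[cpp]! RECURSIVE gives the locals per-invocation (stack) storage,
! so dr cannot silently become a SAVEd static shared by threads
recursive subroutine example_kernel(n, x, s)
    implicit none
    integer, intent(in)    :: n
    real(8), intent(in)    :: x(3, n)
    real(8), intent(inout) :: s
    real(8) :: dr(3)
    integer :: j
    do j = 1, n
        dr(:) = x(:, j)
        s = s + sqrt(dot_product(dr, dr))
    enddo
endsubroutine example_kernel[/cpp]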

I will follow this post with a suggestion that will eke a little more performance out of the routine.

Jim Dempsey


jimdempseyatthecove
Honored Contributor III

The following code trades direct use of tls%phi and tls%E in the inner loop against recreating array descriptors for a call made in the outer loop. The trade-off may or may not yield better performance.

I have not compiled or run this code. Before going to the effort of testing it, compile the version you have now (with tls%phi and tls%E) with full optimizations enabled in a Debug configuration. Place a breakpoint inside the inner loop and see whether the optimizer was smart enough to hoist the thread-local-storage dereferencing out of the loop. If it was, then good for the IVF compiler team. If not, then try the following code:

[cpp]! =======================================
subroutine coulombThread(i, nat, r, q, phi, E, epot)   
    use mod_tls   
    use omp_lib   
    implicit none   
    integer, intent(in) :: i   
    integer, intent(in) :: nat   
    real(8), intent(in) :: r(3, nat)   
    real(8), intent(in) :: q(nat)   
    real(8), intent(inout) :: phi(nat)   
    real(8), intent(inout) :: E(3, nat)   
    real(8), intent(inout) :: epot   
    ! --   
    integer  :: j, i_stride   
    real(8) :: abs_dr, q_i   
    real(8), automatic :: dr(3), df(3)   
    ! --- 
    ! phi, E, epot cleared prior to entry
    ! Note, what this subroutine refers to
    ! as phi and E internally are externally
    !  tls%phi and tls%E
    q_i  = q(i)   
    do j = i+1, nat   
        dr(:) = r(:, i) - r(:, j)   
        abs_dr = sqrt(dot_product(dr(:), dr(:)))   
        df(:) = dr(:)/(abs_dr**3)   
        E(:, i) = E(:, i) + q(j)*df(:)   
        E(:, j) = E(:, j) - q_i*df(:)   
        phi(i) = phi(i) + q(j)/abs_dr   
        phi(j) = phi(j) + q_i/abs_dr   
        epot = epot + q_i*q(j)/abs_dr   
    enddo   
endsubroutine coulombThread

subroutine coulomb(nat, r, q, phi, E, epot)   
    use mod_tls   
    use omp_lib   
    implicit none   
    integer, intent(in) :: nat   
    real(8), intent(in) :: r(3, nat)   
    real(8), intent(in) :: q(nat)   
    real(8), intent(inout) :: phi(nat)   
    real(8), intent(inout) :: E(3, nat)   
    real(8), intent(inout) :: epot   
    ! --   
    integer  :: i, j   
    ! --- 
!$omp parallel default(none) &   
!$omp& private(j) &   
!$omp& shared(nat, r, q, phi, E, epot)
    ! initialize (if necessary) Thread Local Storage
    ! copies of phi and E in tls%phi and tls%E   
    ! tls%phi, tls%E, cleared in init_phi_E
    call init_phi_E(nat)   
!$omp do &   
!$omp& reduction(+:epot)   
    do i = 1, nat-1
        ! call coulombThread using tls%phi, tls%E
        ! removes some overhead in referencing arrays
        call coulombThread(i, nat, r, q, tls%phi, tls%E, epot)   
    enddo   
    ! non-parallel do
    ! run in parallel by each thread
    ! (performs reduction without requirement of atomics)
    do i=0, omp_get_num_threads()-1   
!$omp barrier       
        do j = 1 + mod(i + omp_get_thread_num(), & 
                       & omp_get_num_threads()), & 
                       & nat, omp_get_num_threads()   
            E(:,j) = E(:,j) + tls%E(:,j)   
            phi(j) = phi(j) + tls%phi(j)   
        enddo   
    enddo   
!$omp end parallel   
endsubroutine coulomb
[/cpp]

Jim Dempsey