Hello Everybody!
When trying to run the following code
[plain]subroutine test_allocate(n1)
  implicit none
  integer, intent(in) :: n1
  ! ---
  real(8), allocatable :: data(:)
  integer :: i
  ! ---
  allocate(data(n1))
  data(:) = 0.0d0
  !$omp parallel do default(none) &
  !$omp& reduction(+:data)
  do i = 1, 1000
     !$omp critical
     write (*, *) i, allocated(data)
     !$omp end critical
     data(10) = data(10) + 1.0d0
  enddo
  deallocate(data)
endsubroutine test_allocate[/plain]
with e.g. call test_allocate(1000000) the code bails out with the error message
forrtl: severe (408): fort: (8): Attempt to fetch from allocatable variable DATA when it is not allocated
Also, allocated(data) returns .false. inside the loop. In this particular case I compiled with "-g -O0 -traceback -check bounds -check pointers -openmp" using ifort 11.0 on Linux. Without "-check pointers" I get a SEGFAULT instead, and changing the optimization flags does not change the problem. What helps is using automatic arrays instead of allocate; however, these require a lot of stack, which I want to avoid.
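For illustration, a minimal sketch of the automatic-array variant just mentioned (the subroutine name is hypothetical): data is sized by the dummy argument and lives on the stack, and each thread's reduction copy also ends up on its stack, which is what drives up the stack requirement.
[plain]! minimal sketch of the automatic-array workaround; no allocate/deallocate,
! but data and the per-thread reduction copies all live on (thread) stacks
subroutine test_allocate_auto(n1)
  implicit none
  integer, intent(in) :: n1
  real(8) :: data(n1)   ! automatic array
  integer :: i
  data(:) = 0.0d0
  !$omp parallel do default(none) &
  !$omp& reduction(+:data)
  do i = 1, 1000
     data(10) = data(10) + 1.0d0
  enddo
endsubroutine test_allocate_auto[/plain]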
In my opinion this is valid code according to the OpenMP 3.0 specs. The same error also occurs if I change the reduction clause to a simple private clause.
Has anyone experienced something like this? I would very much appreciate any help.
Thanks in advance,
Lars
You would want error checking in the allocate to see if it was successful.
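For example, a minimal sketch of such a check using the standard stat= specifier (the message and the decision to stop are just illustrative):
[plain]integer :: ierr
allocate(data(n1), stat=ierr)
if (ierr /= 0) then
   write (*, *) 'allocation of data failed, stat = ', ierr
   stop
endif[/plain]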
Then, it doesn't make sense to use an array element as a reduction variable. It's not a normal, well-tested idiom, and possibly wasn't even considered by the developers of the OpenMP standard or libraries. In addition to the transformations the compiler must make to implement this, it is likely to try to optimize it within the thread-local code.
Intel OpenMP is advertised as implementing OpenMP 3.0, but the OpenMP 2.5 implementation isn't finished to a useful extent. Not that this has a direct bearing on the present question.
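As an aside, a minimal sketch of how to check which OpenMP version a compiler claims to support, assuming omp_lib provides the openmp_version named constant as the spec requires (200505 corresponds to OpenMP 2.5, 200805 to OpenMP 3.0):
[plain]program show_omp_version
   use omp_lib
   implicit none
   write (*, *) 'openmp_version = ', openmp_version
end program show_omp_version[/plain]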
This is just a demonstration of the error. In the production implementation, the array index of data changes such that the same index might be accessed concurrently. I used a reduction on an array variable to avoid having to enclose the update in an !$omp atomic construct, which leads to code that is slower than the serial version. I would appreciate it if you knew an alternative to this approach, since it uses a lot of memory.
Lars
Quoting - Lars Pastewka
[plain]!$omp parallel
!$omp do reduction(+:sum)
do
   ....
enddo
!$omp atomic
data(10) = data(10) + sum
!$omp end parallel[/plain]
I'm not sure how your description matches the example, as to how the data array is shared, or whether sum needs more than the lastprivate flavor given to it by reduction. As Jim pointed out, the reduction variable would be zeroed, so you couldn't accumulate results from multiple reductions simply by your direct use of an array element.
A reduction loop has a sizeable overhead and often requires a larger loop count to be worthwhile, compared with simple OpenMP parallel loops.
Lars,
Variables listed in the reduction clause are implicitly private. In your case data is an array; well, actually, data is an array descriptor.
Variables in a reduction(+: ...) clause are implicitly set to 0.
You have some problems:
1) there is no copyin for copying in the contents of the array descriptor
2) even with copyin(data), would you expect the reduction clause to zero the array descriptor or zero the contents of the array?
3) if you expect the reduction clause to zero the contents of the array, then you also must expect each thread to have a local copy of array data (for purposes of reduction), and then each thread zeroes each element of its private copy of the array. (Where is the allocation for this copy of the array?) Do you really want number-of-threads temporary arrays (each 8,000,000 bytes or so)?
(Things are getting computationally ugly here if the code does what you expect.)
I suggest you rewrite the code such that arrays are not used for reduction. Use atomic (see the sketch below), or explicitly create your own separate copies of the arrays for loop-internal reductions (separate accumulators).
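For illustration, a minimal sketch of the atomic variant applied to the original example: data stays shared and each update of the contended element is protected individually (at the cost of serializing those updates):
[plain]!$omp parallel do default(none) shared(data)
do i = 1, 1000
   !$omp atomic
   data(10) = data(10) + 1.0d0
enddo[/plain]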
Could you post a more practical example showing what you want to do?
Jim Dempsey
Jim,
the code under consideration is a molecular dynamics code, and the subroutine computes some kind of interaction potential. The simplest example would be direct summation of Coulomb interactions, i.e. something like
[plain]subroutine coulomb(nat, r, q, phi, E, epot)
implicit none
integer, intent(in) :: nat
real(8), intent(in) :: r(3, nat)
real(8), intent(in) :: q(nat)
real(8), intent(inout) :: phi(nat)
real(8), intent(inout) :: E(3, nat)
real(8), intent(inout) :: epot
! --
integer :: i, j
real(8) :: abs_dr, dr(3), df(3), q_i
! ---
!$omp parallel do default(none) &
!$omp& shared(nat, q, r) &
!$omp& private(abs_dr, df, dr, j, q_i) &
!$omp& reduction(+:epot) reduction(+:phi) reduction(+:E)
do i = 1, nat-1
q_i = q(i)
do j = i+1, nat
dr(:) = r(:, i) - r(:, j)
abs_dr = sqrt(dot_product(dr(:), dr(:)))
df(:) = dr(:)/(abs_dr**3)
E(:, i) = E(:, i) + q(j)*df(:)
E(:, j) = E(:, j) - q_i*df(:)
phi(i) = phi(i) + q(j)/abs_dr
phi(j) = phi(j) + q_i/abs_dr
epot = epot + q_i*q(j)/abs_dr
enddo
enddo
endsubroutine coulomb[/plain]
where r are the particle positions, q the charges, phi the electrostatic potential, E the electric field, and epot the total potential energy.
phi, for example, is accessed as phi(i) and phi(j), leading to concurrent writes to the same memory location if the array were simply shared. In an example I just checked, if I use atomic constructs (i.e. !$omp atomic before phi(i) = phi(i) + ...), the execution time is 60 s versus 40 s for the version of the code using a reduction clause. This is with 4 threads and nat=13480, and it becomes much worse if the number of threads is increased.
Up to now I was happy with the code, but as nat increases I run into problems with the stack. This is the reason I tried to dynamically allocate the arrays in question. If you look at the OpenMP 3.0 specs, this is valid: a copy of the data object should be allocated for each thread and deallocated when the code exits the parallel region.
Lars
Quoting - Lars Pastewka
Where you do need atomic or reduction, in analogous situations with nested loops inside a parallel region, I accumulate into a private variable in the inner loop and put the atomic or reduction operation in the outer loop.
You may need schedule(guided) for better performance, but that could expose race errors more often.
[plain]!$omp parallel do private(epot_inner) reduction(+:epot)
do
   ..
   epot_inner = 0
   do j = ....
      epot_inner = epot_inner + ....
   enddo
   epot = epot + epot_inner
enddo[/plain]
Lars,
Something like this might work, and work better than your intended reduction (although it takes more coding effort):
[cpp]module mod_tls
  type TypeThreadContext
    SEQUENCE
    real(8), pointer :: phi(:), E(:,:)
  end type TypeThreadContext
  type(TypeThreadContext) :: tls
  COMMON /CONTEXT/ tls
  !$OMP THREADPRIVATE(/CONTEXT/)
contains
  subroutine init_phi_E(nat)
    integer, intent(in) :: nat
    if(associated(tls%phi)) then
      if(size(tls%phi) .lt. nat) then
        deallocate(tls%phi, tls%E)
        allocate(tls%phi(nat), tls%E(3,nat))
      endif
    else
      allocate(tls%phi(nat), tls%E(3,nat))
    endif
    tls%phi(1:nat) = 0.0_8
    tls%E(:,1:nat) = 0.0_8
  end subroutine init_phi_E

  subroutine init_tls
    nullify(tls%phi, tls%E)
  end subroutine init_tls

  subroutine fini_tls
    if(associated(tls%phi)) deallocate(tls%phi)
    if(associated(tls%E)) deallocate(tls%E)
    nullify(tls%phi, tls%E)
  end subroutine fini_tls
end module mod_tls

program YourProgram
  use mod_tls
  ! ... near top
  !$omp parallel
  call init_tls
  !$omp end parallel
  ! ... your application code here
  ! ... near bottom
  !$omp parallel
  call fini_tls
  !$omp end parallel
end program YourProgram

subroutine coulomb(nat, r, q, phi, E, epot)
  use mod_tls
  use omp_lib
  implicit none
  integer, intent(in) :: nat
  real(8), intent(in) :: r(3, nat)
  real(8), intent(in) :: q(nat)
  real(8), intent(inout) :: phi(nat)
  real(8), intent(inout) :: E(3, nat)
  real(8), intent(inout) :: epot
  ! --
  integer :: i, j, i_stride
  real(8) :: abs_dr, dr(3), df(3), q_i
  ! ---
  phi = 0.0_8
  E = 0.0_8
  ! with default(none) every shared array must be listed; the loop and
  ! work variables are private to each thread
  !$omp parallel default(none) &
  !$omp& shared(nat, q, r, phi, E, epot) &
  !$omp& private(abs_dr, df, dr, i, j, q_i)
  call init_phi_E(nat)
  !$omp do reduction(+:epot)
  do i = 1, nat-1
    q_i = q(i)
    do j = i+1, nat
      dr(:) = r(:, i) - r(:, j)
      abs_dr = sqrt(dot_product(dr(:), dr(:)))
      df(:) = dr(:)/(abs_dr**3)
      tls%E(:, i) = tls%E(:, i) + q(j)*df(:)
      tls%E(:, j) = tls%E(:, j) - q_i*df(:)
      tls%phi(i) = tls%phi(i) + q(j)/abs_dr
      tls%phi(j) = tls%phi(j) + q_i/abs_dr
      epot = epot + q_i*q(j)/abs_dr
    enddo
  enddo
  ! non-parallel do: each thread merges a different slice of its
  ! thread-local accumulators back into the shared arrays
  do i = 0, omp_get_num_threads()-1
    !$omp barrier
    do j = 1 + mod(i + omp_get_thread_num(), &
       &           omp_get_num_threads()), &
       &           nat, omp_get_num_threads()
      E(:,j) = E(:,j) + tls%E(:,j)
      phi(j) = phi(j) + tls%phi(j)
    enddo
  enddo
  !$omp end parallel
endsubroutine coulomb[/cpp]
The additional benefit is that the thread-local storage is allocated only when necessary and persists across calls (this reduces allocations to a minimum).
If you need to reclaim memory during a run, you can call fini_tls.
Jim Dempsey
Jim,
thanks for the advice, I'll try your code. Actually, I would have expected the compiler to automatically generate something like this from the reduction clause on an *allocatable* array (except, perhaps, that the arrays would be allocated and deallocated on each entry and exit). I still believe the error I reported above is a compiler bug.
Thanks,
Lars
Depending on the execution time of the parallel region, the allocation/deallocation time may be insignificant.
What is significant, though, is that high-frequency allocation/deallocation of large memory blocks tends to fragment memory. This results in your application growing in virtual memory footprint, thus consuming more memory resources and potentially incurring additional paging. It is one of those situations where the program runs fine during test/debug but sucks during production runs.
Jim Dempsey
Thank you very much, it works flawlessly now.
Lars
Quoting - Lars Pastewka
Lars,
For safety reasons I suggest you change
real(8) :: abs_dr, dr(3), df(3), q_i
to
real(8) :: abs_dr, q_i
real(8), automatic::dr(3), df(3)
Without AUTOMATIC, depending on option switches, dr(3) and df(3) may be given the SAVE attribute.
Some here prefer marking the subroutine as RECURSIVE.
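For illustration, a minimal sketch of the RECURSIVE alternative: the standard attribute forces non-SAVE locals such as dr and df onto the stack without the Intel-specific AUTOMATIC keyword.
[plain]recursive subroutine coulomb(nat, r, q, phi, E, epot)
   ! ... declarations and body as before ...
   real(8) :: abs_dr, q_i
   real(8) :: dr(3), df(3)   ! stack-allocated because the routine is RECURSIVE
   ! ...
endsubroutine coulomb[/plain]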
I will follow this post with a suggestion that will eke a little more performance out of the routine.
Jim Dempsey
The following code is a trade-off between using tls%phi and tls%E directly in the inner loop and recreating array descriptors in a call made in the outer loop. The trade-off may or may not yield better performance.
I have not compiled and run the code. Prior to going to the effort of testing the following code, compile the one you have now (with tls%phi and tls%E) with full optimizations enabled in a Debug configuration. Place a breakpoint inside the inner loop and see whether the optimizer was smart enough to pull the Thread Local Storage dereferencing out of the loop. If it was, then good for the IVF compiler team. If not, then try the following code:
[cpp]! =======================================
subroutine coulombThread(i, nat, r, q, phi, E, epot)
  use mod_tls
  use omp_lib
  implicit none
  integer, intent(in) :: i
  integer, intent(in) :: nat
  real(8), intent(in) :: r(3, nat)
  real(8), intent(in) :: q(nat)
  real(8), intent(inout) :: phi(nat)
  real(8), intent(inout) :: E(3, nat)
  real(8), intent(inout) :: epot
  ! --
  integer :: j, i_stride
  real(8) :: abs_dr, q_i
  real(8), automatic :: dr(3), df(3)
  ! ---
  ! phi, E, epot cleared prior to entry
  ! Note: what this subroutine refers to
  ! as phi and E internally are externally
  ! tls%phi and tls%E
  q_i = q(i)
  do j = i+1, nat
    dr(:) = r(:, i) - r(:, j)
    abs_dr = sqrt(dot_product(dr(:), dr(:)))
    df(:) = dr(:)/(abs_dr**3)
    E(:, i) = E(:, i) + q(j)*df(:)
    E(:, j) = E(:, j) - q_i*df(:)
    phi(i) = phi(i) + q(j)/abs_dr
    phi(j) = phi(j) + q_i/abs_dr
    epot = epot + q_i*q(j)/abs_dr
  enddo
endsubroutine coulombThread

subroutine coulomb(nat, r, q, phi, E, epot)
  use mod_tls
  use omp_lib
  implicit none
  integer, intent(in) :: nat
  real(8), intent(in) :: r(3, nat)
  real(8), intent(in) :: q(nat)
  real(8), intent(inout) :: phi(nat)
  real(8), intent(inout) :: E(3, nat)
  real(8), intent(inout) :: epot
  ! --
  integer :: i, j
  ! ---
  ! i and j must both be private: they are also used in the merge loop
  ! that follows the worksharing do
  !$omp parallel default(none) &
  !$omp& private(i, j) &
  !$omp& shared(nat, r, q, phi, E, epot)
  ! initialize (if necessary) Thread Local Storage
  ! copies of phi and E in tls%phi and tls%E
  ! tls%phi, tls%E cleared in init_phi_E
  call init_phi_E(nat)
  !$omp do &
  !$omp& reduction(+:epot)
  do i = 1, nat-1
    ! call coulombThread using tls%phi, tls%E
    ! removes some overhead in referencing arrays
    call coulombThread(i, nat, r, q, tls%phi, tls%E, epot)
  enddo
  ! non-parallel do
  ! run in parallel by each thread
  ! (performs reduction without requirement of atomics)
  do i = 0, omp_get_num_threads()-1
    !$omp barrier
    do j = 1 + mod(i + omp_get_thread_num(), &
       &           omp_get_num_threads()), &
       &           nat, omp_get_num_threads()
      E(:,j) = E(:,j) + tls%E(:,j)
      phi(j) = phi(j) + tls%phi(j)
    enddo
  enddo
  !$omp end parallel
endsubroutine coulomb[/cpp]
Jim Dempsey
