Solved: openmp large private arrays crash

yenikaya__bayram · ‎04-01-2019

Hi,

The following code crashes even when using single thread. Compilation is done with -heap_array options. The problem is caused by v3 being private (if v3 is shared no crash). As code indicates the allocation of v3 is successful, the threaded part of the program crashes. This happens even when only one thread is running. Increasing OMP_STACKSIZE does not solve the problem, it seems that the global stacksize is exceeding the limit. Using unlimted stacksize does not crash, but end user does not want to use that option. Why is the program using global stacksize in the openmp loop? The same program without OpenMP directives does not crash.

What is the solution here? How can we force the private array to stay in heap? Or make threads to use thread stacksize and not global stacksize.

The PGI fortran compiler does not crash by the way, they have implemented a solution that works. What is ifort solution for this problem? Don't say unlimited stacksize. I don't know how big the arrays can get in advance.

program my_prog
implicit none
integer, parameter:: M = 1800000

call test_ifort_openmp(M);

end program my_prog

subroutine test_ifort_openmp(M)
implicit none
integer, intent(in) :: M

COMPLEX*8 :: U(M)
integer :: i
COMPLEX*8, allocatable :: v3(:)

!$ INTEGER omp_get_num_threads
!$ EXTERNAL omp_get_num_threads

allocate (v3(M));
U = 0.0;
v3 = 0.0;

print *, 'allocation successful'

!$ CALL omp_set_num_threads(1)

!$omp parallel do ordered private(v3) &
!$omp shared (U,M) &
!$omp default(none) &
!$omp schedule(dynamic)
DO i=1,10 !---------------------main loop
!$ print *, 'num threads', omp_get_num_threads()
v3 = exp(cmplx(0.0_4,1.0_4));
!$omp ordered
U = U + v3
!$omp end ordered
END DO !---------------------end main loop
!$omp end parallel do

print *,'done ordered summation'

end subroutine test_ifort_openmp

jimdempseyatthecove · ‎04-02-2019

program my_prog  
  implicit none   
  integer, parameter:: M = 1800000

  call test_ifort_openmp(M);
  
end program my_prog

subroutine test_ifort_openmp(M)
  implicit none   
  integer, intent(in) :: M
  
  COMPLEX*8 :: U(M)
  integer :: i
  COMPLEX*8, allocatable :: v3(:)

!$ INTEGER  omp_get_num_threads
!$ EXTERNAL omp_get_num_threads

!**** do later  allocate (v3(M));  
  U = 0.0;
 !**** do later v3 = 0.0;

  print *, 'allocation successful'

!$ CALL omp_set_num_threads(1)  

! **** start parallel region WITHOUT d
! **** change private(v3) to firstprivate(v3)...
! **** to copy in the unallocated array descriptor
!$omp parallel firstprivate(v3)            &
!$omp shared (U,M)                               &
!$omp default(none)
! **** do the allocation inside the parallel region
  allocate (v3(M))
  v3 = 0.0
! *** now the DO without parallel
!$omp do ordered                              &
!$omp schedule(dynamic)               
  DO i=1,10 !---------------------main loop
!$ print *, 'num threads', omp_get_num_threads()
     v3 = exp(cmplx(0.0_4,1.0_4));
!$omp ordered
     U = U + v3
!$omp end ordered
  END DO !---------------------end main loop
! *** omit the parallel
!$omp end do 
  deallocate(v3)
!*** now exit parallel
!$omp end parallel
  print *,'done ordered summation'

end subroutine test_ifort_openmp

Jim Dempsey

View solution in original post

Jeremie_V_ · ‎04-02-2019

Dear,

Regarding the stack (under Linux at least), the main thread uses the main stack (for which the size can be modified using the command ulimit) while the slave threads have their own thread stack. The size of these latter ones can be modified with the env variable OMP_STACKSIZE.

A solution to avoid the use of stacks is to manually allocate an array with the same number of rows as the array in the private statement and with the same number of columns as the number of threads. This array is then used in the loop instead of the private array.

Yours sincerely,

Jeremie

FortranFan · ‎04-02-2019

yenikaya, bayram wrote:
..
COMPLEX*8 :: U(M)
..

Do you really mean for an automatic array of such a big size on the stack, as opposed to a scalar, for a reduction result of your ordered summation?

yenikaya__bayram · ‎04-02-2019

This is not the real program, just a testcase to demonstrate the crash.

In real program the reduction is in an array, but each thread needs to use some large array as private for computation.

Can Intel compiler put the “large private arrays” used by each thread on the heap instead of stack? It seems that it is putting private arrays in stack, and that requires a large stacksize. Our end user do not like that.

I know I can allocate as much memory as needed for each thread, but that requires lots of code change. I have a code that works with PGI without any problems and crashes with Intel fortran. It looks like PGI fortran is putting private arrays on the heap, but intel is putting them on the stack. Is there a way to tell intel compiler to put them on the heap? -heap_arrays does not do it.

FortranFan · ‎04-02-2019

yenikaya, bayram wrote:
.. intel is putting them on the stack. Is there a way to tell intel compiler to put them on the heap? -heap_arrays does not do it.

Contact Intel support? https://supporttickets.intel.com/?lang=en-US

They might be able to explain why this automatic array is not allocated on the heap with -heap-arrays option, or look into any issues/bugs with the option.

jimdempseyatthecove · ‎04-02-2019

program my_prog  
  implicit none   
  integer, parameter:: M = 1800000

  call test_ifort_openmp(M);
  
end program my_prog

subroutine test_ifort_openmp(M)
  implicit none   
  integer, intent(in) :: M
  
  COMPLEX*8 :: U(M)
  integer :: i
  COMPLEX*8, allocatable :: v3(:)

!$ INTEGER  omp_get_num_threads
!$ EXTERNAL omp_get_num_threads

!**** do later  allocate (v3(M));  
  U = 0.0;
 !**** do later v3 = 0.0;

  print *, 'allocation successful'

!$ CALL omp_set_num_threads(1)  

! **** start parallel region WITHOUT d
! **** change private(v3) to firstprivate(v3)...
! **** to copy in the unallocated array descriptor
!$omp parallel firstprivate(v3)            &
!$omp shared (U,M)                               &
!$omp default(none)
! **** do the allocation inside the parallel region
  allocate (v3(M))
  v3 = 0.0
! *** now the DO without parallel
!$omp do ordered                              &
!$omp schedule(dynamic)               
  DO i=1,10 !---------------------main loop
!$ print *, 'num threads', omp_get_num_threads()
     v3 = exp(cmplx(0.0_4,1.0_4));
!$omp ordered
     U = U + v3
!$omp end ordered
  END DO !---------------------end main loop
! *** omit the parallel
!$omp end do 
  deallocate(v3)
!*** now exit parallel
!$omp end parallel
  print *,'done ordered summation'

end subroutine test_ifort_openmp

Jim Dempsey

jimdempseyatthecove · ‎04-02-2019

Or:

program my_prog  
  implicit none   
  integer, parameter:: M = 1800000

  call test_ifort_openmp(M);
  
end program my_prog

subroutine test_ifort_openmp(M)
  implicit none   
  integer, intent(in) :: M
  
  COMPLEX*8 :: U(M)
  integer :: i

!$ INTEGER  omp_get_num_threads
!$ EXTERNAL omp_get_num_threads
  U = 0.0;
  print *, 'allocation successful'

!$ CALL omp_set_num_threads(1)  
!$omp parallel
  call doWork()
!$omp end parallel
  print *,'done ordered summation'

contains

subroutine doWork()
  implicit none   
  integer :: i
  COMPLEX*8, allocatable :: v3(:)
  allocate (v3(M))
  v3 = 0.0
!$omp do ordered schedule(dynamic)               
  DO i=1,10 !---------------------main loop
!$ print *, 'num threads', omp_get_num_threads()
     v3 = exp(cmplx(0.0_4,1.0_4));
!$omp ordered
     U = U + v3
!$omp end ordered
  END DO !---------------------end main loop
!$omp end do 
  deallocate(v3)	! optional

end subroutine doWork
end subroutine test_ifort_openmp

Jim Dempsey

yenikaya__bayram · ‎04-02-2019

Thanks Jim, great idea. How can I do it if v3 was an input array? Defined like this.

COMPLEX*8, intent(inout) :: v3(M)

or automatic array defined like this

COMPLEX*8 :: v3(M)

jimdempseyatthecove · ‎04-03-2019

>>Thanks Jim, great idea. How can I do it if v3 was an input array?

This depends on how you access V3. IIF (If and only if) V3, and U for that matter are accesses as such above (each element in V3 updates corresponding elements in U and each element in V3 is independent of any other elements in V3...

program my_prog  
  implicit none   
  integer, parameter:: M = 1800000

  call test_ifort_openmp(M);
  
end program my_prog

subroutine test_ifort_openmp(M)
  use omp_lib            !**** add this
  implicit none   
  integer, intent(in) :: M
  
  COMPLEX*8 :: U(M)
  integer :: i
  COMPLEX*8, allocatable :: v3(:)

  allocate (v3(M));  ! *** size(U) == size(V3) .and. M large enough to slice
  U = 0.0;
  v3 = 0.0;

  print *, 'allocation successful'

!$ CALL omp_set_num_threads(1)  ! *** use more threads??

! **** start parallel region
!$omp parallel shared(v3)           &
!$omp shared (U,M)                  &
!$omp private(i,j)                  &
!$omp default(none)
  DO i=1,10 !---------------------main loop
! Note i is private, this loop contains omp do, and omp do has
! implicit barrier at end. Therefor each threads i will be in lock-step
!$omp do   &
!$omp schedule(static)  
    DO j=1,M             

     v3(J) = exp(cmplx(0.0_4,1.0_4));
     U(J) = U(J) + v3(J)
    END DO ! ------- end slice of V3 and U
!$omp end do 
  END DO !---------------------end main loop
!$omp end parallel
  print *,'done slice summation'
  deallocate(v3) ! optional with newer versions of Fortran

end subroutine test_ifort_openmp

Jim Dempsey

jimdempseyatthecove · ‎04-03-2019

IOW your initial way was to do batches in parallel with ordered accumulation (parallel ordered pipeline-esk)

The #9 method is to do batches in order and parallel slice each batch. Also note in the #9 case the main loop can reside within the parallel region thus reducing the number of parallel regions entry/exit. Your actual program may need additional programming considerations

   END DO ! ------- end slice of V3 and U
!$omp end do 
!$omp master
... write something to output
!$omp end master
  END DO !---------------------end main loop
!$omp end parallel

The selection of method greatly depends on size of M (and if internal algorithm index varying dependencies within V3, U, ...)

Jim Dempsey

yenikaya__bayram · ‎04-03-2019

Here is a more realistic case.
v3 is passed in because most times this code is ran single threaded,
we want to reuse the memory instead of allocate deallocate. 
v3 must be private when threaded. 
The following code crashes due to v3 being put on stack.





subroutine test_ifort_openmp4(M,k,H,U,v3)
  implicit none   
  integer, intent(in) :: M,k
  
  COMPLEX*8, intent(out) :: U(M)
  COMPLEX*8, intent(in)  :: H(M,k)
  COMPLEX*8, intent(out) :: v3(M)

  integer :: i
  U = 0.0;
  v3 = 0.0;

!$omp parallel do ordered private(v3)            &
!$omp shared (U,M,H,k)                           &
!$omp default(none)                              &
!$omp schedule(dynamic)               
  DO i=1,k !---------------------main loop
     call fft(H(:,i), v3);
!$omp ordered
     U = U + v3*conjg(v3);
!$omp end ordered
  END DO !---------------------end main loop
!$omp end parallel do 

end subroutine test_ifort_openmp4

jimdempseyatthecove · ‎04-03-2019

>>v3 must be private when threaded.

conflicts with

>>we want to reuse the memory instead of allocate deallocate.

>>U = U + v3*conjg(v3);

Depending on the version of the compiler .AND. optimization used/not used, the above may (or may not) be performed without creating a temporary array (on stack when -heap-arrays not used). This may aggregate your crash situation.

There is a runtime check option that will indicate if an array temporary is created.

If it is created, then you can manually bypass the array temporary by expressing that as a DO loop using V3(index)%RE and %IM...
... however this should not be necessary as the compiler optimization (now-a-days) should be able to avoid the array temporary and vectorize this code.

Consider this:

subroutine test_ifort_openmp4(M,k,H,U,v3)
  implicit none   
  integer, intent(in) :: M,k
  
  COMPLEX*8, intent(out) :: U(M)
  COMPLEX*8, intent(in)  :: H(M,k)
  COMPLEX*8, intent(out) :: v3(M)
!$OMP THREADPRIVATE(privateV3)
  COMPLEX*8, ALLOCATABLE, SAVE :: privateV3(:) ! each thread will have different copy
  integer :: i

! *** do NOT use: IF(.not.allocated(privateV3) .or. size(privateV3) .lt. M)
! *** Fortran is NOT C and will produce results of both sides of .or.
! Now allocate only when necessary
  if(.not.allocated(privateV3)) then
    ALLOCATE(privateV3(M))
  else
    if(size(privateV3) .lt. M) then
      DEALLOCATE(privateV3)
      ALLOCATE(privateV3(M))
    endif
  endif
  U = 0.0			! remove the ;
  privatev3(1:M) = 0.0D0	! privateV3 may be larger than M

!$omp parallel do ordered      &
!$omp shared(v3, privatev3)    &
!$omp shared (U,M,H,k)         &
!$omp default(none)            &
!$omp schedule(dynamic)               
  DO i=1,k !---------------------main loop
     call fft(H(:,i), privatev3);
!$omp ordered
     U = U + privatev3(1:M)*conjg(privatev3(1:M));
!$omp end ordered
     if(i .eq. k) V3 = privatev3(1:M)
  END DO !---------------------end main loop
!$omp end parallel do 
! *** Note, privateV3 hangs around for next call
! *** will show up as memory leak with valgrind (which you can ignore)
end subroutine test_ifort_openmp4

Jim Dempsey

yenikaya__bayram · ‎04-03-2019

Thanks Jim. This gives enough ideas to develop a workaround.

Memory reuse for v3 is for non-openmp applications.

jimdempseyatthecove · ‎04-04-2019

>>This may aggregate your crash situation

giggling.....

In some sense, in light of the program being parallel, aggregate may be a better choice of words than aggravate.

BTW the effort is not a workaround. Rather it is an effort to properly, and efficiently, parallelize this procedure.

Good luck in your efforts.

Jim Dempsey

yenikaya__bayram · ‎04-04-2019

Thanks. BTW, PGI compiler does this properly. "Put array on the heap" means put array on the heap. When heap array is enabled by PGI compiler, all local arrays and all arrays private to threads are put on the heap.

Intel compiler does not put local arrays private to each thread on the heap, when you use -heap_arrays. If they did that properly, we would not need to try so hard to find a solution.