OpenMP forrtl: severe (170): Program Exception - stack overflow

I am attempting to parallelize a serial code that contains several functions and subroutines, all of which were written to work only with their arguments (PURE). The intention is to share a big real*4 array, sized anywhere between 200MB and 4GB, across 4~20 threads for parallel execution. The current serial build is compiled to use the heap, so it has no issues.

Coming to the issue: when I run the executable compiled with /Qopenmp, I get the forrtl: severe (170): Program Exception - stack overflow error. In debug mode, on entering the first subroutine within the parallel region, the debugger breaks with this message:

Unhandled exception at 0x00007FF60CEC0C18 in test.exe: 0xC00000FD: Stack overflow (parameters: 0x0000000000000001, 0x000000B23C113000).

Exception thrown at 0x00007FF60CEC0C18 in test.exe: 0xC0000005: Access violation writing location 0x000000B23C110000.

If there is a handler for this exception, the program may be safely continued.

Compiler options used are - /nologo /debug:full /Od /warn:all /traceback /check:bounds /check:stack /libs:dll /threads /dbglibs /Qopenmp

Am I missing anything here?

Out of curiosity, I also tried OMP_SET_NUM_THREADS(1); it still throws the same error.

Accepted Solutions

>>share a big real*4 array sized any where between 200MB to 4GB across 4~20 threads for parallel execution

Shared arrays are passed by reference. If the array exists prior to the parallel region, then there should be no issue with sharing this array.

Note, if this array is used in an array expression that requires a temporary to be created, then you may run into stack capacity issues. When this is the case, it can be mitigated by:

      adding the option /heap-arrays
      changing very large automatic arrays into allocatable arrays
      changing the offending array expression(s) into explicit DO loop(s)
      setting OMP_STACKSIZE=nnnn[B|K|M|G|T](K), which applies to the additional threads created by OpenMP (not the main thread)
      setting KMP_STACKSIZE=nnnn[B|K|M|G|T](K), which applies to the additional threads created by OpenMP (not the main thread)

Don't go overboard with stack size.

While /heap-arrays may solve your problem, it may (will) also introduce additional overhead. You will tend to get the best performance with a combination of the other approaches (at the expense of a little more programming).
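
To illustrate two of the mitigations listed above (allocatable arrays instead of automatic arrays, and explicit DO loops instead of array expressions), here is a minimal sketch. The names scale_mod, scale_automatic and scale_allocatable are made up for this example and do not come from the original poster's code; the array size is kept small so that both variants run with a default stack.

```fortran
module scale_mod
  implicit none
contains
  ! Automatic array: work is sized by the dummy argument n and, without
  ! /heap-arrays, is normally placed on the calling thread's stack.
  subroutine scale_automatic(n, a)
    integer, intent(in)    :: n
    real,    intent(inout) :: a(n)
    real :: work(n)                 ! automatic array
    integer :: i
    do i = 1, n
      work(i) = 2.0 * a(i)
    end do
    do i = 1, n
      a(i) = work(i)
    end do
  end subroutine scale_automatic

  ! Allocatable array: work is allocated on the heap, so its size is not
  ! limited by the stack size of the thread that calls the routine.
  subroutine scale_allocatable(n, a)
    integer, intent(in)    :: n
    real,    intent(inout) :: a(n)
    real, allocatable :: work(:)
    integer :: i
    allocate(work(n))
    do i = 1, n                     ! explicit loop, no array temporary
      work(i) = 2.0 * a(i)
    end do
    do i = 1, n
      a(i) = work(i)
    end do
    ! work is deallocated automatically on return
  end subroutine scale_allocatable
end module scale_mod

program demo
  use scale_mod
  implicit none
  integer, parameter :: n = 1000    ! small on purpose (assumption for the demo)
  real :: a(n), b(n)
  a = 1.0
  b = 1.0
  call scale_automatic(n, a)
  call scale_allocatable(n, b)
  if (a(1) /= b(1) .or. a(n) /= b(n)) error stop "results differ"
  print *, "both variants agree"
end program demo
```

The allocatable version pays for an allocate on every call, which is the kind of overhead /heap-arrays also introduces; whether that matters depends on how often the routine is called.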

Jim Dempsey

I allocated the big array and declared it shared on entering the parallel region. Even the private variables are allocated before entering the parallel region.

I will try /heap-arrays and a larger thread stack size.

I'm not sure what you mean by offending array expressions. Could you explain that a little?

Fortran permits array expressions such as

ArrayOut = sqrt(ArrayIn1**2 + ArrayIn2**2 + ArrayIn3**2)

This will create 3 or 4 array temporaries, and without -heap-arrays the compiler will place these on the stack. When the temporaries are quite large, you will experience stack overflow. This statement can be replaced with a small loop that eliminates the array temporaries.

DO I=1,UBOUND(ArrayOut,1)
  ArrayOut(I) = sqrt(ArrayIn1(I)**2 + ArrayIn2(I)**2 + ArrayIn3(I)**2)
END DO
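
Since the goal of the thread is to run such loops in parallel, the explicit loop can also carry an OpenMP worksharing directive. The following is only a sketch under assumptions: the program name, the array length n and the fill values are invented for the example, and the arrays are allocatable so that they live on the heap.

```fortran
program magnitude
  implicit none
  integer, parameter :: n = 1000000   ! illustrative size (assumption)
  real, allocatable :: ArrayIn1(:), ArrayIn2(:), ArrayIn3(:), ArrayOut(:)
  integer :: i

  allocate(ArrayIn1(n), ArrayIn2(n), ArrayIn3(n), ArrayOut(n))
  ArrayIn1 = 1.0
  ArrayIn2 = 2.0
  ArrayIn3 = 2.0

  ! The arrays are shared by reference; only the loop index is private.
  ! Each iteration works on scalars, so no array temporary is needed.
  !$omp parallel do shared(ArrayIn1, ArrayIn2, ArrayIn3, ArrayOut) private(i)
  do i = 1, n
    ArrayOut(i) = sqrt(ArrayIn1(i)**2 + ArrayIn2(i)**2 + ArrayIn3(i)**2)
  end do
  !$omp end parallel do

  ! sqrt(1 + 4 + 4) = 3 for every element
  if (abs(ArrayOut(1) - 3.0) > 1.0e-6 .or. abs(ArrayOut(n) - 3.0) > 1.0e-6) then
    error stop "unexpected result"
  end if
  print *, "ArrayOut(1) =", ArrayOut(1)
end program magnitude
```

If the program is built without OpenMP enabled (for example without /Qopenmp), the directives are treated as comments and the loop simply runs serially.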

Jim Dempsey

Jim, it worked for me. All I missed was setting OMP_STACKSIZE at runtime. Many thanks to you. I'm not using any offending array expressions; in fact, I ensured that no temporary array was generated by switching on the runtime warnings.

It worked wonders for smaller arrays in the range of 200MB, even with 12 threads. To minimize the overhead I left the scheduling at its default. The speedup was almost equal to the number of threads involved. But for the larger 4GB array I again got a stack overflow error, even with only 2 threads. Is there any way to dynamically set the number of threads based on the available stack size per thread on Windows x64?

I would not reduce the number of threads based on array sizes. Generally, larger data benefits from more threads.

If you can isolate the excessive stack consumption to a specific subroutine, and you want stack allocation for small/medium array sizes and heap allocation for large/huge array sizes, then I suggest you adapt your code somewhat like this:

subroutine foo(N, A, B)
 implicit none
 integer :: N
 real :: A(N), B(N) ! dummies/no allocation
 ! useHeap: size threshold, assumed to be defined elsewhere (e.g. in a module)
 if(N > useHeap) then
    call foo_heap(N, A, B)
 else
    call foo_stack(N, A, B)
 endif
 contains
  subroutine foo_heap(N, A, B)
    implicit none
    integer :: N
    real :: A(N), B(N) ! dummies/no allocation
    real, allocatable :: work(:) ! heap
    allocate(work(N))
    call foo_either(N, A, B, work)
    ! work is deallocated automatically on return
  end subroutine foo_heap
  subroutine foo_stack(N, A, B)
    implicit none
    integer :: N
    real :: A(N), B(N) ! dummies/no allocation
    real :: work(N) ! stack
    call foo_either(N, A, B, work)
  end subroutine foo_stack
  subroutine foo_either(N, A, B, work)
    implicit none
    integer :: N
    real :: A(N), B(N), work(N) ! dummies/no allocation
    ! ... code here
  end subroutine foo_either
end subroutine foo

FWIW The original intent of /heap-arrays:nK was to do the equivalent to the above, however, it has been reported here that the nK feature doesn't work well.

Jim Dempsey

Alternative:

subroutine foo(N, A, B)
  implicit none
  integer :: N
  real :: A(N), B(N) ! dummies/no allocation
  real, SAVE, allocatable :: work(:) ! one persistent copy per thread
  !$omp threadprivate(work)
  if(allocated(work)) then
    if(N > size(work)) then ! grow only when the current buffer is too small
      deallocate(work)
      allocate(work(N))
    endif
  else
    allocate(work(N))
  endif
  ! ...
end subroutine foo

The choice would depend upon whether you need to reclaim the heap upon the return from foo.

**** CAUTION

Use work(1:N) instead of work alone.

Should you call foo with N less than size(work), then WORK alone refers to more data than you intend. In addition, when it is used on the left-hand side as WORK = ..., work may get reallocated, which defeats your optimization efforts.
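
The difference between work(1:N) and work alone can be seen in a small standalone sketch. The program name and the sizes 10 and 4 are arbitrary choices for the illustration, and the second assignment relies on Fortran 2003 reallocation-on-assignment being in effect (older Intel compilers needed an option such as /assume:realloc_lhs for that, so treat the behaviour as compiler dependent).

```fortran
program section_demo
  implicit none
  real, allocatable :: work(:)
  real :: small(4)

  small = 1.0
  allocate(work(10))
  work = 0.0              ! scalar assignment: no reallocation, size stays 10

  work(1:4) = small       ! array section: writes 4 elements, buffer untouched
  print *, "after work(1:4) = small, size(work) =", size(work)

  work = small            ! whole array, shapes differ: work may be reallocated
  print *, "after work = small,      size(work) =", size(work)
end program section_demo
```

With reallocation-on-assignment active, the first line should report a size of 10 and the second a size of 4, meaning the larger buffer has been thrown away and would have to be allocated again on the next large call.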

Jim Dempsey
