Intel® Fortran Compiler
OpenMP forrtl: severe (170): Program Exception - stack overflow

mohanmuthu
New Contributor I
4,562 Views

I am attempting to parallelize a serial code that contains several functions and subroutines. All of them were written to work only with their arguments (PURE). The intention is to share a big real*4 array, sized anywhere between 200 MB and 4 GB, across 4 to 20 threads for parallel execution. The current serial executable is compiled with heap, so it has no issues.

Coming to the issue: when I run the executable compiled with /Qopenmp, I get the forrtl: severe (170): Program Exception - stack overflow error. In debug mode, on entering the first subroutine within the parallel region, I see this error message when breaking:

Unhandled exception at 0x00007FF60CEC0C18 in test.exe: 0xC00000FD: Stack overflow (parameters: 0x0000000000000001, 0x000000B23C113000).

Exception thrown at 0x00007FF60CEC0C18 in test.exe: 0xC0000005: Access violation writing location 0x000000B23C110000.

If there is a handler for this exception, the program may be safely continued.

Compiler options used are - /nologo /debug:full /Od /warn:all /traceback /check:bounds /check:stack /libs:dll /threads /dbglibs /Qopenmp

Am I missing anything here?

Out of curiosity, I also tried OMP_SET_NUM_THREADS(1), but it still throws the same error.

0 Kudos
1 Solution
jimdempseyatthecove
Honored Contributor III
4,562 Views

>>share a big real*4 array sized any where between 200MB to 4GB across 4~20 threads for parallel execution

Shared arrays are passed by reference. If the array exists prior to the parallel region, then there should be no issue with sharing it.

Note, if this array is used in an array expression that requires a temporary to be created, then you may run into stack capacity issues. When this is the case, it can be mitigated by:

      adding the option /heap-arrays
      changing very large automatic arrays into allocatable arrays
      changing the offending array expression(s) into explicit DO loop(s)
      setting the environment variable OMP_STACKSIZE=nnnn[B|K|M|G|T] (default unit K); this applies to the additional threads created by OpenMP, not the main thread
      setting the environment variable KMP_STACKSIZE=nnnn[B|K|M|G|T] (default unit K); this applies to the additional threads created by OpenMP, not the main thread

Don't go overboard with stack size.

While /heap-arrays may solve your problem, it may (will) also introduce additional overhead. You will tend to get the best performance with a combination of the other approaches, at the expense of a little more programming.
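
As an illustration of the second and third items in the list above, here is a minimal sketch of a routine whose scratch array was changed from automatic to allocatable. The routine name smooth, the array tmp, and the three-point smoothing itself are hypothetical; they are not taken from the original poster's code.

```fortran
module smooth_mod
  implicit none
contains
  ! Hypothetical routine: three-point smoothing that needs a scratch copy of a.
  subroutine smooth(n, a)
    integer, intent(in)    :: n
    real,    intent(inout) :: a(n)
    ! Before: real :: tmp(n)   (automatic array, typically placed on the stack)
    real, allocatable :: tmp(:)   ! After: allocatable, placed on the heap
    integer :: i, istat

    allocate(tmp(n), stat=istat)
    if (istat /= 0) error stop 'smooth: could not allocate tmp'

    ! Explicit DO loops, so no hidden array temporaries are needed.
    do i = 1, n
      tmp(i) = a(i)
    end do
    do i = 2, n - 1
      a(i) = (tmp(i-1) + tmp(i) + tmp(i+1)) / 3.0
    end do
    ! tmp is deallocated automatically when the routine returns.
  end subroutine smooth
end module smooth_mod

program smooth_demo
  use smooth_mod, only: smooth
  implicit none
  real :: a(5)

  a = [0.0, 3.0, 0.0, 3.0, 0.0]
  call smooth(size(a), a)
  ! Self-check against the hand-computed result 0, 1, 2, 1, 0.
  if (any(abs(a - [0.0, 1.0, 2.0, 1.0, 0.0]) > 1.0e-6)) error stop 'smooth: wrong result'
  print '(a)', 'smooth: result matches'
end program smooth_demo
```

The stat= argument turns an allocation failure into a message at a known location instead of a stack overflow at routine entry.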

Jim Dempsey

0 Kudos
11 Replies
bern
Beginner
3,990 Views

Hello Jim,

I am running a Fortran program using Visual Studio 2019 in a Windows 10 PC. I can successfully build both debug and release executables. However, when I run the dataset, both executables complain about either stack overflow or access violation. When I run in debug mode, I get notifications of unhandled exceptions:

Unhandled exception at 0x00007FF757662F57 in myprog.exe: 0xC0000005: Access violation writing location 0x000000A8BEDFF000.

I read in some forums that one solution would be to increase the size of the stack and heap arrays, and I have done that by adding 1 MB (1048576 bytes). I have tried other solutions found on the internet, but I am unable to find a stable solution.

Another suggested solution was to install oneAPI Base Toolkit, which I've done.

I should mention that I know little about these issues, so any help will be appreciated.  

Bern

 

 

0 Kudos
mohanmuthu
New Contributor I
4,562 Views

I allocated the big array and declared it shared on entering the parallel region. Even the private variables are allocated before entering the parallel region.

I will try /heap and allocating stacks.

I'm not sure what offending array expressions are. Could you explain them a little?

0 Kudos
jimdempseyatthecove
Honored Contributor III
4,561 Views

Fortran permits array expressions such as:

ArrayOut = sqrt(ArrayIn1**2 + ArrayIn2**2 + ArrayIn3**2)

This will create 3 or 4 array temporaries, and without -heap-arrays the compiler will place them on the stack. When the temporaries are quite large, you will experience stack overflow. The statement can be replaced with a small loop that eliminates the array temporaries.

DO I = 1, SIZE(ArrayOut)
  ArrayOut(I) = sqrt(ArrayIn1(I)**2 + ArrayIn2(I)**2 + ArrayIn3(I)**2)
END DO
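
For anyone who wants to try this, here is a minimal self-contained sketch that runs both forms on small heap-allocated arrays and checks that they agree. The names in1, in2, in3, out_expr and out_loop, the size n, and the tolerance are all assumptions for illustration. The OpenMP directive is only active when compiling with /Qopenmp (or the equivalent); otherwise it is treated as a comment.

```fortran
program array_expr_vs_loop
  implicit none
  integer, parameter :: n = 1000          ! small on purpose; real cases are far larger
  real, allocatable :: in1(:), in2(:), in3(:), out_expr(:), out_loop(:)
  integer :: i

  allocate(in1(n), in2(n), in3(n), out_expr(n), out_loop(n))
  do i = 1, n
    in1(i) = real(i)
    in2(i) = 2.0 * real(i)
    in3(i) = 0.5 * real(i)
  end do

  ! Whole-array form: the compiler is free to create array temporaries here.
  out_expr = sqrt(in1**2 + in2**2 + in3**2)

  ! Explicit loop: element by element, no array temporaries required.
  ! The arrays are shared (passed by reference); only the loop index is private.
  !$omp parallel do private(i) shared(in1, in2, in3, out_loop)
  do i = 1, size(out_loop)
    out_loop(i) = sqrt(in1(i)**2 + in2(i)**2 + in3(i)**2)
  end do
  !$omp end parallel do

  ! Compare with a relative tolerance rather than assuming bit-identical results.
  if (maxval(abs(out_expr - out_loop) / out_loop) > 1.0e-5) error stop 'results differ'
  print '(a)', 'array expression and DO loop agree'
end program array_expr_vs_loop
```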

Jim Dempsey

0 Kudos
mohanmuthu
New Contributor I
4,561 Views

Jim, it worked for me. All I had missed was setting OMP_STACKSIZE at runtime. Many thanks to you. I'm not using any offending array expressions; in fact, I made sure that no temporary array was generated by switching on the runtime warnings.

It worked wonders for smaller arrays in the range of 200 MB, even with 12 threads. To minimize the overhead, I left the scheduling at its default. The speedup was almost equal to the number of threads involved. But for the larger array of 4 GB, I again got the stack overflow error, even with only 2 threads. Is there any way of dynamically setting the number of threads based on the available stack size per thread on Windows x64?
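
Since the fix above came down to OMP_STACKSIZE being set at runtime, a quick way to confirm that the setting actually reached the process is to print it at startup. This is only a sketch: the program name and the message wording are made up, and it reports the environment variable as the process sees it, not the stack size the runtime ends up using.

```fortran
program report_omp_stacksize
  !$ use omp_lib, only: omp_get_max_threads
  implicit none
  character(len=64) :: val
  integer :: length, stat, nthreads

  ! Standard Fortran 2003 intrinsic; stat is nonzero if the variable is missing.
  call get_environment_variable('OMP_STACKSIZE', val, length, stat)
  if (stat == 0 .and. length > 0) then
    print '(2a)', 'OMP_STACKSIZE = ', val(1:length)
  else
    print '(a)', 'OMP_STACKSIZE is not set; the runtime default applies'
  end if

  ! Lines starting with the !$ sentinel are compiled only when OpenMP is enabled.
  nthreads = 1
  !$ nthreads = omp_get_max_threads()
  print '(a,i0)', 'max threads = ', nthreads
end program report_omp_stacksize
```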

0 Kudos
jimdempseyatthecove
Honored Contributor III
4,561 Views

I would not reduce the number of threads based on array sizes. Generally, larger data benefits from more threads.

If you can isolate the excessive stack consumption to a specific subroutine, and you want stack allocation for small/medium array sizes but heap allocation for large/huge array sizes, then I suggest you adapt your code somewhat like this:

subroutine foo(N, A, B)
 implicit none
 integer :: N
 real :: A(N), B(N) ! dummies/no allocation
 ! useHeap: integer size threshold, assumed to be defined elsewhere (e.g. in a module)
 if(N > useHeap) then
    call foo_heap(N, A, B)
 else
    call foo_stack(N, A, B)
 endif
 contains
  subroutine foo_heap(N, A, B)
    implicit none
    integer :: N
    real :: A(N), B(N) ! dummies/no allocation
    real, allocatable :: work(:)
    allocate(work(N)) ! heap
    call foo_either(N, A, B, work)
  end subroutine foo_heap
  subroutine foo_stack(N, A, B)
    implicit none
    integer :: N
    real :: A(N), B(N) ! dummies/no allocation
    real :: work(N) ! stack
    call foo_either(N, A, B, work)
  end subroutine foo_stack
  subroutine foo_either(N, A, B, work)
    implicit none
    integer :: N
    real :: A(N), B(N), work(N) ! dummies/no allocation
    ... ! code here
  end subroutine foo_either
end subroutine foo

FWIW, the original intent of /heap-arrays:nK was to do the equivalent of the above; however, it has been reported here that the nK feature doesn't work well.

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
4,569 Views

Alternative:

subroutine foo(N, A, B)
  implicit none
  integer :: N
  real :: A(N), B(N) ! dummies/no allocation
  real, SAVE, allocatable :: work(:) ! persists between calls
  !$omp threadprivate(work)
  ! grow the per-thread buffer only when it is too small
  if(allocated(work)) then
    if(N > size(work)) then
      deallocate(work)
      allocate(work(N))
    endif
  else
    allocate(work(N))
  endif
  ...
end subroutine foo

The choice depends on whether you need to reclaim the heap on return from foo.

**** CAUTION

Use work(1:N) instead of work alone.

Should you call foo with N less than size(work), then WORK alone refers to more data than you intend. In addition, if you use it in an assignment of the form WORK = ..., work will get reallocated, which defeats your optimization efforts.
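
A small sketch of the caution above, using a hypothetical buffer of 10 elements and N = 4. Assigning to the section work(1:n) never changes the allocation. Whether the whole-array assignment at the end shrinks work depends on the compiler following Fortran 2003 reallocation-on-assignment semantics, so that size is printed without any assumption about its value.

```fortran
program work_section_demo
  implicit none
  real, allocatable :: work(:)
  integer :: n, i

  allocate(work(10))      ! buffer left over from an earlier, larger call
  work = 0.0              ! scalar assignment never reallocates
  n = 4                   ! the current call needs only 4 elements

  ! Section assignment: touches 4 elements, the allocation stays at 10.
  work(1:n) = [(real(i), i = 1, n)]
  if (size(work) /= 10) error stop 'section assignment changed the allocation'
  print '(a,i0)', 'after work(1:n) = ..., size(work) = ', size(work)

  ! work alone covers all 10 elements, work(1:n) only the 4 in use.
  print '(a,i0,a,i0)', 'size(work) = ', size(work), ', size(work(1:n)) = ', size(work(1:n))

  ! Whole-array assignment with a different shape: with Fortran 2003 semantics
  ! enabled, work is reallocated to 4 elements and the larger buffer is lost.
  work = [(real(i), i = 1, n)]
  print '(a,i0)', 'after work = ..., size(work) = ', size(work)
end program work_section_demo
```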

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
3,957 Views

Without having your program for analysis it is difficult to provide anything other than a best guess...

>>Unhandled exception at 0x00007FF757662F57

The program virtual address (code) above is located at ~140.7TB (terabyte). This is either:

a) an invalid user program address, or

b) a Windows O/S system address

 

>>Access violation writing location 0x000000A8BEDFF000

The data location in decimal is: 724,756,852,736 or 724.7 GB (gigabyte)

This may indicate either:

a) an errant address was used, or

b) a valid address was used (humongous allocation) .AND. the system page file size was exceeded.

Situation b) can be perplexing given that an allocation will succeed (virtual address taken) but the error only occurs later when the actual page file page is allocated upon first touch of the data .AND. (the page file has been exhausted .OR. a program limit has been reached). 

Does the size of the data seem appropriate?

Jim Dempsey

 

Is 724.7GB the expected 

 

 

0 Kudos
bern
Beginner
3,892 Views

Thank you, Jim!

 

The Fortran code I am running is a terrestrial ecosystem model, which is used by many colleagues. To me, it looks like a question of system address. A data allocation of 724.7 GB is not expected. I am including a screenshot of where the problem occurs. It seems to me that the solution lies in either a Visual Studio 2019 setting or a Windows O/S setting.

0 Kudos
jimdempseyatthecove
Honored Contributor III
3,860 Views

The error is occurring at entry to TGRAZ where it is allocating the local arrays. I suspect this is exceeding available stack space.

I suggest that you add IMPLICIT NONE, then define the (local) arrays with proper type, *** but make these allocatable.
Keep old source lines (DIMENSION...) as comments, then insert the ALLOCATE statements as necessary.

This will accomplish a few things:

1) Require you to declare types used by the procedure (avoids typing errors, either by you or by earlier developer).

2) Delays any runtime error (due to oversized allocations) so that you can a) note where the error occurs, and b) insert diagnostic code. (For example, without IMPLICIT NONE, a mistyped array dimension will be implicitly declared but left undefined.)
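
To make the suggestion concrete, here is a minimal sketch of the pattern. Apart from the TGRAZ name it is modelled on, everything is hypothetical: the routine tgraz_sketch, the array biomass, and the dimensions nrow and ncol stand in for whatever the real procedure declares.

```fortran
module tgraz_sketch_mod
  implicit none          ! every name must now be declared explicitly
contains
  subroutine tgraz_sketch(nrow, ncol, total)
    integer, intent(in)  :: nrow, ncol
    real,    intent(out) :: total
    ! DIMENSION BIOMASS(NROW,NCOL)      <- old declaration kept as a comment
    real, allocatable :: biomass(:,:)   ! replacement: allocated on the heap
    integer :: istat
    character(len=256) :: emsg

    allocate(biomass(nrow, ncol), stat=istat, errmsg=emsg)
    if (istat /= 0) then
      ! Diagnostic code: report the requested shape before stopping.
      print '(a,i0,a,i0)', 'tgraz_sketch: allocate failed for ', nrow, ' x ', ncol
      print '(a)', trim(emsg)
      error stop 1
    end if

    biomass = 1.0          ! placeholder for the real computation
    total = sum(biomass)
  end subroutine tgraz_sketch
end module tgraz_sketch_mod

program tgraz_demo
  use tgraz_sketch_mod, only: tgraz_sketch
  implicit none
  real :: total

  call tgraz_sketch(3, 4, total)
  if (abs(total - 12.0) > 1.0e-6) error stop 'unexpected total'
  print '(a)', 'tgraz_sketch: allocation and computation succeeded'
end program tgraz_demo
```

With this shape, an oversized or undefined dimension shows up as a failed ALLOCATE with the offending sizes printed, rather than as an access violation at entry to the routine.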

Jim Dempsey

0 Kudos
bern
Beginner
3,825 Views

Thank you, Jim. I will give it a try and let you know the results.

Bern

0 Kudos