Stack with !$OMP

John_Campbell · ‎06-29-2018

I am struggling to manage the stack !

I have come to the view that stack overflow errors should not exist; that it is laziness of the operating system not to provide a stack overflow extension. I don't expect to win on this one soon, so I need to understand the existing stack management approaches.

Often, the answer to stack overflow errors is to make the stack bigger, but to make it bigger you need to know how big the stack is initially.
Often I have found this to be a difficult question to answer. Am I stupid because I can’t find the right documentation?
Over the years using many different Fortran compilers, my best approach has always been to avoid the stack, so use ALLOCATE or (before F90) store/allocate arrays in COMMON.

Now for my new question: Are local / automatic arrays better than ALLOCATE arrays in !$OMP and can I show this difference?

The background to this question is that PRIVATE arrays on separate pages for each thread should produce fewer "cache coherence" problems. My previous approach was to ALLOCATE arrays (presumably on a single heap), which avoids the risk of stack overflow but is potentially more likely to suffer "cache coherence" problems.
To minimise potential stack overflow issues, I have also used ALLOCATE ( array(n,m,0:num_threads) ), with array as SHARED. Again, this could be expected to have more "cache coherence" problems (unless bytes*n*m is a multiple of the page size).
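
A minimal sketch of the shared-array approach described above, with one column per thread. It assumes coherence traffic is tracked per cache line (64 bytes is assumed here, which is typical but not guaranteed), so each column is padded to a whole number of lines. The names work, n_pad and line_bytes are hypothetical, and the padding does not control where the allocator places the start of the array, so the first and last lines of a column may still be shared with a neighbour.

```fortran
! Sketch only: names and sizes are hypothetical, not from the attached program.
! Build with, e.g.: gfortran -fopenmp shared_slab.f90
program shared_slab
   use omp_lib
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer, parameter :: line_bytes = 64      ! assumed cache-line size
   integer, parameter :: n = 100              ! values each thread needs
   integer :: n_pad, num_threads, id
   real(dp), allocatable :: work(:,:)

   num_threads = omp_get_max_threads()
   ! Round the column length up to a whole number of cache lines
   ! (8 bytes per double precision value is assumed).
   n_pad = ((n*8 + line_bytes - 1) / line_bytes) * line_bytes / 8
   allocate (work(n_pad, 0:num_threads-1))
   work = 0.0_dp

!$OMP PARALLEL DEFAULT(NONE) SHARED(work) PRIVATE(id)
   id = omp_get_thread_num()
   work(1:n, id) = real(id, dp)               ! each thread writes only its own column
!$OMP END PARALLEL

   write (*,*) 'padded column length =', n_pad
   write (*,*) 'sum of column 0      =', sum(work(1:n, 0))
   deallocate (work)
end program shared_slab
```

Because the thread index is the last dimension, each thread's data is contiguous in memory, which is what makes the padding meaningful.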

Recently I have been experimenting with use of "the stack" with !$OMP (in gFortran. Is iFort different ?). I was hoping to show that each thread had a separate stack, so that private variables and arrays would be on a separate "memory page" and not produce "cache coherence" problems.

How do I check this ?
I have tried to test this, as I have assumed there is a single heap where ALLOCATE arrays go, which can cause this problem.
I can use LOC to report the address of local automatic arrays or ALLOCATE arrays in subroutines called in a !$OMP region.
Unfortunately the test results indicate there is no difference. Performance is little different and the LOC addresses appear mixed between the threads, i.e. the results are not what I expected, so my assumptions or interpretations are wrong!
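
For readers who want to repeat this kind of check, here is a small sketch (not the attached program) that prints the address of an automatic array and an ALLOCATE array from each thread. It uses the standard C_LOC in place of the LOC extension; the module, routine and array names are hypothetical.

```fortran
! Sketch only: report_addresses and its array names are hypothetical.
! Build with, e.g.: gfortran -fopenmp show_addresses.f90
module addr_report
   use iso_c_binding
   use omp_lib
   implicit none
contains
   subroutine report_addresses(m)
      integer, intent(in) :: m
      real(c_double), target :: auto_array(m)               ! automatic array
      real(c_double), allocatable, target :: heap_array(:)  ! explicitly allocated
      integer(c_intptr_t) :: a_auto, a_heap

      allocate (heap_array(m))
      auto_array = 0.0_c_double
      heap_array = 0.0_c_double
      ! Convert the C addresses to integers so they can be printed and compared.
      a_auto = transfer(c_loc(auto_array), a_auto)
      a_heap = transfer(c_loc(heap_array), a_heap)
!$OMP CRITICAL
      write (*,'(a,i3,2(a,i0))') 'thread', omp_get_thread_num(), &
            '  automatic at ', a_auto, '  allocatable at ', a_heap
!$OMP END CRITICAL
      deallocate (heap_array)
   end subroutine report_addresses
end module addr_report

program show_addresses
   use addr_report
   implicit none
!$OMP PARALLEL
   call report_addresses(1000)
!$OMP END PARALLEL
end program show_addresses
```

Sorting the printed addresses by value shows whether each thread's arrays fall in a distinct region or are interleaved with those of other threads.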

Is it possible from Fortran code to find out where the stack is and how big it is (for each thread in a !$OMP region, if this is the case)?
Does anyone have any advice or suggest any links to documentation that may explain this problem ?

I have attached a program I ran using gFortran which provided the inconclusive results I discussed. I have included the .f90 code, the .log run report and .xlsx analysis which shows the order of memory addresses for arrays, for OpenMP runs for local or ALLOCATE arrays. The results appear to show that all threads are mixed in the same memory area and not segregated by thread ID.

Steve_Lionel · ‎06-30-2018

Some general observations - OpenMP experts (I am not!) will undoubtedly chime in.

Some operating systems do have automatic stack expansion. OpenVMS does, but Windows does not. Linux doesn't seem to either. On Windows, the stack size is set by a 32-bit value (even in 64-bit applications) in the EXE file and can't be changed dynamically.

In threaded applications, each thread has its own stack. Whether this space comes out of the regular stack or is dynamically allocated is an implementation detail of the specific threading package. Because threaded applications share the same address space, the stack can't be expanded once established.

I'll leave the "cache coherence" part of your question to others.

JVanB · ‎06-30-2018

There is always GetCurrentThreadStackLimits. I tried to create an example with openmp but for some reason the program would never spawn any new threads. So much for my first attempt at openmp.

EDIT: I was misspelling as !OMP$. Too much directive-enhanced compilation :) So now it works:

! omp_test.f90
! gfortran -fopenmp omp_test.f90 -oomp_test
module M
   use ISO_C_BINDING
   implicit none
   private
   public GetCurrentThreadStackLimits
   interface
      subroutine GetCurrentThreadStackLimits(LowLimit,HighLimit) &
         bind(C,name='GetCurrentThreadStackLimits')
         import
         implicit none
!GCC$ ATTRIBUTES STDCALL:: GetCurrentThreadStackLimits
!DEC$ ATTRIBUTES STDCALL:: GetCurrentThreadStackLimits
         type(C_PTR) LowLimit
         type(C_PTR) HighLimit
      end subroutine GetCurrentThreadStackLimits
   end interface
   type, public :: T
      type(C_PTR) :: Address = C_NULL_PTR
      type(C_PTR) :: LowLimit = C_NULL_PTR
      type(C_PTR) :: HighLimit = C_NULL_PTR
   end type T
   type(T), allocatable, public :: Tarray(:)
   public S
   contains
      recursive subroutine S(i)
         use omp_lib
         integer i
         integer, target :: j
         j = OMP_GET_THREAD_NUM()
         Tarray(j)%Address = C_LOC(j)
         call GetCurrentThreadStackLimits(Tarray(j)%LowLimit,Tarray(j)%HighLimit)
      end subroutine S
end module M

program P
   use M
   use ISO_C_BINDING
   use omp_lib
   implicit none
   integer, parameter :: N = 100
   integer i
   integer, parameter :: Nthread = 10

   allocate(Tarray(0:Nthread-1))
   call OMP_SET_NUM_THREADS(Nthread)
   call OMP_SET_DYNAMIC(.FALSE.)
!$OMP PARALLEL DO     &
!$OMP DEFAULT(NONE)   &
!$OMP private(i)      &
!$OMP SHARED(Tarray)
   do i = 1, N
      call S(i)
   end do
!$OMP END PARALLEL DO
   do i = 0, Nthread-1
      write(*,*) 'Thread = ', i
      write(*,*) 'Address = ', transfer(Tarray(i)%address,0_C_INTPTR_T)
      write(*,*) 'LowLimit = ', transfer(Tarray(i)%LowLimit,0_C_INTPTR_T)
      write(*,*) 'HighLimit = ', transfer(Tarray(i)%HighLimit,0_C_INTPTR_T)
   end do
end program P

Output with gfortran:

 Thread =            0
 Address =               2357948
 LowLimit =                262144
 HighLimit =               2359296
 Thread =            1
 Address =              32767468
 LowLimit =              30670848
 HighLimit =              32768000
 Thread =            2
 Address =              34864620
 LowLimit =              32768000
 HighLimit =              34865152
 Thread =            3
 Address =              36961772
 LowLimit =              34865152
 HighLimit =              36962304
 Thread =            4
 Address =              39058924
 LowLimit =              36962304
 HighLimit =              39059456
 Thread =            5
 Address =              41156076
 LowLimit =              39059456
 HighLimit =              41156608
 Thread =            6
 Address =              43253228
 LowLimit =              41156608
 HighLimit =              43253760
 Thread =            7
 Address =              45350380
 LowLimit =              43253760
 HighLimit =              45350912
 Thread =            8
 Address =              47447532
 LowLimit =              45350912
 HighLimit =              47448064
 Thread =            9
 Address =              49544684
 LowLimit =              47448064
 HighLimit =              49545216
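
The size of each stack follows from HighLimit minus LowLimit. As a worked check using the values for threads 0 and 1 copied from the output above (the program and variable names are only illustrative), both differences come to 2097152 bytes, i.e. 2 MB:

```fortran
! Sketch only: arithmetic on the limits printed above.
program stack_span
   use iso_fortran_env, only: int64
   implicit none
   ! Limits and address copied from the gfortran output (threads 0 and 1).
   integer(int64), parameter :: low0  = 262144_int64,   high0 = 2359296_int64
   integer(int64), parameter :: low1  = 30670848_int64, high1 = 32768000_int64
   integer(int64), parameter :: addr0 = 2357948_int64

   write (*,*) 'thread 0 stack bytes =', high0 - low0
   write (*,*) 'thread 1 stack bytes =', high1 - low1
   ! Distance of the local variable from the top of the thread 0 stack.
   write (*,*) 'thread 0 bytes below HighLimit =', high0 - addr0
end program stack_span
```

The third value shows that the local scalar sits close to HighLimit, which is consistent with a stack that grows downwards from the high address.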

@John Campbell: I found out that the reason gfortran is putting your arrays on the heap while mine is putting my scalar on the stack is that gfortran always allocates local arrays with non-constant bounds on the heap unless the switch -fstack-arrays is in effect. So use this switch and perhaps try out my interface to GetCurrentThreadStackLimits and I think your program will behave more as expected.

John_Campbell · ‎06-30-2018

Steve & Repeat Offender,

Thanks for your replies. You have helped a great deal and given me something to work on.

I shall now test if having private arrays on separate memory pages for each thread does have a measurable effect on performance.
There are many things that can affect performance, we have to learn which are the more significant !

TimP · ‎07-01-2018

A small addition to what was said above: when we first started working with OpenMP on Windows95, the thread stack allocations were only 1MB by default. Among the problems posed were cache evictions due to mapping conflicts among threads. A 2MB thread stack was sufficient to solve this problem, and that has remained the default for 32-bit Windows. For 64-bit OS on the same CPUs, the default has been 4MB from the beginning, as 2MB is so frequently insufficient, and 4MB doesn't pose a problem with many threads using excessive stack unless possibly when hyperthreads are used. Any useable value would keep the stacks in separate 4KB pages so we needn't worry about a TLB eviction affecting another thread. I didn't try to use your data to see if the stacks are page aligned, which might be a small advantage. On a NUMA system we depend on the OS placing the stacks in local memory. It seems unlikely that large memory usage by an application would prevent a new thread starting up with a local stack of any reasonable size.

Steve_Lionel · ‎07-01-2018

FWIW, RO's program builds and runs fine in ifort 18. In ifort, you can USE KERNEL32 to get the declaration for GetCurrentThreadStackLimits, but RO's version works too. Here's the 64-bit version on my PC:

D:\Projects\Console6>ifort /Qopenmp console6.f90
Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, 
Version 18.0.2.185 Build 20180210
Copyright (C) 1985-2018 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.14.26431.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:console6.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
console6.obj

D:\Projects\Console6>console6.exe
 Thread =            0
 Address =           691505787352
 LowLimit =           691504742400
 HighLimit =           691505790976
 Thread =            1
 Address =           691516012760
 LowLimit =           691511754752
 HighLimit =           691516014592
 Thread =            2
 Address =           691520272088
 LowLimit =           691516014592
 HighLimit =           691520274432
 Thread =            3
 Address =           691524532568
 LowLimit =           691520274432
 HighLimit =           691524534272
 Thread =            4
 Address =           691528791128
 LowLimit =           691524534272
 HighLimit =           691528794112
 Thread =            5
 Address =           691533052376
 LowLimit =           691528794112
 HighLimit =           691533053952
 Thread =            6
 Address =           691537311576
 LowLimit =           691533053952
 HighLimit =           691537313792
 Thread =            7
 Address =           691541571672
 LowLimit =           691537313792
 HighLimit =           691541573632
 Thread =            8
 Address =           691545831512
 LowLimit =           691541573632
 HighLimit =           691545833472
 Thread =            9
 Address =           691550091864
 LowLimit =           691545833472
 HighLimit =           691550093312

JVanB · ‎07-01-2018

Yeah, ifort doesn't seem to put local arrays with non-constant bounds on the heap by default as gfortran does, so my version that takes the address of an array (not posted) works by default in ifort.

But one thing I found out to my horror is that openmp relies on environmental variables to set the stack size for the threads it spawns. Of course there is no way to set up the stack size for the master thread, but there is only a MASTER directive, while to get only threads with known stack size one would need a NOMASTER directive. Apparently one has to recompile to change the master thread stack size. I suppose one could use CreateThread to start a new thread with known stack size (the program could inquire the stack size of openmp-spawned threads then use CreateThread to spawn a thread with this size) and then use that as the master thread, but it's shocking that openmp makes the programmer jump through hoops like that to avoid stack overflow. Do coarrays have the same issue?

Steve_Lionel · ‎07-01-2018

Coarrays don't have this issue - a coarray program is more like MPI where each "image" has its own storage. There's no "thread stack" involved. A typical coarray is dynamically allocated, though you can declare one to be a module or program variable like any other array. Unlike regular arrays, there is no concept of an "automatic" coarray.

John_Campbell · ‎07-01-2018

@Repeat Offender,

I can't get your example program to run. I am using 64-bit gFortran on Windows 7-64, but when building it does not find GetCurrentThreadStackLimits. Do you have any suggestions ?

I am also surprised/confused by the difficulty in changing the stack size for threads. If I am going to target local arrays on a unique stack, 4 MB is not enough, so how can we change the stack size for all threads, including the master? Using ALLOCATE on a shared heap appears more robust.

JVanB · ‎07-01-2018

It looks like you need Windows 8 for GetCurrentThreadStackLimits. Intel Fortran has KMP_SET_STACKSIZE_S to set the stack size of threads you will subsequently create, but it doesn't seem to be documented for gfortran. There is also the possibility of setting the OMP_STACKSIZE environment variable at run time via the setenv function:

interface
   function setenv(var_name,new_value,change_flag) bind(C,name='setenv')
      use ISO_C_BINDING
      implicit none
      integer(C_INT) setenv
      character(KIND=C_CHAR) var_name(*)
      character(KIND=C_CHAR) new_value(*)
      integer(C_INT), value :: change_flag
   end function setenv
end interface

But I just now typed that in... OK I'll test it... didn't work: MinGW wants putenv:

interface
   function putenv(string) bind(C,name='_putenv')
      use ISO_C_BINDING
      implicit none
      integer(C_INT) putenv
      character(KIND=C_CHAR) string(*)
   end function putenv
end interface

Compiled, linked, and ran, returning success, but had no effect on stack size! I was afraid of this: the environment variables seem to be cached at execution time and subsequent changes don't do anything. I also checked via GET_ENVIRONMENT_VARIABLE that OMP_STACKSIZE was set this way, and it was. So to change the master thread stack size you need to compile it in, e.g.

gfortran -fopenmp -fstack-arrays omp_test.f90 -oomp_test -Wl,--stack,4194304

Or change it later with EDITBIN /STACK (DUMPBIN only displays the value). To change the other thread stack sizes you need to modify the OMP_STACKSIZE variable before execution with, e.g.

set OMP_STACKSIZE=4 M
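
To confirm what the program actually sees at start-up, a short check with the standard GET_ENVIRONMENT_VARIABLE intrinsic can be placed before any OpenMP call. This is only a sketch (the program name and buffer length are arbitrary), and it reports the variable's text, not the stack size the runtime ends up using:

```fortran
! Sketch only: reports OMP_STACKSIZE as seen by the running program.
program check_stacksize
   implicit none
   character(len=64) :: val
   integer :: length, stat

   ! Query before any OpenMP routine or parallel region is reached.
   call get_environment_variable('OMP_STACKSIZE', val, length, stat)
   if (stat == 0) then
      write (*,*) 'OMP_STACKSIZE = ', val(1:length)
   else if (stat == 1) then
      write (*,*) 'OMP_STACKSIZE is not set; the runtime default applies'
   else
      write (*,*) 'could not read OMP_STACKSIZE, status = ', stat
   end if
end program check_stacksize
```

The actual per-thread sizes still have to be measured inside the parallel region, for example with the GetCurrentThreadStackLimits interface shown earlier.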

John_Campbell · ‎07-01-2018

Repeat Offender,

Thanks very much for this info. I have been able to use the options you provided and create an 8 MB stack for all threads, using mingw gFortran Ver 7.3.0 on Windows 10, or at least I am reporting this stack size for all threads using your GetCurrentThreadStackLimits approach.

My test program reports running twice as fast using automatic arrays with -fstack-arrays as it does using ALLOCATE arrays, which could be promising. It is a program with trivial load on each thread. I also generated a linker symbol map which has lots of info to check. (It may be worth making the stack for the master thread larger.)

The commands I used were:
set OMP_STACKSIZE=8 M
gfortran -v -g -fopenmp -fstack-arrays %1.f90 -o %1.exe -Wl,-stack,8388608,-Map=%1.map >> %1.tce 2>&1

The next stage will be to test if this scales up to a real sized problem. I will update this thread when I have results.

It would be worth providing the iFort equivalent of this approach so a useable example is made available of changing and reporting the stack size for all threads in a !$OMP parallel region. I have attached my updated test using your reporting routines.

Thanks again to you, Steve and Tim for your help with this problem.

JVanB · ‎07-02-2018

@John Campbell: good to see that you are moving forward. Hopefully you are using some switch such as -O2 when compiling with gfortran as that compiler doesn't optimize by default the way ifort does when invoked at the command line.

As Steve pointed out my example was designed to be transportable at least between ifort and gfortran. On Linux the ifort syntax for increasing the stack size of the master thread is similar to gfortran which makes sense because it's sort of gcc-compatible rather than MSVC++ compatible on that platform. On Windows I had success with

ifort /Qopenmp omp_test.f90 /link /STACK:8388608

Setting the environmental variable OMP_STACKSIZE to the desired value seems to be the thing to do on all 3 platforms to give non-master threads the desired stack size.

Steve_Lionel · ‎07-02-2018

OMP_STACKSIZE is now the recommended variable - KMP_STACKSIZE is an artifact of a time before there was an OMP variable for it. (Trivia: The K in KMP stands for Kuck, as in Kuck and Associates (KAI), which had a strong business in an OpenMP Fortran-to-Fortran translator called GUIDE. DVF users may remember this. Intel bought KAI in the late 1990s; we at DEC wanted to, but it didn't go through.)

I think you have to set this variable either before the run or before your first parallel region.

JVanB · ‎07-02-2018

More experimentation with

i = putenv('OMP_STACKSIZE=8 M'//achar(0))

as in Quote #10 leads me to the conclusion that the environmental variable has to be changed not only before the first parallel region but even before the first call to an OMP_* subroutine, both in gfortran and ifort. Also gfortran doesn't seem to permit the spawned threads to have bigger stacks than the master thread, although ifort doesn't seem to have this restriction.

WHARR5 · ‎07-02-2018

I know I am getting myself way out on a limb by entering this discussion, but here goes.

I can only tell you of my own experience. Then you can decide for yourself whether or not you want to use it. (I use an Intel Fortran compiler with OMP implemented.)

That Stack Overflow (SO) used to bug me too. As I would develop my programs and execute them from the beginning, ordinarily things would work just fine. It was when I started adding larger and larger arrays (2D, 3D, 4D etc) that SO would jump up and bite me.

By a very careful reading (and STUDY) of my ‘old’ Compaq Fortran Language Reference Manual I realized that all I had to do was add the STATIC attribute to the declaration statement for those large arrays: INTEGER,STATIC::A(1000,200,3,8).

When the product of my dimensions exceeded something around 100K, I would get that old SO error, in which case that particular array would then get the STATIC attribute in its declaration.

That’s it. No fuss, no bother, no problem. I’m sure some of the experts will take exception to this too-simple fix, but that’s OK. It works for me, and that’s all that counts in my book!

Regards,

Bill

Steve_Lionel · ‎07-02-2018

STATIC is not a standard feature. I would recommend using allocatable arrays for large arrays.

John_Campbell · ‎07-07-2018

I have now done some preliminary tests, replacing ALLOCATE (arrays) with automatic arrays, modifying the stack size to 8 MB and using -fstack-arrays. For my real FE calculations, there were 7 vectors (58000) being defined by 10 or 11 threads on an i7-8700K processor.

The ALLOCATE run took 59m:27sec (on 27_Jun), while my first attempt at separate stack arrays took 62m:08sec. I modified the compile options, reduced each stack size from 16 MB to 8 MB and excluded virus scanning; it took 59m:47sec. Hoping I have not made mistakes in my testing, I would conclude that converting from heap arrays to stack arrays has had an insignificant effect on performance. Perhaps arrays of 0.5 MB are too big to generate significant cache coherence issues. Certainly heap vs stack shows no difference.

Fortunately, from this I have learnt more about controlling the stack size. I did find that for my particular configuration, use of 'set OMP_STACKSIZE=' was not necessary if the stack was set when linking. Again I should warn it is difficult to extrapolate any of these results to other configurations.

Based on this I would recommend to use heap, solely as it is a more robust approach : no stack overflow errors !
(My best improvement over the last 12 months of development has been upgrading to a new processor with faster memory and more threads)

Thanks again for the assistance.

WHARR5 · ‎07-08-2018

Hello John,

I suspect this info is ‘old hat’ by now, and that you have moved on to more enlightened constructs, but this is the first chance I have had to run your initial program. At my present location I am operating with an old Monarch PC with a 1-core (2 threads) CPU in MS XP pro, with Intel Fortran, Visual Studio, and OMP.

The only change I made to your program was to comment out your DEFAULT clause in your parallel construct. My implementation of OMP did not like that at all. I won’t go into the error statements, just say that was the only change.

When I ran your program sequentially it ran fine, with an execution time of less than a minute. Every one of the (4 x 8 =) 32 ‘mythread’ scalars produced a value of 10.

When I ran your program with OMP generating parallel code, typical 1st two groups of 8 ‘mythread’ outputs each were: (0 1 1 0 1 0 1 0) & (0 1 0 1 0 1 0 1), where the Master thread’s number is 0, and the slave is 1. In successive runs the distributions might change slightly, but each of my two threads always took on 4 of the 8 tasks in each of your first 2 groups.

In the remaining two groups (i.e. Local arrays in stack & Allocate arrays in shared heap), all 16 ‘mythread’ outputs were 0.

When I added your parallel region structure to your 2nd DO Loop, then all 4 of your groups produced similar ( 0,1) ‘mythread’ outputs using my two (2) threads.

Each OMP run took only seconds, some of which was taken up by ‘write’ statements. All of your other output numbers remained stable without change.

Regards,

Bill

TimP · ‎07-09-2018

If a performance difference between stack and heap is seen for large arrays, it might be worth while to profile to find out why.

The time spent in allocation for a large array ought to be relatively small, and there should be no large penalty for using ALLOCATE with error processing, so that you know easily where a possible failure occurs.
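
As a minimal sketch of ALLOCATE with error processing (the program and variable names are hypothetical, and the vector length is simply the 58000 mentioned earlier in the thread), the STAT= and ERRMSG= specifiers let the program report where an allocation failed instead of aborting:

```fortran
! Sketch only: checked allocation of one large work vector.
program alloc_checked
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   real(dp), allocatable :: vec(:)
   integer :: n, ierr
   character(len=200) :: msg

   n = 58000                       ! vector length quoted earlier in the thread
   msg = ' '
   allocate (vec(n), stat=ierr, errmsg=msg)
   if (ierr /= 0) then
      ! With STAT= present a failure does not stop the program,
      ! so the failing allocation can be named here.
      write (*,*) 'allocation of vec failed: ', trim(msg)
      stop 1
   end if
   vec = 0.0_dp
   write (*,*) 'allocated ', size(vec), ' values'
   deallocate (vec)
end program alloc_checked
```

The same pattern works inside a routine called from a parallel region, where each thread allocates its own local array.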

John_Campbell · ‎07-09-2018

Tim,

My initial test example (evalFC2.f90), which uses small dimension(4) arrays, appears to show a difference between stack and heap performance. This example could be due to cache coherence issues. However, my main use of !$OMP is with large arrays and this shows no difference. For me there is no big change, so when using large arrays use ALLOCATE.

John_Campbell · ‎04-12-2020

I have come back to this problem of where do PRIVATE copies of arrays go : shared heap or thread stack ?

I have written a program to test a number of different array types. The array types are:

! Fixed size arrays declared in a MODULE mod_array(mm)
! Fixed size arrays declared in a COMMON com_array(mm)
! Local arrays declared in OMP routine local_array(mm)
! Automatic arrays declared in OMP routine auto_array(m)
! routine argument dummy_array(m) that was a local array on the master stack
! routine argument alloc_array(m) that was allocated on the heap
! Allocatable arrays declared in OMP routine array_a(:) and array_b(:)
! array_a has been previously allocated before OMP region
! array_b is allocated in OMP region
!
! integer, parameter :: mm = 10*1024-4 ! would work better if mm = x * 1024 - 16 for 16 byte gap on Heap
! integer :: m ! is a routine argument
!
! Key results for gFortran 8.3
! most OMP PRIVATE arrays are allocated on the thread stack.
! this includes thread=0, so duplicate copies of thread 0 arrays are generated.
! only private arrays with allocatable status are placed on the heap
! while automatic arrays are placed on the heap, their private copies are placed on the stack.
! private arrays placed on the heap are separated by 16 bytes.
! this included private arrays from different threads, being separated by only 16 bytes.
! It would be better if arrays from different threads were placed on a new memory page.

I have tested this attached program on gFortran with some interesting results and would be interested to find out how ifort responds and what others think of my conclusions listed above. The key conclusions that may also apply to iFort are:

Master thread 0 private arrays are duplicated on the master stack, which can be a problem for stack overflow.
Only arrays that are identified as ALLOCATABLE in the !$OMP routine have their PRIVATE copies placed on the HEAP. (Allocated arrays supplied as routine arguments do not)
Private heap arrays for different threads can share the same memory page. A useful option could be when adding a private array to the heap that is for a different thread from the previous array, to start on a new memory page.
Management of AUTOMATIC arrays (based on size) between the stack and heap is done poorly, although this would require allocation at run time. It is only a memory address !
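
Whether two heap arrays share a memory page can be checked by dividing each address by the page size. The sketch below assumes a 4096-byte page, which is common but should be confirmed for the target system, and the names are hypothetical. It only shows the calculation for two small arrays on one thread; the same division can be applied to addresses reported from different threads.

```fortran
! Sketch only: page_bytes is an assumption, not queried from the OS.
program same_page
   use iso_c_binding
   implicit none
   integer(c_intptr_t), parameter :: page_bytes = 4096   ! assumed page size
   real(c_double), allocatable, target :: a(:), b(:)
   integer(c_intptr_t) :: addr_a, addr_b

   allocate (a(4), b(4))
   a = 0.0_c_double
   b = 0.0_c_double
   addr_a = transfer(c_loc(a), addr_a)
   addr_b = transfer(c_loc(b), addr_b)
   ! Integer division by the page size gives the page index of each array start.
   write (*,*) 'a: address ', addr_a, ' page ', addr_a / page_bytes
   write (*,*) 'b: address ', addr_b, ' page ', addr_b / page_bytes
   write (*,*) 'same page? ', addr_a / page_bytes == addr_b / page_bytes
end program same_page
```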

The main reason for investigating this issue is that the Windows Stack (unique for each thread) is not extendable while the extendable Heap is shared between all threads.

Managing arrays between the stack and heap may be addressed with Version 5.0 OpenMP Memory Model, although the options are very cryptic. I am not sure of the status for implementation of Version 5.0 or 5.1 of OpenMP.

Does Linux have the same issues ?