I am struggling to manage the stack!
I have come to the view that stack overflow errors should not exist; that it is laziness of the operating system not to provide a stack overflow extension. I don't expect to win on this one soon, so I need to understand the existing stack management approaches.
Often, the answer to stack overflow errors is to make the stack bigger, but to make it bigger you need to know how big the stack is initially.
I have often found this a difficult question to answer. Am I stupid because I can’t find the right documentation?
Over the years using many different Fortran compilers, my best approach has always been to avoid the stack, so use ALLOCATE or (before F90) store/allocate arrays in COMMON.
Now for my new question: are local/automatic arrays better than ALLOCATE arrays in !$OMP regions, and can I show this difference?
The background to this question is that PRIVATE arrays on separate pages for each thread should produce fewer "cache coherence" problems. My previous approach was to ALLOCATE arrays (presumably on a single heap), which does not carry the risk of stack overflow but is potentially more likely to suffer from "cache coherence" problems.
To minimise potential stack overflow issues, I have also used the approach of ALLOCATE ( array(n,m,0:num_threads) ), with array as SHARED. Again, this could be expected to have more "cache coherence" problems (unless bytes*n*m is the page size).
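The shared-array layout described above could be sketched roughly as follows. This is only an illustration under assumptions: the names `work`, `n` and `m` and the trivial loop body are made up, and the last dimension is sized 0:num_threads-1 on the assumption that thread IDs start at 0.

```fortran
program shared_slices
   use omp_lib
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer, parameter :: n = 4, m = 4          ! hypothetical slice dimensions
   real(dp), allocatable :: work(:,:,:)
   integer :: num_threads, id

   num_threads = omp_get_max_threads()
   ! One n x m slice per thread; the thread ID indexes the last dimension,
   ! so each thread's slice is contiguous in memory.
   allocate (work(n, m, 0:num_threads-1))
   work = 0.0_dp

!$OMP PARALLEL DEFAULT(NONE) SHARED(work) PRIVATE(id)
   id = omp_get_thread_num()
   work(:, :, id) = real(id, dp)               ! each thread writes only its own slice
!$OMP END PARALLEL

   do id = 0, num_threads - 1
      write (*,*) 'slice', id, 'sum =', sum(work(:, :, id))
   end do
end program shared_slices
```

Whether adjacent slices interfere with each other depends on where the slice boundaries fall relative to the hardware's coherence granularity, so this layout on its own does not guarantee isolation between threads.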
Recently I have been experimenting with use of "the stack" with !$OMP (in gFortran; is iFort different?). I was hoping to show that each thread had a separate stack, so that private variables and arrays would be on a separate "memory page" and not produce "cache coherence" problems.
How do I check this?
I have tried to test this, as I have assumed there is a single heap where ALLOCATE arrays go, which can cause this problem.
I can use LOC to report the address of local automatic arrays or ALLOCATE arrays in subroutines called in a !$OMP region.
Unfortunately the coded test results have indicated there is no difference. Performance is little different and LOC addresses appear mixed between the threads, i.e. the results are not what I expected, so my assumptions or interpretations are wrong!
Is it possible from Fortran code to find out where the stack is and how big it is (for each thread in a !$OMP region, if this is the case)?
Does anyone have any advice, or can anyone suggest links to documentation that may explain this problem?
I have attached a program I ran using gFortran which provided the inconclusive results I discussed. I have included the .f90 code, the .log run report and .xlsx analysis which shows the order of memory addresses for arrays, for OpenMP runs for local or ALLOCATE arrays. The results appear to show that all threads are mixed in the same memory area and not segregated by thread ID.
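For readers without the attachment, the kind of address report described above might look like the sketch below. It is not the attached program, only a minimal illustration; the names `addr_report` and `report` and the array length are invented, and it uses C_LOC rather than the non-standard LOC. Where the automatic array ends up is compiler- and flag-dependent.

```fortran
module addr_report
   use iso_c_binding
   implicit none
contains
   subroutine report(n, id)
      integer, intent(in) :: n, id
      real(c_double), target :: auto_arr(n)               ! automatic array
      real(c_double), allocatable, target :: heap_arr(:)  ! ALLOCATE array
      allocate (heap_arr(n))
      auto_arr = 0
      heap_arr = 0
      ! Serialise the output so lines from different threads do not interleave
!$OMP CRITICAL
      write (*,'(a,i3,2(a,i20))') 'thread', id, &
         '  automatic @', transfer(c_loc(auto_arr), 0_c_intptr_t), &
         '  allocated @', transfer(c_loc(heap_arr), 0_c_intptr_t)
!$OMP END CRITICAL
   end subroutine report
end module addr_report

program show_addresses
   use omp_lib
   use addr_report
   implicit none
!$OMP PARALLEL
   call report(1000, omp_get_thread_num())
!$OMP END PARALLEL
end program show_addresses
```

Built with something like `gfortran -fopenmp`, each thread prints one line, and sorting the lines by address shows whether the two kinds of array are grouped by thread or interleaved.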
Some general observations - OpenMP experts (I am not!) will undoubtedly chime in.
Some operating systems do have automatic stack expansion. OpenVMS does, but Windows does not. Linux doesn't seem to either. On Windows, the stack size is set by a 32-bit value (even in 64-bit applications) in the EXE file and can't be changed dynamically.
In threaded applications, each thread has its own stack. Whether this space comes out of the regular stack or is dynamically allocated is an implementation detail of the specific threading package. Because threaded applications share the same address space, the stack can't be expanded once established.
I'll leave the "cache coherence" part of your question to others.
There is always GetCurrentThreadStackLimits. I tried to create an example with openmp but for some reason the program would never spawn any new threads. So much for my first attempt at openmp.
EDIT: I was misspelling as !OMP$. Too much directive-enhanced compilation :) So now it works:
    ! omp_test.f90
    ! gfortran -fopenmp omp_test.f90 -oomp_test
    module M
       use ISO_C_BINDING
       implicit none
       private
       public GetCurrentThreadStackLimits
       ! Win32 API: returns the bounds of the calling thread's stack
       interface
          subroutine GetCurrentThreadStackLimits(LowLimit,HighLimit) &
             bind(C,name='GetCurrentThreadStackLimits')
             import
             implicit none
             !GCC$ ATTRIBUTES STDCALL:: GetCurrentThreadStackLimits
             !DEC$ ATTRIBUTES STDCALL:: GetCurrentThreadStackLimits
             type(C_PTR) LowLimit
             type(C_PTR) HighLimit
          end subroutine GetCurrentThreadStackLimits
       end interface
       ! One record per thread: address of a local variable plus the stack bounds
       type, public :: T
          type(C_PTR) :: Address = C_NULL_PTR
          type(C_PTR) :: LowLimit = C_NULL_PTR
          type(C_PTR) :: HighLimit = C_NULL_PTR
       end type T
       type(T), allocatable, public :: Tarray(:)
       public S
       contains
          recursive subroutine S(i)
             use omp_lib
             integer i
             integer, target :: j
             ! j is a local scalar, so its address lies on the calling thread's stack
             j = OMP_GET_THREAD_NUM()
             Tarray(j)%Address = C_LOC(j)
             call GetCurrentThreadStackLimits(Tarray(j)%LowLimit,Tarray(j)%HighLimit)
          end subroutine S
    end module M

    program P
       use M
       use ISO_C_BINDING
       use omp_lib
       implicit none
       integer, parameter :: N = 100
       integer i
       integer, parameter :: Nthread = 10
       allocate(Tarray(0:Nthread-1))
       call OMP_SET_NUM_THREADS(Nthread)
       call OMP_SET_DYNAMIC(.FALSE.)
       !$OMP PARALLEL DO     &
       !$OMP DEFAULT(NONE)   &
       !$OMP private(i)      &
       !$OMP SHARED(Tarray)
       do i = 1, N
          call S(i)
       end do
       !$OMP END PARALLEL DO
       ! Report the values recorded by each thread
       do i = 0, Nthread-1
          write(*,*) 'Thread = ', i
          write(*,*) 'Address = ', transfer(Tarray(i)%address,0_C_INTPTR_T)
          write(*,*) 'LowLimit = ', transfer(Tarray(i)%LowLimit,0_C_INTPTR_T)
          write(*,*) 'HighLimit = ', transfer(Tarray(i)%HighLimit,0_C_INTPTR_T)
       end do
    end program P
Output with gfortran:
    Thread = 0  Address = 2357948   LowLimit = 262144    HighLimit = 2359296
    Thread = 1  Address = 32767468  LowLimit = 30670848  HighLimit = 32768000
    Thread = 2  Address = 34864620  LowLimit = 32768000  HighLimit = 34865152
    Thread = 3  Address = 36961772  LowLimit = 34865152  HighLimit = 36962304
    Thread = 4  Address = 39058924  LowLimit = 36962304  HighLimit = 39059456
    Thread = 5  Address = 41156076  LowLimit = 39059456  HighLimit = 41156608
    Thread = 6  Address = 43253228  LowLimit = 41156608  HighLimit = 43253760
    Thread = 7  Address = 45350380  LowLimit = 43253760  HighLimit = 45350912
    Thread = 8  Address = 47447532  LowLimit = 45350912  HighLimit = 47448064
    Thread = 9  Address = 49544684  LowLimit = 47448064  HighLimit = 49545216
@John Campbell: I found out that the reason gfortran is putting your arrays on the heap while mine is putting my scalar on the stack is that gfortran always allocates local arrays with non-constant bounds on the heap unless the switch -fstack-arrays is in effect. So use this switch and perhaps try out my interface to GetCurrentThreadStackLimits and I think your program will behave more as expected.
Steve & Repeat Offender,
Thanks for your replies. You have helped a great deal and given me something to work on.
I shall now test if having private arrays on separate memory pages for each thread does have a measurable effect on performance.
There are many things that can affect performance; we have to learn which are the more significant!
A small addition to what was said above: when we first started working with OpenMP on Windows95, the thread stack allocations were only 1MB by default. Among the problems posed were cache evictions due to mapping conflicts among threads. A 2MB thread stack was sufficient to solve this problem, and that has remained the default for 32-bit Windows. For 64-bit OS on the same CPUs, the default has been 4MB from the beginning, as 2MB is so frequently insufficient, and 4MB doesn't pose a problem with many threads using excessive stack unless possibly when hyperthreads are used. Any useable value would keep the stacks in separate 4KB pages so we needn't worry about a TLB eviction affecting another thread. I didn't try to use your data to see if the stacks are page aligned, which might be a small advantage. On a NUMA system we depend on the OS placing the stacks in local memory. It seems unlikely that large memory usage by an application would prevent a new thread starting up with a local stack of any reasonable size.
FWIW, RO's program builds and runs fine in ifort 18. In ifort, you can USE KERNEL32 to get the declaration for GetCurrentThreadStackLimits, but RO's version works too. Here's the 64-bit version on my PC:
    D:\Projects\Console6>ifort /Qopenmp console6.f90
    Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 22.214.171.124 Build 20180210
    Copyright (C) 1985-2018 Intel Corporation. All rights reserved.

    Microsoft (R) Incremental Linker Version 14.14.26431.0
    Copyright (C) Microsoft Corporation. All rights reserved.

    -out:console6.exe
    -subsystem:console
    -defaultlib:libiomp5md.lib
    -nodefaultlib:vcomp.lib
    -nodefaultlib:vcompd.lib
    console6.obj

    D:\Projects\Console6>console6.exe
    Thread = 0  Address = 691505787352  LowLimit = 691504742400  HighLimit = 691505790976
    Thread = 1  Address = 691516012760  LowLimit = 691511754752  HighLimit = 691516014592
    Thread = 2  Address = 691520272088  LowLimit = 691516014592  HighLimit = 691520274432
    Thread = 3  Address = 691524532568  LowLimit = 691520274432  HighLimit = 691524534272
    Thread = 4  Address = 691528791128  LowLimit = 691524534272  HighLimit = 691528794112
    Thread = 5  Address = 691533052376  LowLimit = 691528794112  HighLimit = 691533053952
    Thread = 6  Address = 691537311576  LowLimit = 691533053952  HighLimit = 691537313792
    Thread = 7  Address = 691541571672  LowLimit = 691537313792  HighLimit = 691541573632
    Thread = 8  Address = 691545831512  LowLimit = 691541573632  HighLimit = 691545833472
    Thread = 9  Address = 691550091864  LowLimit = 691545833472  HighLimit = 691550093312
Yeah, ifort doesn't seem to put local arrays with non-constant bounds on the heap by default as gfortran does, so my version that takes the address of an array (not posted) works by default in ifort.
But one thing I found out to my horror is that openmp relies on environmental variables to set the stack size for the threads it spawns. Of course there is no way to set up the stack size for the master thread, but there is only a MASTER directive, while to get only threads with known stack size one would need a NOMASTER directive. Apparently one has to recompile to change the master thread stack size. I suppose one could use CreateThread to start a new thread with known stack size (the program could inquire the stack size of openmp-spawned threads then use CreateThread to spawn a thread with this size) and then use that as the master thread, but it's shocking that openmp makes the programmer jump through hoops like that to avoid stack overflow. Do coarrays have the same issue?
Coarrays don't have this issue - a coarray program is more like MPI where each "image" has its own storage. There's no "thread stack" involved. A typical coarray is dynamically allocated, though you can declare one to be a module or program variable like any other array. Unlike regular arrays, there is no concept of an "automatic" coarray.
I can't get your example program to run. I am using 64-bit gFortran on Windows 7-64, but when building it does not find GetCurrentThreadStackLimits. Do you have any suggestions?
I am also surprised/confused by the difficulty in changing the stack size for threads. If I am going to target local arrays on a unique stack, 4 MB is not enough, so how can we change the stack size for all threads, including the master? Using ALLOCATE on a shared heap appears more robust.
It looks like you need Windows 8 for GetCurrentThreadStackLimits. Intel Fortran has KMP_SET_STACKSIZE_S to set the stack size of threads you will subsequently create, but it doesn't seem to be documented for gfortran. There is also the possibility of setting the OMP_STACKSIZE environment variable at run time via the setenv function:
    interface
       function setenv(var_name,new_value,change_flag) bind(C,name='setenv')
          use ISO_C_BINDING
          implicit none
          integer(C_INT) setenv
          character(KIND=C_CHAR) var_name(*)
          character(KIND=C_CHAR) new_value(*)
          integer(C_INT), value :: change_flag
       end function setenv
    end interface
But I just now typed that in... OK I'll test it... didn't work: MinGW wants putenv:
    interface
       function putenv(string) bind(C,name='_putenv')
          use ISO_C_BINDING
          implicit none
          integer(C_INT) putenv
          character(KIND=C_CHAR) string(*)
       end function putenv
    end interface
Compiled, linked, and ran, returning success, but it had no effect on stack size! I was afraid of this: the environment variables seem to be cached when execution starts, and subsequent changes don't do anything. I also checked via GET_ENVIRONMENT_VARIABLE that OMP_STACKSIZE had been set this way, and it was. So to change the master thread stack size you need to compile it in, e.g.
gfortran -fopenmp -fstack-arrays omp_test.f90 -oomp_test -Wl,--stack,4194304
Or change it later with EDITBIN and its /STACK option (DUMPBIN only displays the value). To change the other thread stack sizes you need to modify the OMP_STACKSIZE variable before execution with, e.g.
set OMP_STACKSIZE=4 M
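The GET_ENVIRONMENT_VARIABLE check mentioned above can be sketched as below. The program name and buffer length are arbitrary choices for illustration. Note that this only shows what the process sees in its environment; as found above, it says nothing about whether the OpenMP runtime actually applied the value.

```fortran
program show_omp_stacksize
   implicit none
   character(len=64) :: val
   integer :: length, stat

   ! Standard status codes: 0 = found, 1 = variable not set, -1 = value truncated
   call get_environment_variable('OMP_STACKSIZE', val, length, stat)
   select case (stat)
   case (0)
      write (*,*) 'OMP_STACKSIZE = "', val(1:length), '"'
   case (1)
      write (*,*) 'OMP_STACKSIZE is not set; the runtime default applies'
   case default
      write (*,*) 'could not read OMP_STACKSIZE, status = ', stat
   end select
end program show_omp_stacksize
```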
Thanks very much for this info, as I have been able to use the options you provided and create an 8Mb stack for all threads, using mingw gFortran Ver 7.3.0 on Windows 10, or at least I am reporting this stack size for all threads using your GetCurrentThreadStackLimits approach.
My test program reports running twice as fast using automatic arrays with -fstack-arrays as it does using ALLOCATE arrays, which could be promising. It is a program with trivial load on each thread. I also generated a linker symbol map which has lots of info to check. (It may be worth making the stack for the master thread larger.)
The commands I used were:
set OMP_STACKSIZE=8 M
gfortran -v -g -fopenmp -fstack-arrays %1.f90 -o %1.exe -Wl,-stack,8388608,-Map=%1.map >> %1.tce 2>&1
The next stage will be to test if this scales up to a real sized problem. I will update this thread when I have results.
It would be worth providing the iFort equivalent of this approach so a useable example is made available of changing and reporting the stack size for all threads in a !$OMP parallel region. I have attached my updated test using your reporting routines.
Thanks again to you, Steve and Tim for your help with this problem.
@John Campbell: good to see that you are moving forward. Hopefully you are using some switch such as -O2 when compiling with gfortran as that compiler doesn't optimize by default the way ifort does when invoked at the command line.
As Steve pointed out my example was designed to be transportable at least between ifort and gfortran. On Linux the ifort syntax for increasing the stack size of the master thread is similar to gfortran which makes sense because it's sort of gcc-compatible rather than MSVC++ compatible on that platform. On Windows I had success with
ifort /Qopenmp omp_test.f90 /link /STACK:8388608
Setting the environmental variable OMP_STACKSIZE to the desired value seems to be the thing to do on all 3 platforms to give non-master threads the desired stack size.
OMP_STACKSIZE is now the recommended variable - KMP_STACKSIZE is an artifact of a time before there was an OMP variable for it. (Trivia: The K in KMP stands for Kuck as in Kuck and Associates (KAI), which had a strong business in an OpenMP Fortran-to-Fortran translator called GUIDE. DVF users may remember this. Intel bought KAI in the late 1990s; we at DEC wanted to but it didn't go through.)
I think you have to set this variable either before the run or before your first parallel region.
More experimentation with
i = putenv('OMP_STACKSIZE=8 M'//achar(0))
as in Quote #10 leads me to the conclusion that the environmental variable has to be changed not only before the first parallel region but even before the first call to an OMP_* subroutine, both in gfortran and ifort. Also gfortran doesn't seem to permit the spawned threads to have bigger stacks than the master thread, although ifort doesn't seem to have this restriction.
I know I am getting myself way out on a limb by entering this discussion, but here goes.
I can only tell you of my own experience. Then you can decide for yourself whether or not you want to use it. (I use an Intel Fortran compiler with OMP implemented.)
That Stack Overflow (SO) used to bug me too. As I would develop my programs and execute them from the beginning, ordinarily things would work just fine. It was when I started adding larger and larger arrays (2D, 3D, 4D etc) that SO would jump up and bite me.
By a very careful reading (and STUDY) of my ‘old’ Compaq Fortran Language Reference Manual I realized that all I had to do was add the STATIC attribute to the declaration statement for those large arrays: INTEGER,STATIC::A(1000,200,3,8).
When the sum of my dimensions exceeded something around 100K, I would get that old SO error, in which case that particular array would then join the ‘STATIC’ declaration statement.
That’s it. No fuss, no bother, no problem. I’m sure some of the experts will take exception to this too-simple fix, but that’s OK. It works for me, and that’s all that counts in my book!
STATIC is not a standard feature. I would recommend using allocatable arrays for large arrays.
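As a rough sketch of the standard alternatives, the array from the previous post could be declared as follows. The module and variable names are made up, and SAVE is offered here as what I understand to be the closest standard counterpart of the STATIC extension rather than an exact equivalent.

```fortran
module big_data
   implicit none
   ! Standard attribute giving the array static storage instead of stack storage
   integer, save :: a_static(1000, 200, 3, 8)
   ! Allocatable alternative: storage is obtained at run time
   integer, allocatable :: a_heap(:,:,:,:)
end module big_data

program use_big_data
   use big_data
   implicit none
   allocate (a_heap(1000, 200, 3, 8))
   a_static = 1
   a_heap = 2
   write (*,*) a_static(1,1,1,1) + a_heap(1000,200,3,8)
   deallocate (a_heap)
end program use_big_data
```

One caution in the context of this thread: under the OpenMP data-sharing rules, a local variable with the SAVE attribute in a routine called from a parallel region is shared between threads, so this is not a drop-in replacement for a private work array.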
I have now done some preliminary tests, replacing ALLOCATE (arrays) with automatic arrays, modifying the stack size to 8 MB and using -fstack-arrays. For my real FE calculations, there were 7 vectors (58000) being defined by 10 or 11 threads on an i7-8700K processor.
The ALLOCATE run took 59m:27sec (on 27_Jun), while my first attempt at separate stack arrays took 62m:08sec. I modified the compile options, reduced each stack size from 16 MB to 8 MB and excluded virus scanning; it then took 59m:47sec. Hoping I have not made mistakes in my testing, I would conclude that converting from heap arrays to stack arrays has had an insignificant effect on performance. Perhaps arrays of 0.5 MB are too big to generate significant cache coherence issues. Certainly heap vs stack shows no difference.
Fortunately, from this I have learnt more about controlling the stack size. I did find that for my particular configuration, use of 'set OMP_STACKSIZE=' was not necessary if the stack was set when linking. Again I should warn it is difficult to extrapolate any of these results to other configurations.
Based on this I would recommend using the heap, solely because it is the more robust approach: no stack overflow errors!
(My best improvement over the last 12 months of development has been upgrading to a new processor with faster memory and more threads)
Thanks again for the assistance.
I suspect this info is ‘old hat’ by now, and that you have moved on to more enlightened constructs, but this is the first chance I have had to run your initial program. At my present location I am operating with an old Monarch PC with a 1-core (2 threads) CPU in MS XP pro, with Intel Fortran, Visual Studio, and OMP.
The only change I made to your program was to comment out your DEFAULT clause in your parallel construct. My implementation of OMP did not like that at all. I won’t go into the error statements, just say that was the only change.
When I ran your program sequentially it ran fine, with an execution time of less than a minute. Every one of the (4 x 8 =) 32 ‘mythread’ scalars produced a value of 10.
When I ran your program with OMP generating parallel code, typical 1st two groups of 8 ‘mythread’ outputs each were: (0 1 1 0 1 0 1 0) & (0 1 0 1 0 1 0 1), where the Master thread’s number is 0, and the slave is 1. In successive runs the distributions might change slightly, but each of my two threads always took on 4 of the 8 tasks in each of your first 2 groups.
In the remaining two groups (i.e. Local arrays in stack & Allocate arrays in shared heap), all 16 ‘mythread’ outputs were 0.
When I added your parallel region structure to your 2nd DO Loop, then all 4 of your groups produced similar ( 0,1) ‘mythread’ outputs using my two (2) threads.
Each OMP run took only seconds, some of which was taken up by ‘write’ statements. All of your other output numbers remained stable without change.
If a performance difference between stack and heap is seen for large arrays, it might be worth while to profile to find out why.
The time spent in allocation for a large array ought to be relatively small, and there should be no large penalty for using ALLOCATE with error processing, so that you know easily where a possible failure occurs.
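A minimal sketch of ALLOCATE with error processing is shown below. The vector length is borrowed from the figure quoted earlier in the thread; the variable names and the message text are illustrative only.

```fortran
program alloc_checked
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   real(dp), allocatable :: v(:)
   integer :: n, ierr
   character(len=256) :: msg

   n = 58000                        ! vector length mentioned earlier in the thread
   ! With STAT= present a failed allocation returns a code instead of aborting
   allocate (v(n), stat=ierr, errmsg=msg)
   if (ierr /= 0) then
      write (*,*) 'ALLOCATE of v failed: ', trim(msg)
      error stop 1
   end if
   v = 0.0_dp
   write (*,*) 'allocated', size(v), 'elements'
   deallocate (v)
end program alloc_checked
```

Placing a check like this next to each large ALLOCATE identifies the failing array directly, which a stack overflow generally does not.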
My initial test example (evalFC2.f90), which uses small dimension(4) arrays, appears to show a difference between stack and heap performance. This example could be due to cache coherence issues. However, my main use of !$OMP is with large arrays and this shows no difference. For me there is no big change, so when using large arrays use ALLOCATE.
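For anyone wanting to repeat this kind of comparison, a small timing harness along the following lines could be used. It is not evalFC2.f90, only a sketch under assumptions: the routine names, the array length of 4 and the repetition count are invented, and an optimising compiler may remove much of the work, so results need careful interpretation.

```fortran
module kernels
   implicit none
   integer, parameter :: dp = kind(1.0d0)
contains
   function with_automatic(n, x) result(s)
      integer, intent(in) :: n
      real(dp), intent(in) :: x
      real(dp) :: s
      real(dp) :: w(n)                  ! automatic array, sized by the dummy argument
      w = x
      s = sum(w)
   end function with_automatic

   function with_allocate(n, x) result(s)
      integer, intent(in) :: n
      real(dp), intent(in) :: x
      real(dp) :: s
      real(dp), allocatable :: w(:)
      allocate (w(n))                   ! explicit allocation on every call
      w = x
      s = sum(w)
   end function with_allocate           ! w is deallocated automatically on return
end module kernels

program bench
   use omp_lib
   use kernels
   implicit none
   integer, parameter :: n = 4, reps = 2000000   ! hypothetical sizes
   integer :: i
   real(dp) :: t0, t1, total

   total = 0.0_dp
   t0 = omp_get_wtime()
!$OMP PARALLEL DO REDUCTION(+:total)
   do i = 1, reps
      total = total + with_automatic(n, real(i, dp))
   end do
!$OMP END PARALLEL DO
   t1 = omp_get_wtime()
   write (*,*) 'automatic arrays:', t1 - t0, 's, checksum', total

   total = 0.0_dp
   t0 = omp_get_wtime()
!$OMP PARALLEL DO REDUCTION(+:total)
   do i = 1, reps
      total = total + with_allocate(n, real(i, dp))
   end do
!$OMP END PARALLEL DO
   t1 = omp_get_wtime()
   write (*,*) 'ALLOCATE arrays: ', t1 - t0, 's, checksum', total
end program bench
```

Building it once with and once without -fstack-arrays in gfortran, and varying n, gives a way to see at what array size any difference disappears.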