
Stack with !$OMP

John_Campbell
New Contributor II

I am struggling to manage the stack!

I have come to the view that stack overflow errors should not exist; it is laziness on the part of the operating system not to provide stack extension on overflow. I don't expect to win on this one soon, so I need to understand the existing stack management approaches.

Often the answer to stack overflow errors is to make the stack bigger, but to make it bigger you need to know how big the stack is initially.
I have often found this to be a difficult question to answer. Am I stupid because I can't find the right documentation?
Over the years, using many different Fortran compilers, my best approach has always been to avoid the stack: use ALLOCATE or (before F90) store arrays in COMMON.

Now for my new question: are local/automatic arrays better than ALLOCATE arrays in !$OMP regions, and can I show this difference?

The background to this question is that PRIVATE arrays on separate pages for each thread should produce fewer "cache coherence" (false sharing) problems. My previous approach was to ALLOCATE arrays (presumably all on a single heap), which avoids the risk of stack overflow but is potentially more prone to these "cache coherence" problems.
To minimise potential stack overflow issues, I have also used the approach of ALLOCATE ( array(n,m,0:num_threads) ), with array as SHARED. Again, this could have more "cache coherence" problems than expected (unless bytes*n*m is a multiple of the page size).
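
To illustrate that padded shared-slab idea, here is a minimal sketch (assuming 4 KB pages and 8-byte reals; names and sizes are illustrative):

program padded_slab
   use omp_lib
   implicit none
   integer, parameter :: page_bytes = 4096     ! assumed page size
   integer, parameter :: elem_bytes = 8        ! size of a real(8) element
   integer :: n, m, nel, nel_pad, id
   real(8), allocatable :: work(:,:)           ! SHARED slab, one slice per thread
   n = 100 ; m = 100 ; nel = n*m
   ! round each thread's slice up to a whole number of pages, so that two
   ! threads' slices never share a page (caveat: ALLOCATE does not guarantee
   ! the slab base address is page aligned, so this is only approximate)
   nel_pad = ((nel*elem_bytes + page_bytes - 1) / page_bytes) * (page_bytes/elem_bytes)
   allocate ( work(nel_pad, 0:omp_get_max_threads()-1) )
!$omp parallel private(id)
   id = omp_get_thread_num()
   work(1:nel, id) = real(id,8)                ! each thread touches only its own slice
!$omp end parallel
   print *, 'slice =', nel, ' elements, padded to', nel_pad
end program padded_slab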

Recently I have been experimenting with use of "the stack" with !$OMP (in gfortran; is ifort different?). I was hoping to show that each thread has a separate stack, so that private variables and arrays would be on separate memory pages and not produce "cache coherence" problems.

How do I check this?
I have tried to test this, on the assumption that there is a single heap where ALLOCATE arrays go, which can cause this problem.
I can use LOC to report the address of local automatic arrays or ALLOCATE arrays in subroutines called in a !$OMP region.
Unfortunately the coded test results have indicated there is no difference: performance is little different, and LOC addresses appear mixed between the threads,
i.e. the results are not what I expected, so my assumptions or interpretations are wrong!
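
For reference, a minimal sketch of the kind of probe I mean (simplified; LOC is a common extension supported by both gfortran and ifort):

program where_are_arrays
   use omp_lib
   implicit none
   call run_test (1000)
contains
   subroutine run_test (n)
      integer, intent(in) :: n
!$omp parallel default(shared)
      call worker (n)
!$omp end parallel
   end subroutine run_test
   subroutine worker (n)
      integer, intent(in) :: n
      real :: auto_array(n)               ! automatic: expected on this thread's stack
      real, allocatable :: heap_array(:)  ! allocatable: expected on the heap
      allocate ( heap_array(n) )
      auto_array(1) = 0. ; heap_array(1) = 0.
!$omp critical
      write (*,'(a,i3,a,z16,a,z16)') ' thread', omp_get_thread_num(), &
            '  stack addr =', loc(auto_array), '  heap addr =', loc(heap_array)
!$omp end critical
      deallocate ( heap_array )
   end subroutine worker
end program where_are_arrays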

Is it possible from Fortran code to find out where the stack is and how big it is (for each thread in a !$OMP region, if each thread has its own)?
Does anyone have any advice, or links to documentation that may explain this problem?

I have attached a program I ran using gfortran, which provided the inconclusive results discussed above. I have included the .f90 code, the .log run report and an .xlsx analysis showing the order of memory addresses for arrays, for OpenMP runs with local or ALLOCATE arrays. The results appear to show that all threads are mixed in the same memory area and not segregated by thread ID.

jimdempseyatthecove
Honored Contributor III

The problem you have is a case of knowing your limitations. As Dirty Harry said: "A man's got to know his limitations."

One of the problems inexperienced programmers have, often learned from Linux programming, is assuming that ulimit with an "unlimited" stack setting does in fact make the stack unlimited. It does not, and in particular it does not mean an unlimited stack for each thread in a multi-threaded process.

In a multi-threaded process, all threads of the process share the same virtual address space. A portion of it is carved out for the system; the remaining virtual address space is unclaimed, and a portion of that may be identified as unavailable (aka reserved).
Process load instantiates a single thread and loads in the code and static data.

What is left of the available address space can be dynamically claimed by the heap and the stack (note: at this time, the stack of the main and only thread).

*** at this point there is only 1 thread in the process.

Should the process remain single-threaded, the situation is relatively simple: the heap, as it expands, grows upwards, while the stack of the single thread grows downwards into the unused space. You run out of memory at the point of: a) collision of the two expanding regions, b) exhaustion of page-file pages, or c) possibly exhaustion of the physical address-space mapping, which is CPU-dependent (48 bits on some systems).

With one thread it is relatively simple, as there is no need to define how you partition the (unused) memory between heap and stack. IOW, an unlimited stack means "until you collide with the heap".

Now then, as you add threads, recall that they all share the same address space. Where do you place the new thread's stack?

You cannot place it immediately below the master thread's stack, as then the master thread's stack cannot grow.
You cannot place it immediately above the currently used heap, as then the new thread's stack couldn't grow without colliding with the heap.
And what is a reasonable estimate of the required size? If you pick too large, the heap is restricted; if you pick too small, the thread's stack is limited.

Is it reasonable to assume you should partition the space as half for the heap, and half divided into omp_get_max_threads() partitions?
You can use kmp_set_stacksize_s(size_t size), kmp_set_stacksize(int size), or KMP_STACKSIZE, but these are not portable, and you still have no way of knowing how big to make each stack, or whether you want/need the stack sizes to be asymmetric. If you choose too large, this may limit the heap. If the heap is permitted to steal the unused parts of the lowest address spaces of the thread stacks, then the heap becomes fragmented. IOW, while total heap space may be available, it might not be available in one piece, due to fragmentation.
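
For instance, a minimal sketch using these extensions (assuming Intel's omp_lib module, which exports kmp_set_stacksize_s and the kmp_size_t_kind kind; not portable):

program set_stacks
   use omp_lib                     ! Intel's omp_lib exposes the kmp_* extensions
   implicit none
   integer (kind=kmp_size_t_kind) :: sz
   sz = 64 * 1024 * 1024           ! request 64 MB per OpenMP worker stack
   call kmp_set_stacksize_s (sz)   ! must precede the first parallel region
!$omp parallel
   ! threads created here receive the requested stack size
!$omp end parallel
end program set_stacks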

IMHO the best practice is to keep AUTOMATIC (stack) allocations to manageable sizes and, for potentially large allocations, place them on the heap:

...
real, allocatable :: work(:)
...
!$omp parallel private(work)
allocate(work(nVariables))
!$omp do
do ...
  ...
end do
!$omp end do
deallocate(work)
!$omp end parallel
...

Jim Dempsey

John_Campbell
New Contributor II

Jim,

"*** at this point there is only 1 thread in the process."

From using GetCurrentThreadStackLimits (Quote #3), my impression is that the layout of the stack(s) is done at build/link time. If you specify openmp at link time, then I assume stacks for max threads would be configured then. I would then assume this implies that the OMP_STACKSIZE value at link time is used to assign the thread stack size.

If this is not the case, with OMP_STACKSIZE instead being referenced at the start of the program, then the thread stack layout would still be set at the start of the program.

Repeat Offender in Quote #14 above experimented with varying OMP_STACKSIZE during the run, but I am not sure if this works.

My take from this is that in Windows the stack sizes are fixed, so we need to be careful about which dynamically sized arrays are placed on these stacks.

I did have an approach of defining a large thread stack for automatic arrays, but a new analysis project had arrays that were too big!

ALLOCATE is certainly the most robust route to take for large arrays, so I am looking to see if I can minimise memory page conflicts between threads. Hence I want to see if there is some vehicle to place arrays for different threads on different memory pages on the heap.

Fortunately, as the array size increases, page conflicts between threads become less significant, but memory transfer bandwidth becomes significant. ... Always a new OpenMP side effect to learn about.

I would be interested to hear if my stack/heap usage experience with gfortran is similar in ifort.

John

jimdempseyatthecove
Honored Contributor III

>>If you specify openmp at link time, then I assume stacks for max threads would be configured then

The number of threads is NOT known until run time. The build computer is not necessarily the run computer, and/or the run shell may have restricted the logical processors available to the new process.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

>>Repeat Offender in Quote #14 above experimented with varying OMP_STACKSIZE during the run, but I am not sure if this works.

That environment variable may be captured (once) at program start, as opposed to immediately before each occurrence of thread creation. Additionally, threads are created only once. Therefore, OMP_STACKSIZE applies prior to the first parallel region and, if applicable, to the first entry into each nest level, should nesting be used. Whether OMP_STACKSIZE is interrogated once at program start or at each thread creation is an implementation issue.

Jim Dempsey

John_Campbell
New Contributor II

I have reviewed my testing of PRIVATE arrays and updated the program I have used for testing.

I have included an allocatable array as a routine argument. Private copies of this array are also placed on the heap, as it has the allocatable attribute.

I have tested the program with a variety of stack sizes, modified at build time. Both compilers I am using default to thread stacks being the same size as the master stack. I have also tested OMP_STACKSIZE on the two compilers: one disregards it completely, while the other only uses the value if it increases the size of the stacks for threads 1:max_threads-1. Since I am only using !$OMP PARALLEL DO, the thread 0 stack is always more heavily used than the other thread stacks, so I am not sure OMP_STACKSIZE has useful functionality.

It also appears that all thread stacks are created at the start of the application, for max_threads of the run environment, not at the first !$OMP region. I base this on the reported change of memory in use as the program runs. Hopefully this is correct!

I was wondering if Repeat Offender, Jim or someone else could test the program with ifort, as I don't presently have a usable copy. I would be interested to know whether the default location of private arrays is the same as in gfortran, or if there is some uncertainty as to the heap/stack allocation. Selecting heap-arrays does not appear to change the destination of the private copies, although it does change that of the master copy of local arrays.

I have also identified the following useful routines:
function which_stack (loc(array))  { identifies which thread stack, if any, the array is in }, and
subroutine get_avail_memory_bytes (avail_bytes)  { reports available free physical memory }
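
A minimal sketch of how which_stack can test an address against the calling thread's stack bounds (64-bit Windows 8 or later; simplified from my actual routine, so names are illustrative):

module stack_bounds
   use, intrinsic :: iso_c_binding, only : c_intptr_t
   implicit none
   interface
      ! Win32 API: returns the current thread's stack low and high limits
      subroutine GetCurrentThreadStackLimits (LowLimit, HighLimit) &
                 bind (C, name='GetCurrentThreadStackLimits')
         import :: c_intptr_t
         integer (c_intptr_t), intent(out) :: LowLimit, HighLimit
      end subroutine GetCurrentThreadStackLimits
   end interface
contains
   logical function on_my_stack (addr)   ! is addr within this thread's stack?
      integer (c_intptr_t), intent(in) :: addr
      integer (c_intptr_t) :: lo, hi
      call GetCurrentThreadStackLimits (lo, hi)
      on_my_stack = addr >= lo .and. addr < hi
   end function on_my_stack
end module stack_bounds

A call such as on_my_stack(int(loc(array), c_intptr_t)) inside each thread reports whether the array is on that thread's stack; recording every thread's bounds gives the full which_stack behaviour.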

Thanks for your assistance with this investigation,

John

(My attempts to upload .f90 files report an error, but appear to be successful.)

jimdempseyatthecove
Honored Contributor III

>>It also appears that all thread stacks are created at the start of the application

Not quite. They are created at the start of each thread, IOW on entry into the first parallel region (the first time), and, should your program use nested parallel regions, the first time any additional threads are created for use by the (new) team of the thread creating the nest level.

I couldn't get your program to run on Windows 7 x64.

Your interfaces did not interface to kernel32.

I substituted USE KERNEL32 for your INTERFACEs and type definitions. That got further, but the GetCurrentThreadStackLimits entry point is not found in kernel32.dll on Windows 7.

Adding MinCore.lib (a Windows 8.0 hack) to the dependencies resolved the GetCurrentThreadStackLimits issue.

I then got an access violation reading location error in ntdll.dll (likely a compatibility issue of using MinCore.lib on Windows 7).

Same/similar issue when building Win32.

Using Intel Parallel Studio XE 2020.0.166
MS Visual Studio 2019 (16.5.0)

Jim Dempsey

John_Campbell
New Contributor II

Jim,

Thanks very much for looking into this. I should have noted that I am testing on Windows 10 (for GetCurrentThreadStackLimits) and a 64-bit .exe (for get_avail_memory_bytes = ullAvailPhys).

I did some reruns, including more calls to report_avail_mbytes, and separated get_Stack_info from the initialisation of !$OMP to see if I could identify a delay in creating the thread stacks.

I also increased each stack to 250 MBytes.

The net result is that I still cannot identify a time when the omp_get_max_threads() stacks have not been created (based on report_avail_mbytes).

I need to do a run with more threads than max_threads and see what stack is available for the extra threads.
Can I just call omp_set_num_threads with a higher value, or do I need to create nested teams? I will give it a go and see what I get.
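
A minimal sketch of that experiment (hedged; whether and when the extra stacks appear is exactly what I want to observe):

program more_threads
   use omp_lib
   implicit none
   integer :: id
   call omp_set_dynamic (.false.)                        ! ask for exactly what we request
   call omp_set_num_threads (2*omp_get_max_threads())    ! more threads than the default maximum
!$omp parallel private(id)
   id = omp_get_thread_num()
!$omp critical
   print *, 'thread', id, ' is running'
!$omp end critical
!$omp end parallel
end program more_threads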

Thanks for your advice,

John

jimdempseyatthecove
Honored Contributor III

John,

Why are you going through with this exercise?

In many cases (a preponderant number), the stack requirement outside the parallel region exceeds that inside the parallel region. And except in unusual cases, no specific team-member thread within a parallel region will have a stack requirement exceeding that of the master thread of the region. Setting every thread's stack above the worst-case requirement may (or may not) introduce undesirable consequences at run time: 64 threads @ 250 MB each == 16 GB of address space, and potentially 16 GB of page-file space, should the one task requiring the 250 MB migrate amongst all the threads. Whereas using the heap, or a thread-safe pool of buffers, would reduce this to near the minimum requirement.

While the 64-bit virtual address space (likely 48 or 52 bits available) has ample room to waste 15 GB, your system page file may not.

Assume for the moment that all threads at some point concurrently require 250MB additional storage for working data. What is the cost assessment?

The assessment is not multiple threads contesting for heap allocation/deallocation versus each thread making concurrent stack allocation/deallocation.
Rather...

multiple threads contesting for heap allocation/deallocation + concurrent execution time
versus
each thread making concurrent stack allocation/deallocation + concurrent execution time

At what point do you experience diminishing returns?

Jim Dempsey

John_Campbell
New Contributor II

Hi Jim,

You ask "Why are you going through with this exercise?"

I originally started with "I am struggling to manage the stack".

Fortunately, with help from yourself and Repeat Offender, I have made some progress.
The API GetCurrentThreadStackLimits provides information on the location, size and usage of each thread stack.
With this, I now have a much better understanding of how much stack is being used by each thread, and also of whether private copies of arrays go on the stack or the heap. Understanding this makes it easier to predict what size stack is required.
If I get a stack overflow, it also provides a way (via modifying the code) to get some indication of the stack use before the overflow occurred. Previously the approach was to pick an arbitrary new stack size and see what happens.

Now I can manage the stack better, so thanks very much for the help.

Another interesting outcome is learning when the thread stacks are allocated.
You are correct that they are allocated as the OMP usage is defined.
My misunderstanding in my testing is interesting: I was incorrectly estimating committed memory, so I was not identifying when memory was allocated to the new thread stacks.
I was basing the estimate of extra committed memory on the API GlobalMemoryStatusEx, which reports the "amount of physical memory available".
This available memory is the physical memory pages not yet allocated to programs. It changes with memory used, not memory committed: "used memory" is not "committed memory".
Importantly, it excludes committed memory that has not yet had physical memory pages allocated (memory addresses not yet touched).
There is also the memory being used for file buffering, which is often important to retain.
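
For anyone following along, a minimal sketch of such a wrapper (64-bit; the MEMORYSTATUSEX layout follows the Win32 documentation, and the routine name matches my earlier post):

module mem_query
   use, intrinsic :: iso_c_binding
   implicit none
   type, bind(C) :: MEMORYSTATUSEX           ! layout per the Win32 documentation
      integer(c_int32_t) :: dwLength
      integer(c_int32_t) :: dwMemoryLoad
      integer(c_int64_t) :: ullTotalPhys
      integer(c_int64_t) :: ullAvailPhys
      integer(c_int64_t) :: ullTotalPageFile
      integer(c_int64_t) :: ullAvailPageFile
      integer(c_int64_t) :: ullTotalVirtual
      integer(c_int64_t) :: ullAvailVirtual
      integer(c_int64_t) :: ullAvailExtendedVirtual
   end type
   interface
      function GlobalMemoryStatusEx (lpBuffer) bind (C, name='GlobalMemoryStatusEx') result(ok)
         import :: MEMORYSTATUSEX, c_int
         type(MEMORYSTATUSEX), intent(inout) :: lpBuffer
         integer(c_int) :: ok
      end function GlobalMemoryStatusEx
   end interface
contains
   subroutine get_avail_memory_bytes (avail_bytes)
      integer(c_int64_t), intent(out) :: avail_bytes
      type(MEMORYSTATUSEX) :: ms
      ms%dwLength = int(c_sizeof(ms), c_int32_t)   ! required before the call
      if (GlobalMemoryStatusEx(ms) /= 0) then
         avail_bytes = ms%ullAvailPhys             ! physical memory not yet in use
      else
         avail_bytes = -1
      end if
   end subroutine get_avail_memory_bytes
end module mem_query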

So estimating the amount of extra memory that could be allocated to new arrays, without the program being paged out, may be my next struggle!

I hope these issues are of interest to others on the forum.

Again, thanks for your ongoing help,

John

IanH
Honored Contributor II

Perhaps interesting to you, perhaps useful, perhaps not ... the Sysinternals VMMap tool can sometimes shed some light on what is happening within a process address space, typically at something approaching memory-page granularity.

https://docs.microsoft.com/en-us/sysinternals/downloads/vmmap

You can also run memory traces against a process, in which case you can get details on each object in the operating-system-provided heaps. I don't think you can pull apart stack allocations within a thread's stack, though.

jimdempseyatthecove
Honored Contributor III

>>...so not identifying when memory was allocated to the new thread stacks

There are two "allocations" of different natures that occur.

At process creation (approximation):

Windows                                 Linux

0x0      {protected page 0}             0x0      {protected page 0}
         {code}                                  {code}
         {static data and literals}              {static data and literals}
         {heap}                                  {heap}
         { vvvv heap growth vvvv }               { vvvv heap growth vvvv }
         .... undesignated ...                   .... undesignated ...
         { ^^^^ stack growth ^^^^ }              { ^^^^ stack growth ^^^^ }
         {stack}                                 {stack}
0x7f..f8 {top of stack}                 0xff..f8 {top of stack}
0x80..00 {start of system space}        0xff..ff {end of virtual memory}
    .... undesignated ...
0xff..ff {end of virtual memory}

Then, as DLLs or shared libraries are loaded, they acquire memory from the heap.

As stated earlier, an "allocation" occurs in two parts. While the user application allocates from the heap, the heap manager manages previously allocated (from the O/S) blocks of memory via free lists and, when necessary, expands the free lists via an interface to the O/S. The O/S adjusts the process's page tables to enable the "allocated" heap-expansion virtual address space (and may also reserve page-file pages for later use), then returns to the heap manager, and eventually control passes back to the process. The actual allocation (the second part) occurs on a "first touch" page fault on a previously enabled page-table entry that is not yet resident.

Part (a) is the virtual-address (page-table) "allocation"; then later, when necessary, comes part (b), the "first touch" allocation. Note, it is not unusual for an allocation to succeed (virtual addresses allocated), only for the program to fail later on "first touch" due to lack of resources (e.g. page file full).
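
A minimal sketch illustrating the two parts (sizes illustrative; on Windows the commit charge is typically taken at ALLOCATE, while physical pages still arrive on first touch):

program first_touch
   implicit none
   real(8), allocatable :: big(:)
   integer(8) :: i, n
   n = 2_8**28              ! ~2 GB of real(8): part (a), address space claimed
   allocate ( big(n) )      ! typically succeeds without touching physical memory
   do i = 1, n, 512         ! 512 real(8) elements = 4096 bytes = one page
      big(i) = 0.d0         ! part (b): first touch faults each page in
   end do
   print *, 'touched', n/512, 'pages'
end program first_touch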

In both operating systems, the heap grows upwards and the stack grows downwards (both in sizable blobs of virtual memory pages).

An out of memory condition can occur when either:

a) heap or stack grows into stack or heap (iow collide)
b) When the system is out of resources (e.g. page file size is exceeded or other process limit is exceeded)

On Linux a user can select an "unlimited" stack via ulimit. This really is not unlimited, as the address range, page-table entries, and page file have limited capacities.

Both systems: heap grows upward, stack grows downward, and the growth may be impeded by collision or lack of resources.

When the process adds a thread, ask yourself "where do you place the new stack?".

You cannot place it immediately below the process's (or instantiating thread's) stack, as this would inhibit future growth of that stack.
You cannot place it (a partially used, large virtual address range) in the heap, as its growth would conflict with the heap's.

What you can do is parcel out some of the .... undesignated ... memory. But then how do you do this without foreknowledge of how much future heap is required, how much main-process stack will be required, how many threads are yet to come, and how much stack space will be required by each thread?

The programmer can program under the rule of "who cares, the O/S will figure it out and there is plenty of memory available". This attitude often leads to frustrated users of the program (and future support calls).

Or the programmer can attempt to future-proof the program to a great extent with some additional programming (as you are in the process of doing).

Jim Dempsey
