Thread stack (default 4mb per

Peter_F_5 · ‎07-27-2017

I am trying to run an OpenMP Fortran code using Intel Fortran.

At the beginning of my code I allocate 7 very large matrices:

    ! Allocate 
    REAL, ALLOCATABLE :: m1(:,:,:,:,:,:,:,:,:)
    REAL, ALLOCATABLE :: m2(:,:,:,:,:,:,:,:,:)
    REAL, ALLOCATABLE :: m3(:,:,:,:,:,:,:,:,:)
    REAL, ALLOCATABLE :: m4(:,:,:,:,:,:,:,:,:)
    REAL, ALLOCATABLE :: m5(:,:,:,:,:,:,:,:,:)
    REAL, ALLOCATABLE :: m6(:,:,:,:,:,:,:,:,:)
    REAL, ALLOCATABLE :: m7(:,:,:,:,:,:,:,:,:)
    
    ALLOCATE(m1(2,161,20,2,2,21,30,2,2))   
    ALLOCATE(m2(2,161,20,2,2,21,30,2,2))   
    ALLOCATE(m3(2,161,20,2,2,21,30,2,2))   
    ALLOCATE(m4(2,161,20,2,2,21,30,2,2))   
    ALLOCATE(m5(2,161,20,2,2,21,30,2,2))   
    ALLOCATE(m6(1,161,20,2,2,21,30,2,2))          
    ALLOCATE(m7(1,161,20,2,2,21,30,2,2))

I then run code built around one big parallelized loop:

    !$omp parallel do default(private) shared(m1, m2, m3, m4, m5, m7,  someothervariables)

Some of the matrices `m1` to `m7` are used in subroutines, which might lead to the creation of temporary arrays.
In any case, the code runs fine serially, but it crashes with the following error when run with OpenMP, regardless of the number of cores I use:

The machine I am using has 128 GB of RAM. I am not sure whether this is the limiting factor. If I decrease the last index of each matrix to 1, the code runs fine on 24 cores. Given that I am only increasing the memory used by a factor of 2, shouldn't it at least run on 12 cores?
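For reference, the footprint of the seven arrays themselves can be worked out from the ALLOCATE statements above. The sketch below assumes the default 4-byte REAL (no compiler option changing the default real kind); the program and variable names are illustrative only. It counts just the allocatable arrays, not any temporaries or private copies created inside the parallel region.

```fortran
program array_footprint
    use, intrinsic :: iso_fortran_env, only: int64, real64
    implicit none
    ! Extents copied from the ALLOCATE statements in the post.
    integer, parameter :: full_shape(9) = [2, 161, 20, 2, 2, 21, 30, 2, 2]  ! m1..m5
    integer, parameter :: half_shape(9) = [1, 161, 20, 2, 2, 21, 30, 2, 2]  ! m6, m7
    ! Assumption: default REAL occupies 4 bytes.
    integer(int64), parameter :: bytes_per_real = 4_int64
    integer(int64) :: full_bytes, half_bytes, total_bytes

    ! Convert to 64-bit before multiplying so the product cannot overflow.
    full_bytes  = product(int(full_shape, int64)) * bytes_per_real
    half_bytes  = product(int(half_shape, int64)) * bytes_per_real
    total_bytes = 5_int64 * full_bytes + 2_int64 * half_bytes

    print '(a, i0)',   'bytes per array, m1..m5: ', full_bytes
    print '(a, i0)',   'bytes per array, m6, m7: ', half_bytes
    print '(a, i0)',   'total bytes, m1..m7:     ', total_bytes
    print '(a, f0.2)', 'total in GiB:            ', &
        real(total_bytes, real64) / 2.0_real64**30
end program array_footprint
```

Under that assumption the seven arrays together come to roughly 1.45 GiB, so the arrays on their own are far below 128 GB.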

Maybe I am making some big mistake, or perhaps some Fortran option just needs to be changed ...

IanH · ‎07-27-2017

OpenMP programs tend to use much more stack - temporaries need to be created for each thread, and as each thread has an independent stack, that is a convenient place for the compiler to put those temporaries.

What stack size have you specified for the link of your program? Perhaps it needs to be bigger (the default is quite small in the context of today's programs).

Are you using /heaparrays:0 ? If not, try it - the compiler will then use the heap for temporaries, significantly reducing stack space requirements.
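To illustrate the stack-versus-heap point, here is a minimal sketch, not the original poster's code. The module, subroutine, and variable names are all hypothetical. `work_stack` uses an automatic array, which compilers commonly place on the calling thread's stack; `work_heap` uses an explicit ALLOCATABLE, which is allocated dynamically and so does not grow the thread stack.

```fortran
module work_mod
    implicit none
contains
    ! Automatic array: its size is known only at run time, and it is
    ! commonly placed on the stack of whichever thread makes the call.
    subroutine work_stack(n, x, s)
        integer, intent(in)  :: n
        real,    intent(in)  :: x(n)
        real,    intent(out) :: s
        real :: tmp(n)               ! automatic array
        tmp = 2.0 * x
        s = sum(tmp)
    end subroutine work_stack

    ! Explicit ALLOCATABLE: allocated dynamically, so the per-thread
    ! stack requirement no longer scales with n.
    subroutine work_heap(n, x, s)
        integer, intent(in)  :: n
        real,    intent(in)  :: x(n)
        real,    intent(out) :: s
        real, allocatable :: tmp(:)
        allocate(tmp(n))
        tmp = 2.0 * x
        s = sum(tmp)
        deallocate(tmp)
    end subroutine work_heap
end module work_mod

program heap_vs_stack
    use work_mod
    implicit none
    integer, parameter :: n = 1000, ntasks = 8
    real :: x(n), s(ntasks)
    integer :: i

    x = 1.0
    ! Each iteration calls the worker on its own thread, so every thread
    ! needs room for its own copy of tmp.
    !$omp parallel do default(none) shared(x, s) private(i)
    do i = 1, ntasks
        call work_heap(n, x, s(i))   ! swap in work_stack to compare
    end do
    !$omp end parallel do
    print *, s
end program heap_vs_stack
```

The program needs OpenMP enabled at compile time (for example `/Qopenmp` with Intel Fortran on Windows); without it the directive is treated as a comment and the loop runs serially. With a small `n` both variants behave the same; the difference only matters once `n` is large enough that per-thread temporaries approach the thread stack size.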

Peter_F_5 · ‎07-27-2017

Thanks for the reply. These are the options I am using which I think are exactly what you mentioned:

IanH · ‎07-27-2017

How about the /heap-arrays setting (note I spelt it wrong)?

Peter_F_5 · ‎07-27-2017

I will try that. For now, setting the Stack Commit size to 999999999 apparently solved the problem. I am not sure why, and I still need to double-check, but that seems to be the case.

TimP · ‎07-27-2017

Thread stack (default 4 MB per thread) counts against stack reserve.
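In standard OpenMP, the stack size of threads created by the runtime is controlled by the `OMP_STACKSIZE` environment variable; it does not apply to the initial thread, whose stack is fixed when the program is linked. The default is implementation-defined. The sketch below (program name hypothetical) simply reports what the running program sees, which can help confirm that a setting actually reached the process.

```fortran
program show_stack_env
    use omp_lib, only: omp_get_max_threads
    implicit none
    character(len=64) :: val
    integer :: length, stat

    ! stat is 0 when the variable exists and fits into val.
    call get_environment_variable('OMP_STACKSIZE', val, length, stat)
    if (stat == 0 .and. length > 0) then
        print '(2a)', 'OMP_STACKSIZE = ', trim(val)
    else
        print '(a)', 'OMP_STACKSIZE not set (or unreadable); runtime default applies'
    end if

    ! Each of these threads gets its own stack.
    print '(a, i0)', 'max OpenMP threads: ', omp_get_max_threads()
end program show_stack_env
```

Like any program using `omp_lib`, this must be compiled with OpenMP enabled.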

jimdempseyatthecove · ‎07-28-2017

>>Thread stack (default 4 MB per thread) counts against stack reserve.

https://msdn.microsoft.com/en-us/library/windows/desktop/ms686774(v=vs.85).aspx
https://docs.microsoft.com/en-us/cpp/build/reference/stack-stack-allocations

The above MS links do an abysmal job of defining just what the stack reserve setting does with respect to a multi-threaded program. Maybe you can comment on this further, Tim P.

My interpretation, which is not backed by information from anyone who really knows, is that the reserve size is the amount of virtual address space set aside for stack allocations. Note that this is not the same as specifying a stack size. Also note that there is no description of what happens between program start and the points where threads are added and removed, in other words, how the reserved address space is carved up as you add threads.

My assumption (completely unfounded) is that the second thread gets half of the reserve space (as its own reserve), diminishing the reserve space of the first thread in the process. The third thread gets half of the larger of the two existing reserved spaces, diminishing the reserve space of the thread whose space was taken; the fourth thread gets half of the largest of the existing reserved spaces, again diminishing the reserve space of the thread whose space was taken; and so on. Actual stack allocation from the page file (or physical RAM) occurs only when a thread touches a previously untouched page of the virtual address space.

What are your thoughts?

Jim Dempsey
