Intel® Fortran Compiler

Stack size in fortran using OpenMP

davva
Beginner
3,114 Views

Hi!

I am running a C++ scientific calculation program that relies heavily on Fortran code. A number of large matrices (128*128*128) are used and passed around in Fortran. Right now most of them reside in a common block. They are also passed as arguments to different subroutines, where they are declared as

real*4 matrix(128*128*128).

I suspect this is the reason why I need such a big stack size (150 MB) now that I have parallelized the program with OpenMP. Is this a correct assumption? Without OpenMP the stack size only needs to be around 10 MB.

What can I do to reduce the stack size? Allocate the matrices dynamically?

How do I pass dynamically allocated matrices in C++ (std::vector) to Fortran without declaring the matrix as real*4 matrix(128*128*128) inside the subroutine and thus needing a big stack? That is, how do I pass C++ vectors into Fortran and treat them as dynamically allocated vectors in Fortran?

Is there another reason why such large stack sizes are needed when I use OpenMP?

Best regards, David

0 Kudos
1 Solution
Ron_Green
Moderator
3,114 Views

Yes, OpenMP generally requires substantially more stack space than a serial program. All your PRIVATE data is stack allocated, as each thread needs a private copy (stacks are NOT shared by threads, heap is).

I don't believe it's your C -> Fortran calling that is eating up stack space. It is the PRIVATE data in your OMP regions. You could try to revisit your declaration of data in the OMP parallel regions and see if there is data that can be made shared, but in many cases you really do want PRIVATE data (for data safety and correctness).

ron

View solution in original post

0 Kudos
14 Replies
Ron_Green
Moderator
3,115 Views

Yes, OpenMP generally requires substantially more stack space than a serial program. All your PRIVATE data is stack allocated, as each thread needs a private copy (stacks are NOT shared by threads, heap is).

I don't believe it's your C -> Fortran calling that is eating up stack space. It is the PRIVATE data in your OMP regions. You could try to revisit your declaration of data in the OMP parallel regions and see if there is data that can be made shared, but in many cases you really do want PRIVATE data (for data safety and correctness).
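For what it's worth, a minimal sketch of the situation described above (the subroutine and variable names are invented for illustration). The PRIVATE array is 128*128*128 reals, i.e. 8 MB, and every thread carries its own copy on its thread stack, so the stack requirement grows with the number of threads:

    ! Illustrative sketch only: each thread gets its own private 8 MB copy
    ! of "work" on its thread stack, so total stack demand scales with the
    ! number of OpenMP threads.
    subroutine demo_private(n, res)
      implicit none
      integer, intent(in)  :: n
      real(4), intent(out) :: res(n)
      real(4) :: work(128*128*128)
      integer :: i
    !$omp parallel do private(work)
      do i = 1, n
         work(:) = real(i)
         res(i)  = sum(work(1:128))
      end do
    !$omp end parallel do
    end subroutine demo_private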

ron

0 Kudos
davva
Beginner
3,114 Views

Quoting - Ron_Green

Yes, OpenMP generally requires substantially more stack space than a serial program. All your PRIVATE data is stack allocated, as each thread needs a private copy (stacks are NOT shared by threads, heap is).

I don't believe it's your C -> Fortran calling that is eating up stack space. It is the PRIVATE data in your OMP regions. You could try to revisit your declaration of data in the OMP parallel regions and see if there is data that can be made shared, but in many cases you really do want PRIVATE data (for data safety and correctness).

ron

Thanks Ron!

I think it is strange that a program that needs a 10 MB stack all of a sudden needs 136 MB of stack. Anyway, I have some follow-up Q's. My parallelized do loop contains two subroutine calls, where the first provides input to the second. That is, in the first subroutine some large vectors are filled and supplied to the second subroutine for crunching.

My private variable list is quite large (~20 variables, of which 8 are vectors of a few thousand elements). I managed to cut the vector lengths in half but that had no effect on the stack size.

How can vectors of a few thousand integers cause a stack size of 150 MB?

Why didn't I get a reduction in stack size when I cut the vectors in half?

The first subroutine in my parallelized loop calls some other subroutines; could that be a reason for the big stack?

Would it be better to allocate the long private vectors dynamically per thread instead?

That was quite a few questions. Thanks for taking the time to answer them!!

/david

0 Kudos
TimP
Honored Contributor III
3,114 Views

If you have local arrays, /Qopenmp changes their default allocation from static to stack. Each thread then will allocate those arrays on its stack for each subroutine in the parallel region.
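To make that concrete, here is a rough sketch (names invented for illustration) of a routine called inside a parallel region. Compiled with /Qopenmp, a fixed-size local array would land on each calling thread's stack; declaring it ALLOCATABLE moves it to the heap instead:

    ! Sketch: as a fixed-size local array, "scratch" would be placed on the
    ! calling thread's stack under /Qopenmp; as an ALLOCATABLE it lives on
    ! the heap, shrinking the per-thread stack requirement.
    subroutine crunch(n, x)
      implicit none
      integer, intent(in)    :: n
      real(4), intent(inout) :: x(n)
      real(4), allocatable   :: scratch(:)
      allocate(scratch(n))
      scratch = 2.0 * x
      x = scratch
      deallocate(scratch)
    end subroutine crunch

The /heap-arrays compiler option (if your compiler version supports it) is another way to push automatic arrays and temporaries to the heap, at the cost of some allocation overhead.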

0 Kudos
davva
Beginner
3,114 Views
Quoting - tim18

If you have local arrays, /Qopenmp changes their default allocation from static to stack. Each thread then will allocate those arrays on its stack for each subroutine in the parallel region.

Thanks for the update!

I still think it's strange that a 10 MB stack turns into a 136 MB stack when I am using OpenMP. The total memory size of the private variable list is <1 MB!

What about common blocks, are they reallocated per thread even if they are not declared PRIVATE?

What happens if I allocate some private variables dynamically outside the threaded area? Will they also be reallocated on the stack per thread?

Why didn't I get a reduction in stack size when I reduced the vectors to half their size?

Confusing!

/david

0 Kudos
jimdempseyatthecove
Honored Contributor III
3,114 Views


David,

My guess as to the excessive stack consumption is that, after your C code passes a pointer to its dataset, your Fortran code is rearranging it either explicitly or implicitly (the compiler creating a stack temporary array for some operations). You need to identify where these occurrences are happening and eliminate them. To eliminate them, use THREADPRIVATE to hold the unallocated array descriptors (or pointers to array descriptors). Then, at run time, have each thread allocate the arrays to the extent it requires. To find the problem areas, specify a smaller stack size and run to the choke point. As you identify the arrays, move their descriptors into the THREADPRIVATE area (then add the allocations).
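A rough sketch of this approach (module, routine, and variable names are invented for illustration), with each thread allocating its own heap copy to just the size it needs:

    ! Sketch: "buf" is THREADPRIVATE, so each thread has its own descriptor;
    ! the allocation itself goes on the heap, not on the thread stack.
    module thread_scratch
      implicit none
      real(4), allocatable :: buf(:)
    !$omp threadprivate(buf)
    end module thread_scratch

    subroutine use_scratch(n)
      use thread_scratch
      implicit none
      integer, intent(in) :: n
      integer :: i
    !$omp parallel
      if (.not. allocated(buf)) allocate(buf(n))   ! per-thread heap allocation
    !$omp do
      do i = 1, n
         buf(i) = real(i)
      end do
    !$omp end do
    !$omp end parallel
    end subroutine use_scratch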

Jim Dempsey

0 Kudos
davva
Beginner
3,114 Views

Quoting - Ron_Green

Yes, OpenMP generally requires substantially more stack space than a serial program. All your PRIVATE data is stack allocated, as each thread needs a private copy (stacks are NOT shared by threads, heap is).

I don't believe it's your C -> Fortran calling that is eating up stack space. It is the PRIVATE data in your OMP regions. You could try to revisit your declaration of data in the OMP parallel regions and see if there is data that can be made shared, but in many cases you really do want PRIVATE data (for data safety and correctness).

ron

Hi again!

I managed to solve the stack size problem by allocating several vectors dynamically instead of statically. I ended up with a stack size of about 10 MB. Is that reasonable? Can the stack size affect the performance in any way?

A related question: will the stack size affect the threading efficiency? What kind of overhead is there in allocating the stack for each thread?

Thanks!

0 Kudos
davva
Beginner
3,114 Views
Quoting - davva

Hi again!

I managed to solve the stack size problem by allocating several vectors dynamically instead of statically. I ended up with a stack size of about 10 MB. Is that reasonable? Can the stack size affect the performance in any way?

A related question: will the stack size affect the threading efficiency? What kind of overhead is there in allocating the stack for each thread?

Thanks!

Another question!!!

The arrays that I am accessing frequently in the innermost loop are now allocated dynamically. Will that affect the performance? Which is faster, accessing heap or stack memory? I might have some cache misses but not many.

Thanx!

/davva

0 Kudos
Steven_L_Intel1
Employee
3,114 Views

There is no difference in access speed, but there is a small overhead for allocating and deallocating heap memory. If your loops are large, the time spent doing this should be inconsequential.
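In other words (a small sketch, with invented names), you can pay the heap overhead once by allocating outside the hot loop and reusing the array:

    ! Sketch: one allocation before the loop, reuse inside, free afterwards.
    ! Element access is then no slower than for a stack array.
    subroutine hot_loop(n, m, total)
      implicit none
      integer, intent(in)  :: n, m
      real(4), intent(out) :: total
      real(4), allocatable :: tmp(:)
      integer :: i
      allocate(tmp(m))
      total = 0.0
      do i = 1, n
         tmp   = real(i)
         total = total + tmp(1)
      end do
      deallocate(tmp)
    end subroutine hot_loop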

0 Kudos
jimdempseyatthecove
Honored Contributor III
3,114 Views

davva,

If the allocation/deallocation in your inner loop is creating unwanted overhead, consider creating a THREADPRIVATE array: on entry to your inner level, check the size of the private array; if it is too small or not allocated, delete and allocate (or reallocate) it; otherwise use the existing allocation (truncated to the desired size). You can also create an array of arrays and index that by OpenMP thread number. *** Caution: this is not advisable if you are using nested threads. Either use a thread-private array or a thread-private unique index into the array of arrays.
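A sketch of that grow-only reuse pattern (names invented; it builds on the THREADPRIVATE idea from the earlier post):

    ! Sketch: keep a THREADPRIVATE buffer and reallocate only when the
    ! requested size exceeds what is already allocated.
    module reuse_scratch
      implicit none
      real(4), allocatable :: buf(:)
    !$omp threadprivate(buf)
    contains
      subroutine ensure_size(n)
        integer, intent(in) :: n
        if (allocated(buf)) then
           if (size(buf) < n) then
              deallocate(buf)           ! too small: grow it
              allocate(buf(n))
           end if
        else
           allocate(buf(n))             ! first use on this thread
        end if
      end subroutine ensure_size
    end module reuse_scratch

Each thread calls ensure_size at the top of its inner-level work and then uses buf(1:n) as its scratch area.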

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
3,114 Views

As for passing an std::vector on to Fortran: the vector's storage scheme is purposely opaque. Data is intended to be accessed by way of member functions only, not by an independent pointer and index. There is no requirement for the floats to be stored together in a way that permits you to pass the address of the first float plus a size on to a Fortran subroutine for use. Although your version of std::vector may use a contiguous array today, std::vector could change at the next software update, and a new and improved version could break your code. Example: assume you were to make a programming "improvement" by converting to parallel programming techniques using tbb::concurrent_vector (a thread-safe vector container); then you are almost guaranteed that the floats will end up in discontiguous chunks once the number of floats exceeds the first bucket threshold. Initial testing with small numbers of floats will work, but later testing will fail (usually at your customer's site). If your computational requirements are high, consider not using containers that do not guarantee, forever, access to the address of a contiguous array. It is not very hard to make your own containers, for your own purposes, that do exactly what you want.
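For reference, if you do decide to hand a raw pointer plus a length across the boundary (which is exactly the contiguity assumption discussed above), the Fortran side can receive it with the Fortran 2003 ISO_C_BINDING module, if your compiler supports it. The routine name below is invented for illustration; the C++ caller would pass something like process_floats(v.data(), static_cast<int>(v.size())):

    ! Sketch: a C-callable Fortran routine taking a float pointer and a count.
    subroutine process_floats(data, n) bind(c, name="process_floats")
      use iso_c_binding, only: c_float, c_int
      implicit none
      integer(c_int), value        :: n
      real(c_float), intent(inout) :: data(n)
      data = 2.0_c_float * data    ! operate on the caller's storage in place
    end subroutine process_floats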

Jim Dempsey

0 Kudos
davva
Beginner
3,114 Views

Quoting - jimdempseyatthecove

davva,

If the allocation/deallocation in your inner loop is creating unwanted overhead, consider creating a THREADPRIVATE array: on entry to your inner level, check the size of the private array; if it is too small or not allocated, delete and allocate (or reallocate) it; otherwise use the existing allocation (truncated to the desired size). You can also create an array of arrays and index that by OpenMP thread number. *** Caution: this is not advisable if you are using nested threads. Either use a thread-private array or a thread-private unique index into the array of arrays.

Jim Dempsey

HI Jim!

Thanx for your answer.

I have already implemented an array of arrays and am accessing those by thread index, so that was a good caution you mentioned. The threaded code is not nested today, but you never know about the future. So how are the threads indexed if they are nested? How do I create a thread-unique index?

The original question was whether accessing large arrays is faster or slower if they are allocated on the heap. Allocations are done outside of the inner loops.

/davva

0 Kudos
davva
Beginner
3,114 Views

Quoting - jimdempseyatthecove

As for passing an std::vector on to Fortran: the vector's storage scheme is purposely opaque. Data is intended to be accessed by way of member functions only, not by an independent pointer and index. There is no requirement for the floats to be stored together in a way that permits you to pass the address of the first float plus a size on to a Fortran subroutine for use. Although your version of std::vector may use a contiguous array today, std::vector could change at the next software update, and a new and improved version could break your code. Example: assume you were to make a programming "improvement" by converting to parallel programming techniques using tbb::concurrent_vector (a thread-safe vector container); then you are almost guaranteed that the floats will end up in discontiguous chunks once the number of floats exceeds the first bucket threshold. Initial testing with small numbers of floats will work, but later testing will fail (usually at your customer's site). If your computational requirements are high, consider not using containers that do not guarantee, forever, access to the address of a contiguous array. It is not very hard to make your own containers, for your own purposes, that do exactly what you want.

Jim Dempsey

Hi Jim!

We have been relying heavily on std::vectors having contiguous memory and it has worked so far (using the Microsoft STL). Thanks for the heads-up; I will read up on MS's latest STL and see what they guarantee.

/david

0 Kudos
Reply