Hi!
I am running a C++ scientific calculation program that relies heavily on Fortran code. A number of large matrices (128*128*128) are used and passed around in the Fortran code. Right now most of them reside in a common block. They are also passed as arguments to different subroutines, where they are declared as
real*4 matrix(128*128*128)
I suspect this is the reason why I need such a big stack size (150 MB) now that I have parallelized the program with OpenMP. Is this a correct assumption? Without OpenMP the stack size only needs to be around 10 MB.
What can I do to reduce the stack size? Allocate the matrices dynamically?
How do I pass dynamically allocated matrices from C++ (std::vector) to Fortran without declaring the matrix as real*4 matrix(128*128*128) inside the subroutine and thus needing a big stack? That is, how do I pass C++ vectors into Fortran and treat them as dynamically allocated arrays on the Fortran side?
Is there another reason why such a large stack size is needed when I use OpenMP?
Best regards, David
Yes, OpenMP generally requires substantially more stack space than a serial program. All your PRIVATE data is stack allocated, as each thread needs a private copy (stacks are NOT shared by threads, heap is).
I don't believe it's your C -> Fortran calling that is eating up stack space. It is the PRIVATE data in your OMP regions. You could try to revisit your declaration of data in the OMP parallel regions and see if there is data that can be made shared, but in many cases you really do want PRIVATE data (for data safety and correctness).
ron
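A minimal sketch of the distinction Ron describes (hypothetical names, free-form Fortran): the large array stays SHARED, with a single heap copy, while only the small scratch array is PRIVATE and is replicated on each thread's stack.

program private_vs_shared
  implicit none
  integer, parameter :: n = 128*128*128
  real(4), allocatable :: big(:)   ! large array: one heap copy, SHARED by all threads
  real(4) :: work(1024)            ! small scratch array: PRIVATE, one copy per thread stack
  integer :: i

  allocate(big(n))
!$omp parallel do private(work) shared(big)
  do i = 1, n
     work(1) = real(i, 4)          ! per-thread scratch use
     big(i)  = 2.0_4 * work(1)     ! each iteration writes a distinct element, no race
  end do
!$omp end parallel do
  print *, big(1), big(n)
end program private_vs_shared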
Thanks Ron!
I think it is strange that a program that needs a 10 MB stack all of a sudden needs 136 MB of stack. Anyway, I have some follow-up questions. My parallelized do loop contains two subroutine calls, where the first provides input to the second. That is, in the first subroutine some large vectors are filled and then supplied to the second subroutine for crunching.
My private variable list is quite large (~20 variables, 8 of which are vectors of a few thousand elements). I managed to cut the vector lengths in half, but that had no effect on the stack size.
How can vectors of a few thousand integers cause a stack size of 150 MB?
Why didn't I get a reduction in stack size when I cut the vectors in half?
The first subroutine in my parallelized loop calls some other subroutines; could that be a reason for the big stack?
Would it be better to allocate the long private vectors dynamically per thread instead?
That was quite a few questions. Thanks for taking the time to answer them!!
/david
If you have local arrays, /Qopenmp changes their default allocation from static to stack. Each thread then will allocate those arrays on its stack for each subroutine in the parallel region.
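A sketch of that effect (hypothetical subroutine and names): under /Qopenmp a plain local array such as tmp below becomes an automatic, stack-allocated array in every thread, whereas declaring it ALLOCATABLE moves the data to the heap.

subroutine crunch(n, x)
  implicit none
  integer, intent(in)    :: n
  real(4), intent(inout) :: x(n)
  ! real(4) :: tmp(128*128*128)    ! local array: on each thread's stack under /Qopenmp (~8 MB)
  real(4), allocatable :: tmp(:)   ! allocatable instead: the data lives on the heap
  integer :: i

  allocate(tmp(n))
  do i = 1, n
     tmp(i) = 2.0_4 * x(i)
  end do
  x = tmp
  deallocate(tmp)
end subroutine crunch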
Thanks for the update!
I still think it's strange that a 10 MB stack turns into a 136 MB stack when I am using OpenMP. The total size of the private variable list is less than 1 MB!
What about common blocks, are they replicated per thread even if they are not declared PRIVATE?
What happens if I allocate some private variables dynamically outside the parallel region? Will they also be replicated on the stack per thread?
Why didn't I get a reduction in stack size when I reduced the vectors to half their size?
Confusing!
/david
David,
My guess as to the excessive stack consumption is that after your C++ code passes a pointer to its data set, your Fortran code rearranges it either explicitly or implicitly (the compiler creating a stack temporary array for some operations). You need to identify where these occurrences happen and eliminate them. To eliminate them, use THREADPRIVATE to hold the unallocated array descriptors (or pointers to array descriptors). Then at run time have each thread allocate the arrays to the extent it requires. To find the problem areas, specify a smaller stack size and run to the choke point. As you identify the arrays, move their descriptors into the THREADPRIVATE area (then add the allocations).
Jim Dempsey
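A minimal sketch of the THREADPRIVATE pattern Jim describes (hypothetical module and names): the module holds only an unallocated descriptor, and each thread allocates its own heap copy at run time.

module scratch_mod                 ! hypothetical module
  implicit none
  real(4), allocatable :: buf(:)   ! only the (unallocated) array descriptor lives here
!$omp threadprivate(buf)
end module scratch_mod

program threadprivate_demo
  use scratch_mod
  implicit none
  integer, parameter :: nwork = 4096, niter = 1000
  real(4) :: total(niter)
  integer :: i

!$omp parallel
  ! each thread allocates its own copy of buf on the heap, not on its stack
  if (.not. allocated(buf)) allocate(buf(nwork))
!$omp do
  do i = 1, niter
     buf(:)   = real(i, 4)         ! thread-local scratch work
     total(i) = sum(buf)           ! each iteration writes a distinct element of the shared array
  end do
!$omp end do
!$omp end parallel
  print *, total(1), total(niter)
end program threadprivate_demo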
Hi again!
I managed to solve the stack size problem by allocating several vectors dynamically instead of statically. I ended up with a stack size of about 10 MB. Is that reasonable? Can the stack size affect performance in any way?
A related question: will the stack size affect the threading efficiency? What kind of overhead is involved in allocating the stack for each thread?
Thanks!
Another question!!!
The arrays that I access frequently in the innermost loop are now allocated dynamically. Will that affect performance? Which is faster to access, heap or stack memory? I might have some cache misses, but not many.
Thanks!
/davva
There is no difference in access speed, but there is a small overhead for allocating and deallocating heap memory. If your loops are large, the time spent doing this should be inconsequential.
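If an allocation does end up inside a hot loop, it can usually be hoisted. A small sketch (hypothetical names): the scratch array is allocated once before the loop nest and reused in every iteration, so only the memory accesses remain in the hot path.

subroutine hoisted(nouter, ninner, result)
  implicit none
  integer, intent(in)  :: nouter, ninner
  real(4), intent(out) :: result(nouter)
  real(4), allocatable :: scratch(:)
  integer :: i, j

  allocate(scratch(ninner))        ! pay the heap allocation once, not once per iteration
  do i = 1, nouter
     do j = 1, ninner
        scratch(j) = real(i + j, 4)
     end do
     result(i) = sum(scratch)
  end do
  deallocate(scratch)
end subroutine hoisted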
davva,
If the allocation/deallocation in your inner loop is creating unwanted overhead, consider creating a THREADPRIVATE array: on entry to your inner level, check the size of the private array; if it is too small or not allocated, deallocate and allocate (or reallocate); otherwise use the existing allocation (truncated to the desired size). You can also create an array of arrays and index it by OpenMP thread number. *** Caution: this is not advisable if you are using nested threads. Either use a thread-private array or a thread-private unique index into the array of arrays.
Jim Dempsey
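A sketch of the reuse pattern Jim outlines (hypothetical names, THREADPRIVATE variant): the thread-local buffer is grown only when a request exceeds the current allocation, so repeated entries into the hot code skip the allocate/deallocate cost.

module scratch_reuse_mod           ! hypothetical module
  implicit none
  real(4), allocatable :: buf(:)   ! per-thread buffer
!$omp threadprivate(buf)
contains
  subroutine get_scratch(needed)
    ! grow the calling thread's buffer only when it is too small; otherwise reuse it
    integer, intent(in) :: needed
    if (allocated(buf)) then
       if (size(buf) < needed) then
          deallocate(buf)
          allocate(buf(needed))
       end if
    else
       allocate(buf(needed))
    end if
  end subroutine get_scratch
end module scratch_reuse_mod

Inside the parallel region each thread would call get_scratch(n) before using buf(1:n).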
As for passing a std::vector to Fortran: check whether your STL implementation guarantees that the vector's storage is contiguous before handing its data pointer to Fortran as an array.
Jim Dempsey
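Tying back to the original question about passing a std::vector into Fortran, here is a sketch of the Fortran side of such an interface (hypothetical names), assuming the C++ caller declares something like extern "C" void crunch_c(const int* n, float* x); and calls it with the address of the element count and v.data(). The dummy argument has an explicit shape, so no 128*128*128 local array is declared and nothing large lands on the stack.

subroutine crunch_c(n, x) bind(C, name="crunch_c")
  ! hypothetical C-interoperable routine; both arguments arrive by reference from C++
  use, intrinsic :: iso_c_binding, only: c_int, c_float
  implicit none
  integer(c_int), intent(in)    :: n
  real(c_float),  intent(inout) :: x(n)   ! explicit-shape dummy: uses the caller's memory, no copy
  integer :: i

  do i = 1, n
     x(i) = 2.0_c_float * x(i)
  end do
end subroutine crunch_c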
Hi Jim!
Thanks for your answer.
I have already implemented an array of arrays and I am accessing it by thread index, so the caution you mentioned is well taken. The threaded code is not nested today, but you never know about the future. So how are the threads indexed if they are nested? How do I create a thread-unique index?
The original question was whether accessing large arrays is faster or slower if they are allocated on the heap. Allocation is done outside of the inner loops.
/davva
Hi Jim!
We have been relying heavily on std::vector having contiguous memory, and it has worked so far (using the Microsoft STL). Thanks for the heads-up; I will read up on the latest MS STL and see what it guarantees.
/david
