Intel® Fortran Compiler

Best solution for stack overflow + pointers

mel23
Beginner
1,757 Views

I'm not the first one, but I have a stack-overflow problem with some operations on 'big' matrices declared as pointers (MATMUL(), TRANSPOSE(), allocation of 2 matrices, ...), because of the contiguous temporary copies created on the stack.

I've read most of the previous questions asked on this subject, and I understand that there are several solutions, such as increasing the stack size, rewriting the operations as DO loops, or using the 'heap arrays' option...

Can you tell me what's the drawback of the 'heap arrays' solution?

Thanks.

7 Replies
Steven_L_Intel1
Employee
The only "inconvenience" is that it is slightly slower to allocate and deallocate the temporaries from the heap rather than using the stack. My view is that if you have code that is creating large temporaries, this added time is nothing compared to the time taken to copy data into and out from the temporaries.
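For reference, a sketch of how the option Steve describes is spelled on each platform (file names are placeholders; check your compiler version's documentation for the exact syntax):

```shell
# Linux/macOS: put array temporaries larger than 10 KB on the heap
ifort -heap-arrays 10 -o myprog myprog.f90

# Windows: same option, different spelling
ifort /heap-arrays:10 myprog.f90

# Omitting the size places all array temporaries on the heap
ifort -heap-arrays myprog.f90
```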
mel23
Beginner

So between these solutions:

1- devectorize and write nested loops for all operations on big matrices,

and

2- use the heap option,

does 2 seem the best?

Thanks

TimP
Honored Contributor III
I don't know how you associate vectorization with stack overflow. One guess at what you may be referring to: MATMUL will allocate a dynamic result matrix. You might avoid that by using the Intel MKL library's (or some other library's) ?GEMM, or, as you say, by writing the operations out as loops. If the matrix is large, the time penalty for heap allocation should be relatively small, as Steve pointed out earlier, unless it has adverse implications for cache usage in your application. The MKL solution should work best if your machine can gain from threading within MKL; but if you don't care about vectorization, performance probably doesn't interest you.
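As a minimal sketch of the ?GEMM route (assuming a BLAS such as MKL is linked, e.g. with -qmkl): DGEMM writes the product directly into a caller-supplied array, so no compiler-generated temporary is needed for the result.

```fortran
program gemm_demo
  implicit none
  integer, parameter :: n = 4
  real(8) :: a(n,n), b(n,n), c(n,n)
  external dgemm          ! double-precision matrix multiply from BLAS/MKL

  call random_number(a)
  call random_number(b)

  ! c = 1.0*a*b + 0.0*c -- the result lands directly in c,
  ! so no temporary copy is created on the stack
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

  print *, maxval(abs(c - matmul(a, b)))   ! should be ~0 (roundoff only)
end program gemm_demo
```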
jimdempseyatthecove
Honored Contributor III

Mel,

There is a third option, one that I use for my Space Elevator simulation. It is complicated a little by the fact that OpenMP is involved.

This simulator code is (can be) a memory hog and is very CPU intensive. One of the components within the simulation is a tether, simulated as finite elements of segments and beads (e.g., like springs with masses at the ends). Each bead on a tether (the system may have from 8 to many more tethers) carries about 80 real(8) variables, or about 640 bytes of data per bead/segment. A tether may have 10,000 beads (more when making high-fidelity runs). The number of beads per tether is not the same.

Because the simulation is so memory intensive, it is not practical to maintain a set of scratch temporaries per tether. The route chosen was to have scratch temporaries per thread. Additionally, some performance was gained by assigning tethers to threads/processors by way of processor affinity. This also means some memory can be conserved by having different-sized scratch temporaries per thread.

Because heap allocation has considerable overhead, the scratch temporaries are persistent and are dynamically resized only when required.

As the simulation runs, when a function requiring scratch memory is called, it calls a helper function, specifying its dimension requirements; that helper returns a thread-dependent pointer to a user-defined type containing pointers to arrays allocated to at least the extent required.

This function call is relatively lightweight when the previously allocated memory is sufficient for the current requirements. The fast path through the function is:

get thread number
get pointer(thread number)
test pointer%sizeAllocated
return pointer

The actual code has sanity checks, a first-time-call flag, an OpenMP critical section, allocation code if the size test fails, etc. The fast path is a function call to get the OpenMP thread team member number, get a pointer (an integer operation), test an integer, and return a pointer. Very lightweight compared to heap allocation.

The only disadvantage of the scheme is that you cannot use "size(foo%Array)" to obtain the number of elements of the array actually in use.
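Jim's scheme might be sketched roughly like this (compile with OpenMP enabled, e.g. -qopenmp; the names scratch_t, pool, and get_scratch are invented for illustration, and his real code adds the sanity checks, first-call flag, and critical section he mentions):

```fortran
module scratch_mod
  use omp_lib
  implicit none

  type scratch_t
     integer :: sizeAllocated = 0
     real(8), allocatable :: work(:)
  end type scratch_t

  ! one persistent scratch area per OpenMP thread; allocate once
  ! before the parallel region: allocate(pool(omp_get_max_threads()))
  type(scratch_t), allocatable, target :: pool(:)

contains

  function get_scratch(n) result(p)
    integer, intent(in) :: n
    type(scratch_t), pointer :: p
    integer :: tid

    tid = omp_get_thread_num() + 1     ! fast path: get thread number,
    p => pool(tid)                     ! index into the per-thread pool,
    if (p%sizeAllocated >= n) return   ! test the size, return the pointer

    ! slow path: grow this thread's scratch area
    ! (the real code wraps this in sanity checks / a critical section)
    if (allocated(p%work)) deallocate(p%work)
    allocate(p%work(n))
    p%sizeAllocated = n
  end function get_scratch

end module scratch_mod
```

Each thread normally indexes only its own pool entry, so the fast path involves no locking at all, which is what makes it so much cheaper than a heap allocation.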

Jim Dempsey

Steven_L_Intel1
Employee
It all depends on what you want. If speed is important, then you want to avoid anything that makes and copies an array temporary. If you want easy to read code and are running out of stack space, use -heap-arrays.
jimdempseyatthecove
Honored Contributor III

Eliminating unnecessary temporaries and copies as Steve suggests is important.

You should also examine the data layout, and rearrange it if necessary, so that the code can use vector (SSEn) instructions. For Fortran you want the fastest-varying array index on the left (for C/C++, on the right).
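To illustrate the layout point (a generic sketch, not code from this thread): Fortran stores columns contiguously, so the leftmost index belongs in the innermost loop.

```fortran
program layout_demo
  implicit none
  integer, parameter :: n = 1000
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: i, j

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)

  ! Fortran is column-major: the leftmost index addresses adjacent
  ! memory, so it varies fastest in the innermost loop, giving
  ! stride-1 access that the compiler can vectorize.
  do j = 1, n
     do i = 1, n
        c(i,j) = a(i,j) + b(i,j)
     end do
  end do

  print *, c(1,1)
end program layout_demo
```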

If you have memory available, then a simple way to improve performance is to give the local declaration the SAVE attribute, then on entry check whether the current array(s) are large enough for the current call. If not, deallocate and allocate to the new, larger size. Once the largest size has been allocated, there is virtually no further allocation overhead.
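A minimal sketch of that SAVE-based approach (the routine name and the doubling computation are invented for illustration):

```fortran
subroutine process(x, n)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: x(n)
  real(8), allocatable, save :: work(:)   ! persists across calls

  ! (re)allocate only when the saved array is too small; after the
  ! largest call, subsequent calls incur no allocation overhead
  if (.not. allocated(work)) then
     allocate(work(n))
  else if (size(work) < n) then
     deallocate(work)
     allocate(work(n))
  end if

  work(1:n) = 2.0d0 * x        ! use only the first n elements
end subroutine process
```

Note that a SAVEd local like this is shared by all threads, so it is not thread-safe; under OpenMP the per-thread scratch scheme from the earlier reply applies instead.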

If you move the array from a local declaration to a module declaration, then clean-up code can run and release unused memory, if that is important to you.

Jim Dempsey

mel23
Beginner
Ok. Thanks all for these replies.