Apparent increase in stack memory usage following upgrade to ifort 2017u6


My organization has recently upgraded the Intel compiler we use for our software from Intel Fortran 2013 SP1 update 3 to Intel Fortran 2017 update 6.

However, we have observed that our software now encounters stack overflows in both non-optimized debug and optimized release runs of the ifort2017u6 build, whereas the corresponding ifort2013u3 builds both run without any stack overflow. These builds use the same source code, and the compiler settings were set to be effectively equivalent.

This is concerning to us, since we deliberately cap the /STACK reserve size for our software and we don't want to be unable to run simulations that previously ran.

I was also able to corroborate an increase in stack memory usage with the new compiler by comparing the loc() address of an integer at the top of the call stack with the loc() of a local further down. Comparing stack memory addresses in the routine that encounters the overflow, I see about 15% more stack memory usage.

I'm seeking to learn about any programming styles that may cause the new compiler's stack memory usage to swell compared to the older 2013 SP1 update 3 compiler, and/or about any Intel tools that may help troubleshoot this change in stack memory usage.


Accepted Solutions

One cause of additional stack consumption is expressions that now use (stack) temporary arrays where formerly they did not.

Some of this came in when the new default behavior was to enable standard-realloc-lhs. (aka assume realloc_lhs). Try adding option nostandard-realloc-lhs. (aka assume norealloc_lhs).
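For illustration only (a minimal sketch with hypothetical names, not code from this thread), this is the kind of assignment affected by the (re)allocation-on-assignment semantics:

```fortran
program realloc_demo
   implicit none
   real, allocatable :: a(:), b(:)
   allocate (b(1000000))
   b = 1.0
   ! Under standard realloc_lhs semantics, 'a' may be (re)allocated to
   ! match the shape of the right-hand side, and the right-hand side
   ! may be evaluated into a temporary first.  Under norealloc_lhs,
   ! 'a' must already have the correct shape.
   a = 2.0*b
end program realloc_demo
```

Whether such a temporary lands on the stack or the heap is an implementation detail that can change between compiler versions.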

If this doesn't work, then you may have to find the statement causing the problem and change the code from an implied do loop into an explicit do loop.

Or increase your default stack size (linker option).
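As a sketch of the rewrite (hypothetical names; assumes `a` is already allocated with at least n elements):

```fortran
! An implied do inside an array constructor; the constructed value may
! be built in a (stack) temporary before being assigned:
a = [ (b(j)*c, j = 1, n) ]

! The explicit do loop assigns element by element, with no array
! temporary required:
do j = 1, n
   a(j) = b(j)*c
end do
```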

Jim Dempsey

24 Replies

Have you read the topic on heap-arrays? https://software.intel.com/en-us/node/678037


Thanks Jim Dempsey and Andrew for the prompt posts! They are both helpful in homing in.

Jim Dempsey: I had already been using /nostandard-realloc-lhs, so I guess this difference is only related to more aggressive stack utilization for implied do loops? It's a shame this sort of change isn't flagged; Intel leaves its customers in quagmires. Note, of course, that it's no small task to ask users to review all the implied do's in their code just to restore the balance of efficient stack usage and stack overflow protection they fought hard for previously.

Andrew: I see great value in the heap-arrays setting you pointed at as a means of providing the overflow protection I am seeking. That said, while it can readily provide an increase in global stack overflow protection throughout our code, it may do so at the expense of efficient stack-based run-time performance in various areas of our code. Ideally we would not be forced into such win/loss choices by just a compiler upgrade.

Both old and new compiler builds in my case were built with the Fortran > Optimization > Heap Arrays setting left blank. I wonder if perhaps the blank value (which, strangely, is not explicitly documented as meaning infinite) had some large "default" setting in 14.0.3.202 that was changed in 17.0.6.270? If so, knowledge of this specific default value in 14.0.3.202 (or the means to derive it) would be very valuable.

Craig


I guess you could consider "not setting" heap-arrays to be the same as /heap_arrays:infinity. In any case, there is no "default huge value" that could have changed between 14.0 and 17.0.

I agree with Jim that there is likely to be a statement that is using stack in 17.0 when previously it was not. I'm not immediately suspicious of implied do loops, but perhaps Jim has seen this one before?

Anyway - if you add /check:arg_temp_created to the command line, does it give any messages?

                           --Lorri

 


Yep, I actually checked with /check:arg_temp_created already and it didn't reveal anything. It is true, though, that the stack overflow occurs on a Fortran code line involving what I've gathered is the aforementioned "implied do": a loop without an explicit do statement.

Thanks Lorri for clarifying, more directly than the documentation, that a lack of a heap-arrays setting should be equivalent to /heap_arrays:inf. If this is true, then I'm just about certain the 14.0 compiler must not have been putting the temporary for the implied do causing the exception on the stack, since the array in question is so large it exceeds the entire stack reserve size.

My current plan is to run some tests using fairly large values for heap-arrays, in hopes that the run-time penalty is limited. This is of course extremely difficult to conclude generally, and I wish Intel were more considerate in its compiler development about these sorts of changes, but at least with the help of everybody here there is a course forward.

Thanks again everybody.


As a matter of interest, is it possible to post a snippet that gives the "offending" code that creates the temp? As an aside, I am not that convinced you will see a highly significant difference in overall performance between some temps on the stack versus the heap. Code changes that eliminate the creation of large temps are likely to have a big impact.


The "offending code" has this form:

! In a module:
type dtype_type
   real, allocatable :: real_array(:)
end type
real :: real_heap_scalar
type(dtype_type), allocatable :: dtype(:)

! In the procedure:
use mod, only : real_heap_scalar, dtype
real, allocatable :: real_heap_array(:)
integer, allocatable :: isort_heap_array(:)
real :: real_local_scalar

dtype(i)%real_array = real_heap_array(isort_heap_array(:n)+1)*real_local_scalar + real_heap_scalar

Just take my word for it that it's programmed correctly with regards to allocations :).

I agree that, when it comes down to it, explicit do loops are preferable, but it's not really reasonable for me to find every place in our code where a non-explicit do loop is used and check it for this possibility.


Just for kicks, try:

dtype(i)%real_array(:n)=real_heap_array(isort_heap_array(:n)+1)*real_local_scalar+real_heap_scalar

Make any adjustments for 0-based or 1-based real_array.

Jim Dempsey


The issue is more likely to be the vector indexing (with intermediate addition operation) of real_heap_array.

"Implied do" has a reasonably specific meaning within the Fortran language - `(expr, idx=start, finish)`. There are no implied do's in that example statement.  The statement includes an array operation (one that happens to also include vector indexing - `array(some_other_array)`.

 


Vector indexing, as above, with isort_heap_array(:n) containing a very large number of cells, will require 1x or 2x the size of the array for temporaries. The occurrences of these statements may be few, and manageable to rewrite.
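Applied to the statement posted earlier, such a rewrite might look like the following sketch (it assumes dtype(i)%real_array is already allocated with at least n elements and is 1-based):

```fortran
! Element-by-element assignment avoids the large temporary that the
! vector-indexed gather on the right-hand side would otherwise need:
do j = 1, n
   dtype(i)%real_array(j) = real_heap_array(isort_heap_array(j)+1) &
                            * real_local_scalar + real_heap_scalar
end do
```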

Jim Dempsey

 


IanH - thanks for the language correction. You are correct that I was misusing "implied do" when I more colloquially meant a compiler-generated loop without an explicit do statement in the code.

It occurred to me overnight that, while /heap_arrays is a nice blunt instrument for stack overflow protection, there is probably some extra overhead added by simply specifying the option itself, given that Lorri Menard clarified that the compiler behavior without any /heap_arrays option is equivalent to /heap_arrays:infinity (an always-put-on-the-stack rule). By this I mean: even if 100% of the temporary arrays are kept on the stack, just having the option set seems like it would slow down the code? This overhead would be added to every temporary-array-producing bit of code, regardless of its size. Lorri Menard, could you confirm?

Thanks for the comment Jim Dempsey, though respectfully I'm working with roughly 1,750 source files and 22,000 procedures, so I don't quite agree about the manageability. Correct me if I'm wrong, but I don't think all the variants of Fortran syntax that could lead the compiler to produce vector indexing could be precisely corralled with a regex (without a good number of false positives)?


Craig,

Are you aware that in Microsoft Visual Studio, if you use the Solution Explorer and right-click on the file containing the isort_heap_array statement exhibiting the error, then select Properties, you can specify heap-arrays options for that specific file (and do this for other specific files as required)?

Jim Dempsey


"....some new extra overhead being added by simply the option itself. ". For the most part the stack/heap decision for temps must surely be at compile time not execution. Rather than speculate on efficiency  run some tests, that should be quite easy to do?

 


andrew_4619

"For the most part the stack/heap decision for temps must surely be at compile time not execution"....since at compile-time the size of the temporary array is not known, then wouldn't this need to be a run-time decision? My point in #12 is that when /heap-arrays is not set at all, no run-time decision needs to be made, whereas this can't be the case if /heap-arrays is set to some user-set threshold.

"Rather than speculate on efficiency  run some tests, that should be quite easy to do?"
I'm no stranger to the potential fruitlessness of taking intuition too far, but suggesting that tests (I assume you mean benchmark tests) are easy to do is a pretty laughable comment. In my experience, contrived little codes have differed even in run-time trend from the same code in a whole-code context, because of the multitude of ways the larger code can influence run-times; moreover, whole-code benchmarking, even run several times at high priority on an otherwise idle machine, yields fairly noisy run-times. This is particularly true for our SSE2 win64 build. [Be aware my organization cares about speedups or slowdowns that are perhaps smaller than you are used to.]

To suggest that it makes more sense for clients to go through all this effort in situations where Intel can simply confirm truths about its compiler's behavior (which granted isn't always possible, but should be easy in this case) seems bizarre to me.

Jim Dempsey, yes, I was aware that the setting is granular to the source-file level like other compiler options; I hope it's still recognized that this prospective task of determining the places where stack usage may have parasitically increased is very burdensome.


Something weird just happened when I tried to reply ... if you see this twice, sorry.

RE:
It occurred to me overnight that, while /heap_arrays is a nice blunt instrument for stack overflow protection, there is probably some extra overhead added by simply specifying the option itself, given that Lorri Menard clarified that the compiler behavior without any /heap_arrays option is equivalent to /heap_arrays:infinity (an always-put-on-the-stack rule). By this I mean: even if 100% of the temporary arrays are kept on the stack, just having the option set seems like it would slow down the code? This overhead would be added to every temporary-array-producing bit of code, regardless of its size. Lorri Menard, could you confirm?

I'm not really sure how to answer this ... let me start with: I think you misunderstood my comment saying that you could consider not specifying heap-arrays as being the same as saying /heap-arrays:infinity. Really, there is a difference in the compiler behavior when /heap-arrays is EXPLICITLY specified, and maybe I shouldn't have said you could look at it that way.

It is not the case that the size is checked at runtime and temps are then put either on the stack or on the heap. From the documentation on /heap-arrays:

If size is specified, the value is only used when the total size of the temporary array or automatic array can be determined at compile time, using compile-time constants. Any arrays known at compile-time to be larger than size are allocated on the heap instead of the stack. For example, if 10 is specified for size:

  • All automatic and temporary arrays equal to or larger than 10 KB are put on the heap.

  • All automatic and temporary arrays smaller than 10 KB are put on the stack.

If size is omitted, and the size of the temporary array or automatic array cannot be determined at compile time, it is assumed that the total size is greater than size and the array is allocated on the heap.
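A sketch illustrating the quoted rules (hypothetical routine; sizes assume 4-byte reals and compilation with /heap-arrays:10):

```fortran
subroutine demo(n)
   integer, intent(in) :: n
   real :: a(1000)   ! ~4 KB, size known at compile time, < 10 KB   -> stack
   real :: b(5000)   ! ~20 KB, size known at compile time, >= 10 KB -> heap
   real :: c(n)      ! size unknown at compile time                 -> heap
   a = 0.0; b = 0.0; c = 0.0
end subroutine demo
```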

As Andrew said, the decision is made at compile time.

     Does this help clarify?

                             --Lorri

 

                                                   

 


First, thanks for the prompt replies everybody - I really appreciate it.

Lorri - I read the help and I understand what /heap_arrays:n does with regard to moving temporary data to either the stack or the heap. The help is clear on that. My comment #12 is reading between the lines regarding the run-time implication of asking the compiler to use different instructions to handle temporary arrays differently based on array size.

Your previous comment did indicate that when no /heap_arrays is set the compiler will always put temps on the stack unconditionally, so I'd imagine (I don't know really, but I was hoping you could say definitively) that when no /heap_arrays is set there is no assembly code generated to perform a size test, or for that matter to ever put temporary arrays on the heap. Consistent with this suspicion, I have observed that the binary size of my .exe increased by several MB when the one and only change made was to compile with /heap_arrays:4000 instead of no /heap_arrays setting at all. I suspect this byte difference is due to the new instructions added to handle the run-time decisions about temp-array memory location (based on the compile-time-known threshold). [Also note that until you clarified that there isn't a hidden maximum temporary-array size for the stack that's always used, even when /heap_arrays isn't specified, one could not draw such a conclusion from the Intel documentation alone.]

When I say in comment #12 that the size of the temp is not known at compile time, I'm referring to the particular temporary array being generated for the particular line of source code in the particular simulation. I'm not sure, but maybe comments #16 and #14 mistakenly thought I was referring to the threshold itself? I find both comments confusing because they seem to suggest run-time equivalence of, say, no /heap_arrays argument vs. /heap_arrays:999999999. If I'm mistaken and #16 and #14 are not off base, this would be good news (though I confess I don't know how the compiler would be able to do it).

Anyway, I've started running some tests. Assuming I'm correct and adding /heap_arrays:n (even with a very large size) is not a good solution for me because of the added overhead, I remain interested in narrowing the criteria by which we know the compiler upgrade will cause more stack usage (or in other ideas).


" but suggesting tests (I assume you mean benchmark tests) are easy to do is a pretty laughable comment. " Your laughter is based on incorrect assumptions. 

"Anyway, I've started running some tests. ", I guess you worked it out then,,,,,

 


Craig,

Another thought came to mind; it may not apply to you.

At some point in past versions, the default behavior for subroutine/function local arrays was to make them SAVEd (when compiled without recursive and/or OpenMP). Due (my guess) to the number of problem reports when users converted to OpenMP (and/or to Fortran standard changes), the default behavior was changed to make these "automatic" (i.e., stack allocated).

If your situation is that:

a) You are (were) not compiling with recursive or OpenMP
b) You have (had) very large subroutine/function local arrays in a few places

then for those few places, compile those few sources with the /Qsave option (-save on Linux).

It would also be advisable to insert debug code to assert that the subroutine/function is not called from a parallel region, and to put a copious comment about making the offending array allocatable or using heap arrays instead. This will not affect the case where the subroutine/function is called from a serial region and the array is then used as a shared array from a parallel region in/below the current scope.
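A sketch of the scenario described (hypothetical name and size): a large local array that the newer default makes automatic (stack allocated), but that /Qsave makes static:

```fortran
subroutine crunch()
   ! Roughly 40 MB of local scratch (10,000,000 x 4-byte reals).
   ! Compiled with /Qsave this array is statically allocated; with the
   ! newer automatic default it is placed on the stack at each call.
   real :: work(10000000)
   work = 0.0
end subroutine crunch
```

Note that a SAVEd array is shared across threads, which is why the assertion about not being called from a parallel region matters.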

Jim Dempsey


If you do not specify /heap-arrays, there is no code generated to check the size of a temporary array; it is simply allocated on the stack.

If you specify /heap-arrays, the *compiler*, at compile time, checks the size of the temporary array and, if it will be less than the threshold, allocates it on the stack. If it will be greater, or **if the compiler cannot tell at compile time**, it allocates the memory from the heap.

As I said in #16, the decision is NOT made at runtime.

Calling our allocate routine does generate more instructions than just growing the stack. It's also possible that, if the calls to the allocate routine are within a hot spot, the optimizer would not be able to be as aggressive.

Still, 4 MB is a pretty big growth.

Is there any chance that you're comparing RELEASE and DEBUG configurations? Or that you also set a /check option?

 
