Intel® Fortran Compiler

Dynamic array reservation performance question

Karanta__Antti

I am investigating a performance issue in our Fortran code. I used VTune to profile the execution and found that about 40% of the execution time was spent in for_allocate and for_deallocate in a certain routine.

The baffling part for me was that there is no explicit ALLOCATE or ALLOCATABLE use in that routine.
These are the only dynamic reservations:

REAL :: U(N),V(N),UU(2*N),VV(2*N) 

where N is an INTEGER dummy argument of the routine (not a PARAMETER constant).

NB: REAL is 8 bytes due to using /4R8 compiler option, although that should not be relevant here.
NB2: I experimented using ALLOCATABLE arrays + explicit ALLOCATE, but that made no difference (as expected).

We are also using option /heap-arrays:16 so arrays under 16kB should be reserved on the stack.
N above varies somewhere between 8...100 so we should be well under the 16k limit.
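As a quick sanity check of the sizes involved, the sketch below adds up the storage of the four arrays for the largest N mentioned above. It assumes 8-byte reals (spelled explicitly as real64 so the result does not depend on /4R8) and takes the documented meaning of the /heap-arrays argument, kilobytes, at face value. The program name and variable names are made up for illustration.

```fortran
program array_footprint
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: n = 100                  ! largest N mentioned in the post
  integer, parameter :: limit_bytes = 16 * 1024  ! /heap-arrays:16, read as kilobytes
  integer :: bytes_per_real, total_bytes

  ! storage_size returns bits, so divide by 8 to get bytes per element
  bytes_per_real = storage_size(1.0_real64) / 8

  ! U(N), V(N), UU(2*N) and VV(2*N) together hold 6*N elements
  total_bytes = bytes_per_real * (n + n + 2*n + 2*n)

  print '(a,i0)', 'total bytes for all four arrays: ', total_bytes
  print '(a,i0)', 'threshold in bytes:              ', limit_bytes
  print '(a,l1)', 'under the threshold:             ', total_bytes < limit_bytes
end program array_footprint
```

For N = 100 that is 4800 bytes for all four arrays together, with the largest single array at 1600 bytes.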

Stack allocation should be very fast, so I don't know why we are spending so much time in for_allocate and for_deallocate.

I experimented reserving "big enough" arrays using the SAVE attribute (i.e. statically reserved arrays) and the problem went away (as expected).
I can use this as a workaround, but it's a bit ugly. And I would like to understand why we are spending so much time in what should be stack allocation and thus very fast.

What should I look into?


Environment:
ifort 2019.5, x64 target
Windows 10, version 1909

jimdempseyatthecove

>>The baffling part for me was that there is no explicit ALLOCATE or ALLOCATABLE use in that routine.

What is (likely) happening is you have replaced (nested) DO loops for arrays with array expressions:

arrayOut = ArrayIn operator otherArray

and as such require the Fortran compiler to allocate a temporary for the result prior to the assignment.

The (potential) solution is to replace the array expression with DO loop(s).
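A minimal sketch of the two forms, for illustration only. The array names and the expression are invented, and whether the compiler really creates a temporary for a given expression is an assumption to check in the profile or the generated code, not something this example proves.

```fortran
program expr_vs_loop
  implicit none
  integer, parameter :: n = 8
  real :: a(n), b(n), out_expr(n), out_loop(n)
  integer :: i

  a = [(real(i), i = 1, n)]
  b = 2.0

  ! Array expression form: the compiler may evaluate the right-hand side
  ! into a temporary before storing it (depends on compiler and options).
  out_expr = a * b + 1.0

  ! Explicit loop form: each element is computed and stored directly.
  do i = 1, n
    out_loop(i) = a(i) * b(i) + 1.0
  end do

  print *, 'results identical: ', all(out_expr == out_loop)
end program expr_vs_loop
```

The two forms are interchangeable here only because the left-hand side does not overlap the operands. With overlapping sections a forward DO loop can give different results than the array assignment.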

 

Jim Dempsey

Karanta__Antti

Hi Jim!

Thanks for the reply! 

I experimented with replacing array expressions w/ do loops previously, sorry for forgetting to mention that. That did cut down on the memory allocation cost, but then I noticed removing the compiler option /assume:dummy_aliases (that was unnecessary for that particular source file) did the same.

The cost that remained after that is the one I asked about, so unfortunately explicit DO loops instead of array expressions did not help here.

Also, the fact that changing the dynamic allocation 

REAL :: U(N),V(N),UU(2*N),VV(2*N) 

(where N is integer procedure parameter)

to

INTEGER, PARAMETER :: STATIC_BUFFER_SIZE = 100
REAL, SAVE, TARGET :: USTATIC(STATIC_BUFFER_SIZE),VSTATIC(STATIC_BUFFER_SIZE),VVSTATIC(2*STATIC_BUFFER_SIZE)

(UU turned out to be unnecessary, TARGET attribute is there as I utilized pointers to allocate a bigger buffer in case 100 is not big enough)

eliminated the issue, without changing any of the actual calculation expressions used. To me this indicates that the array calculation expressions no longer have an effect on the memory reservation overhead.
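The workaround described here (a SAVEd, TARGET buffer with a pointer fallback for larger sizes) could look roughly like the sketch below. It is a guess at the shape of the code, not the actual routine: the module name, the routine scaled_sum and the calculation in it are hypothetical.

```fortran
module scratch_demo
  implicit none
  integer, parameter :: STATIC_BUFFER_SIZE = 100
contains
  subroutine scaled_sum(n, x, total)
    integer, intent(in)  :: n
    real,    intent(in)  :: x(n)
    real,    intent(out) :: total
    real, save, target :: ustatic(STATIC_BUFFER_SIZE)  ! static scratch buffer
    real, pointer :: u(:)
    logical :: on_heap

    ! Use the static buffer when it is big enough, otherwise allocate
    on_heap = n > STATIC_BUFFER_SIZE
    if (on_heap) then
      allocate(u(n))
    else
      u => ustatic(1:n)
    end if

    u = 2.0 * x          ! placeholder for the real calculation
    total = sum(u)

    ! Only free what was actually allocated
    if (on_heap) deallocate(u)
  end subroutine scaled_sum
end module scratch_demo

program demo
  use scratch_demo
  implicit none
  real :: x(250), t

  x = 1.0
  call scaled_sum(10, x(1:10), t)   ! served from the static buffer
  print *, t
  call scaled_sum(250, x, t)        ! falls back to ALLOCATE
  print *, t
end program demo
```

One thing to keep in mind with this pattern: a SAVEd buffer is shared between all calls, so it is not safe if the routine can be called recursively or from several threads at once.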

jimdempseyatthecove

The change to SAVE should not have made a performance (code) difference except for:

a) if heap arrays were in effect, the dynamic arrays would be allocated on the heap, whereas SAVE arrays need no allocation at all

b) SAVE might eliminate an alignment test (peel) for DO loops. You could correct this in dynamic allocation by requesting alignment of the dynamic variables.

Jim Dempsey

Karanta__Antti

Hi Jim!

I managed to narrow the issue down further. The SAVE attribute is not necessary. If I reserve

INTEGER, PARAMETER :: BUFFER_SIZE = 100
REAL, TARGET :: UFIXED(BUFFER_SIZE),VFIXED(BUFFER_SIZE),VVFIXED(2*BUFFER_SIZE)

Then there is no for_allocate / for_deallocate overhead observed. 

However, the 40% performance penalty from for_allocate / for_deallocate is again observed when I just change the above to 

REAL, TARGET :: UFIXED(N),VFIXED(N),VVFIXED(2*N)

where N is a procedure input parameter and always < 100 for the data set I used in the measurement.

I.e. the buffers are smaller than in the case above where the buffer size is known at compile time. This is a bit unexpected. Could this be a compiler bug?
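A small stand-alone comparison of the two declarations might look like the sketch below. The routine names with_fixed and with_automatic, the placeholder calculation and the call count are all invented. The timings only mean something when built with the same options as the real code (in particular /heap-arrays), and an optimizer may inline or simplify such a toy, so a profiler run on the real routine remains the better evidence.

```fortran
module variants
  implicit none
  integer, parameter :: BUFFER_SIZE = 100
contains
  subroutine with_fixed(n, x, total)
    integer, intent(in)  :: n
    real,    intent(in)  :: x(n)
    real,    intent(out) :: total
    real :: ufixed(BUFFER_SIZE)   ! size known at compile time
    ufixed(1:n) = 2.0 * x
    total = sum(ufixed(1:n))
  end subroutine with_fixed

  subroutine with_automatic(n, x, total)
    integer, intent(in)  :: n
    real,    intent(in)  :: x(n)
    real,    intent(out) :: total
    real :: u(n)                  ! automatic array, size known only at run time
    u = 2.0 * x
    total = sum(u)
  end subroutine with_automatic
end module variants

program compare
  use, intrinsic :: iso_fortran_env, only: int64
  use variants
  implicit none
  integer, parameter :: n = 50, calls = 1000000
  real :: x(n), t, acc
  integer(int64) :: c0, c1, c2, rate
  integer :: i

  x = 1.0
  acc = 0.0

  call system_clock(c0, rate)
  do i = 1, calls
    call with_fixed(n, x, t)
    acc = acc + t                 ! keep the result live
  end do
  call system_clock(c1)
  do i = 1, calls
    call with_automatic(n, x, t)
    acc = acc + t
  end do
  call system_clock(c2)

  print '(a,f8.3,a)', 'fixed size:     ', real(c1 - c0) / real(rate), ' s'
  print '(a,f8.3,a)', 'automatic size: ', real(c2 - c1) / real(rate), ' s'
  print *, 'checksum: ', acc
end program compare
```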

I now also tested this w/ ifort 2020.4 (Windows, x64 target). Same result.

jimdempseyatthecove

>>Then there is no for_allocate / for_deallocate overhead observed. 

This implies that heap arrays is in effect. Check your compiler options. Whether you need this feature enabled or disabled is unknown. If you need the option enabled for other reasons, you can still disable it for those subroutines whose allocations are known to be small (see link).

If heap arrays is not on by default, you can explicitly disable this feature (see link)

>>Could this be a compiler bug?

Verify by configuring your code to use (N) and inserting code that prints out the compiler options (see COMPILER_OPTIONS).

and/or

See if heap arrays is enabled by a configuration file. See: Using Configuration Files
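The COMPILER_OPTIONS check suggested above can be done with a few lines. COMPILER_OPTIONS and COMPILER_VERSION are standard inquiry functions in ISO_FORTRAN_ENV (Fortran 2008); the program name is arbitrary. The exact format of the returned string is up to the compiler.

```fortran
program show_options
  use, intrinsic :: iso_fortran_env, only: compiler_options, compiler_version
  implicit none

  ! Both functions return character strings describing how this
  ! particular source file was compiled.
  print '(a)', 'compiler: ' // compiler_version()
  print '(a)', 'options:  ' // compiler_options()
end program show_options
```

Since the result describes the compilation of the file that contains the call, the two print statements are best placed (temporarily) in the source file that holds the problematic routine.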

Jim Dempsey

 

Karanta__Antti

Here's the list of used compiler options as reported by using COMPILER_OPTIONS() from ISO_FORTRAN_ENV. There's quite a few as you can see. I did not remove any, but inserted some newlines to group them a bit more readably. I did not change the order (in case that is significant).

/stand:f18 /Qdiag-disable:5112 /Qdiag-disable:7346 /Qdiag-disable:7355 /Qdiag-disable:5268 /Qdiag-disable:6233 /Qdiag-disable:7163 /Qdiag-disable:7162 /Qdiag-disable:6916 /Qdiag-disable:7342 /Qdiag-disable:7416 /Qdiag-disable:8576 /Qdiag-disable:8044 /Qdiag-disable:8810 /Qdiag-disable:7343 /Qdiag-disable:7026 /Qdiag-disable:7357 /Qdiag-disable:8208 /Qdiag-disable:5142 /Qdiag-disable:7350 /Qdiag-disable:6473 /Qdiag-disable:7165 /Qdiag-disable:6031 /Qdiag-disable:5182 /Qdiag-disable:7344 /Qdiag-disable:6923 /Qdiag-disable:6103 /Qdiag-disable:7925 /Qdiag-disable:7359 /Qdiag-disable:6477 /Qdiag-disable:7352 /Qdiag-disable:7025 /Qdiag-disable:6033 /Qdiag-disable:6028 /Qdiag-disable:7349 /Qdiag-disable:7427 /Qdiag-disable:7320 /Qdiag-disable:7334 /Qdiag-disable:7023 /Qdiag-disable:7374 /Qdiag-disable:6893 /Qdiag-disable:8889 /Qdiag-disable:8872 /Qdiag-disable:8873 /Qdiag-disable:8869 /Qdiag-disable:8891 /Qdiag-disable:8871 /Qdiag-disable:cpu-dispatch /Qdiag-error:6075 /Qdiag-error:6956 /Qdiag-error:5117 /Qdiag-error:6717 /Qdiag-error:6192 /Qdiag-error:6188 /Qdiag-error:6931 /Qdiag-error:6297 /Qdiag-error:6179 /Qdiag-error:7876 /Qdiag-error:7706 /Qdiag-error:6187 /Qdiag-disable:7712 /Qdiag-disable:5462 /Qdiag-disable:8291 /Qdiag-disable:8290

/recursive /Qdiag-disable:7762 /Gm /Qauto /4Ya /fpscomp:logicals /4Yd

/heap-arrays:16

/fpconstant /threads /libs:dll /fpp /nolink /nologo /free

/include:c:\work\napa-newcompiler\napa\sys /module:c:\work\napa-newcompiler\_build\x64\napa\release\nochecks\modules /Foc:\work\napa-newcompiler\_build\x64\napa\release\nochecks\objectfiles\ /DNTDDI_VERSION=0x06030000 /D_WIN32_WINNT=0x0603 /DWINVER=0x0603 /D_CRT_SECURE_NO_WARNINGS /DSTRICT /DWIN32 /D_WIN32 /DUNICODE /D_UNICODE /DMOTIFAPP /DXMSTATIC /D_napa_ /D_TODO_REVIEW_NEXT_ONES_ /D_WIN32_WINDOWS=0x0601 /D_COM_NO_STANDARD_GUIDS_ /DWINNT=0x0601 /DWINDOWS /D_WIN32_ /DUSE_INIT /DNAPA_64BIT /DNDEBUG

/warn:all /nogen-interfaces
/assume:realloc_lhs /debug:minimal /fp:strict /Qftz /align:dcommons /4I8 /4R8

So, we are using heap-arrays with parameter 16, so arrays smaller than 16k should be on the stack, right? That is clearly the case both for U(N) and for U(100) (remembering that N < 100).

 

jimdempseyatthecove

I would first suggest turning heap arrays off (at least for your problematic subroutine). Then run a test checking the results of your problematic subroutine (ignoring any adverse effect elsewhere).

If this removes the allocation issue with your subroutine (without crashing the program), then this would indicate that the 16 argument is likely in bytes as opposed to KB. In this case try using 16384 for 16KB. *** Note, at one point the size value was not used as a threshold, rather 0==off and ~0==on.

If the program crashes, then restore the heap arrays option, as it is (was) required to avoid stack overflow issues, but then configure your build/solution to remove heap arrays from this problematic subroutine. If you are using MS VS, you can right-click on this source file in the Solution Explorer, then traverse down the Properties to locate the options to turn off heap arrays. If you are using a makefile, then you will have to add a rule for this file without heap arrays enabled.

Jim Dempsey

Karanta__Antti

Ok, I'll do some experimenting after weekend and post the results then.

Still, even if the problem stems from the /heap-arrays parameter being in bytes instead of kilobytes (the documentation page clearly says it's kilobytes: https://software.intel.com/content/www/us/en/develop/documentation/fortran-compiler-oneapi-dev-guide-and-reference/top/compiler-reference/compiler-options/compiler-option-details/advanced-optimization-options/heap-arrays.html), that would not explain why U(100) does not cause heap allocation but U(N) with N<100 does.

jimdempseyatthecove

>>that would not explain why U(100) does not cause heap allocation but U(N) with N<100 does.

*** The following has not been verified, I will let that be an exercise for someone else.

In the U(100) case the allocation size is known at compile time, and thus the compiler can (with heap arrays) decide at compile time whether the memory is to come from the stack or the heap.

Whereas in the U(N) case, a runtime check would need to be made. While the runtime check could possibly use the C function alloca (allocate on stack), it is also likely that the compiler developers decided to generate code that either uses the stack (for a literal/PARAMETER size) or uses malloc/free on the heap.
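Short of reading the assembly, a rough runtime probe can hint at where an automatic array ended up. This is a different and weaker technique than inspecting the generated code: it compares the address of the array with the address of an ordinary local scalar, on the assumption that the scalar lives on the stack. The names probe and marker are made up, and the interpretation of the distance is only a heuristic.

```fortran
program stack_or_heap
  use, intrinsic :: iso_c_binding, only: c_loc, c_intptr_t
  implicit none

  call probe(50)

contains

  subroutine probe(n)
    integer, intent(in) :: n
    real, target    :: u(n)     ! automatic array under investigation
    integer, target :: marker   ! ordinary local scalar, assumed to be on the stack
    integer(c_intptr_t) :: addr_u, addr_marker

    u = 0.0
    marker = 0

    ! Reinterpret the C addresses as integers so they can be compared
    addr_u      = transfer(c_loc(u), addr_u)
    addr_marker = transfer(c_loc(marker), addr_marker)

    print '(a,z0)', 'automatic array at 0x', addr_u
    print '(a,z0)', 'local scalar at    0x', addr_marker
    print '(a,i0)', 'distance in bytes: ', abs(addr_u - addr_marker)
  end subroutine probe

end program stack_or_heap
```

A distance of a few hundred bytes suggests the array sits next to the scalar on the stack; a very large distance suggests it came from the heap. Building this once with U(N) and once with a constant size, using the same /heap-arrays setting as the real code, would show whether the two cases are treated differently.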

You can test which way (compile time vs runtime size test/method selection) by looking at the assembly code at the procedure entry point.

If the method of heap/stack size selection is made at compile time only, then this should be documented.

 

Jim Dempsey

Karanta__Antti

Both removing the /heap-arrays parameter altogether and using /heap-arrays:16384 as you suggested seem to resolve the issue. I did not read the assembly but measured the execution w/ VTune and for_allocate and for_deallocate did not come up anymore.

So it could be that /heap-arrays numeric parameter is actually in bytes instead of kilobytes. Which would IMO be a big bug in the documentation.

I think I'll resolve the issue by removing the /heap-arrays parameter from our build. 

Thanks for the help, Jim!

Steve_Lionel

The value in /heap-arrays is near-useless - there are very few situations where it is considered. Either use 0 or don't use the option at all.
