I have some medium sized 3d arrays (approx 500x500x10) for which I sometimes need to copy a section of the array to a different (non-intersecting) section of the same array. E.g.:
a(:,:,1:3) = a(:,:,4:6)
I was surprised to find out that a temporary copy is made (inferred from the stack overflow I get which I don't get if the last subscripts are constants) in this case since it's trivial to confirm that the sections do not overlap. The sollution is then to use a loop, but I find that "unattractive" compared to using the array section language feature.
Is it possible to avoid a copy in some other way?
And another, related question: if the subscript bounds are not literal constants but (scalar) variables I find that not even the loop approach works, i.e.
j = 3
DO i = 1,3
a(:,:,i) = a(:,:,i+j)
also produce a stack overflow. I fail to see what could possibly require the temporary copy in this case... Is it possible to circumvent that in this case?
I know that I can increase the stack size to fix the overflow issue, but performance is somewhat important and so i really would like to minimize unnecessary copying.
Thank you Ian. I forgot to mention that I am allocating the space for B within the subroutine, and therefore overlapping sections would not be possible, which in theory could be seen by the compiler. However, since I am not that experienced in Fortran I might be overseeing something.
Interestingly, using ASSOCIATE as mentioned by FortranFan seems to work not only with arrays but with pointers as well. If I replace
A(from:to) = B(from:to)
associate(lhs => A(from:to), rhs => B(from:to)) lhs = rhs end associate
I also don't run into stack overflows. I understand that this might be dangerous of A and B actually overlap but is there any explanation for this behavior?
The lack of a temporary for the code in #21 is probably a consequence of a compiler bug. While the associate things might not be pointers, they still have the TARGET attribute (F2008 126.96.36.199), so aliasing is still permitted and the compiler has to create temporaries or otherwise detect that there's no overlap between the left and right hand sides. Here's a simple test that demonstrates the problem Steve was talking about in #18.
PROGRAM p IMPLICIT NONE INTEGER, POINTER :: a(:), b(:) ALLOCATE(a, SOURCE=[1,2,3,4,5]) PRINT "(*(I0,:,','))", a b => a(5:1:-1) ASSOCIATE(a2 => a, b2 => b) b2 = a2 END ASSOCIATE PRINT "(*(I0,:,','))", a DEALLOCATE(a) END PROGRAM p
>ifort /check:all /warn:all /standard-semantics "2016-11-26 assoc.f90" Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0 Build 20161005 Copyright (C) 1985-2016 Intel Corporation. All rights reserved. Microsoft (R) Incremental Linker Version 14.00.24215.1 Copyright (C) Microsoft Corporation. All rights reserved. "-out:2016-11-26 assoc.exe" -subsystem:console "2016-11-26 assoc.obj" >"2016-11-26 assoc.exe" 1,2,3,4,5 1,2,3,2,1
In terms of the first paragraph of #21, for the compiler (proper) to be able to realise that an object that references an object created by an allocate statement cannot be aliased to some other pointer later in the code, it requires the compiler to do some reasonably advanced reference tracking (for a Fortran context), and if the compiler ever loses visibility of where a reference comes from, that tracking is not possible. I have no idea what the Intel compiler's capability is like in this area, and anyway, the actual code would be required to see how difficult the required reference tracking is - in #19 as shown the references come in via dummy arguments, so the compiler hasn't got a clue as to their history.
But if the programmer knows that aliasing is not possible, they can always make that obvious to the compiler by passing the relevant variables to a procedure where the dummy arguments do not have the POINTER or TARGET attributes, and then doing the operations in that procedure.
Intel(R) Visual Fortran Compiler 188.8.131.52 [Intel(R) 64] for Windows
Stack overflow on the following line:
arr_to(reordering(1:N)) = arr_from(1:N)
all three arrays are declared allocatable in their respective modules and included with the `use ...` statement.
Is temporary array expected in this case? Or is it something compiler does that it was not supposed to do?
Things were working fine with up to 18th version of the compiler.
Can you provide a small but complete example? A vector subscript on the left side might have duplicate values causing a temp to be needed unless the assignment was done sequentially. The temp might end up being faster if it could be vectorized. Without buildable example, it's impossible to know for sure.
For post #25, are you compiling with /heap-arrays?
Note, the [:size] optional argument appears to be non-functional, so omit.
Also note, that if you desire temporary objects on stack for all other source files, that in MS Visual Studio, you can select the /heap-arrays as a compiler option for the single source file causing the problem. And leave all other source files with default options.
I am sorry Steve and Jim I did not see your posts earlier. Admin approval of my messages takes too much time on this forum so I have stopped checking this thread.
I had reported issue in #25 through Online Service Center and it was escalated to compiler developers.
The problem can be illustrated with this simple piece of code: https://godbolt.org/z/utZ6JF Note, vectorization is turned off with the compiler command line switch.
When you look at the disassembly code, line 189 corresponds to the assignment on line 12 of the source code. So, I figured register R15 holds pointer to the first element of ORDER array. Then, on lines 205-206 this array is copied somewhere.
1. This somewhere is allocated on stack and if the array size is large enough the stack allocation will crash the application. I don't want to use /heap-arrays switch because why would I want to hide incorrect code generation?
2. In addition to the potentially unrecoverable crash, the issue is detrimental to the performance. Basically, whenever you have such indirect assignment, you may expect to spend almost twice as much time doing the job because of additional array copy. In this case, we are lucky to have stack overflow in addition to the array copy so we could analyze the problem and postpone migration to 2019 compiler. In fact, I believe the problem first appeared in one of the updates for Intel 2018 compiler. Because of that, we have to stick to 2017 version.
I have issue with your program listed on godbolt:
line 6: N = 100000
line 8: allocate(order(N),...
line 10: N=order(987) ! *** N now has trash (uninitialized data from allocation)
IOW the trash data may result in N receiving a value larger than size(order) and thus contribute to crash at line 16
line 16: arr_to(order(1:N)) = arr_from(1:N)
That line has nothing to do with allocation of temporary arrays and call to _intel_fast_memcpy on array ORDER. I need that assignement for illustration only or otherwise compiler optimize N out of the assembly.
Here is the same example but without undefined behavior. https://godbolt.org/z/l0Rl8R
_intel_fast_memcpy is on line 217 and R15 register points to array ORDER.
Change the implied loop to explicit loop:
do i=1,N arr_to(order(i)) = arr_from(i) end do
; loop to move two real(8)'s ..B2.13: # Preds ..B2.11 ..B2.13 ; obtain next two indexes into r8 and r10 movsxd r8, DWORD PTR [rbp+r15*8] #45.28 movsxd r10, DWORD PTR [4+rbp+r15*8] #45.28 ; increment the # pairs counter inc r15 #44.5 ; obtain the two arr_from values mov rsi, QWORD PTR [rcx+r14] #45.9 mov r9, QWORD PTR [8+rcx+r14] #45.9 ; advance the next arr_from index add rcx, 16 #44.5 ; insert the two arr_to values (one using r8 and index, the other r10) mov QWORD PTR [-8+r13+r8*8], rsi #45.9 mov QWORD PTR [-8+r13+r10*8], r9 #45.9 ; check for move by two completion cmp r15, rdx #44.5 jb ..B2.13 # Prob 63% #44.5 ;; not done moving ; move by two is done ; check for move odd number (iow remainder) lea esi, DWORD PTR [1+r15+r15] #45.9 ..B2.15: # Preds ..B2.14 ..B2.11 lea edx, DWORD PTR [-1+rsi] #44.5 cmp edx, eax #44.5 jae ..B2.17 # Prob 2% #44.5 ;; no remainder ; have remainder ; produce index into order movsxd rsi, esi #45.28 ; fetch order(i) movsxd rdx, DWORD PTR [-4+rbp+rsi*4] #45.28 ; fetch arr_from(i) mov rax, QWORD PTR [-8+r14+rsi*8] #45.9 ; store last arr_to mov QWORD PTR [-8+r13+rdx*8], rax #45.9 ..B2.17: # Preds ..B2.15 ..B2.10 ..B2.16
No stack temporary
jimdempseyatthecove (Blackbelt) wrote:
Change the implied loop to explicit loop:do i=1,N arr_to(order(i)) = arr_from(i) end do
No stack temporary
I know how to fix the crash. My intention is to help fellow developers who may stumble upon such behavior wrongfully assuming that compiler will generate equivalent code for implied and explicit loops.