Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

copy an array section within an array without temporary copy

Espen_M_
Beginner
2,725 Views

I have some medium sized 3d arrays (approx 500x500x10) for which I sometimes need to copy a section of the array to a different (non-intersecting) section of the same array. E.g.:

[fortran]

a(:,:,1:3) = a(:,:,4:6)

[/fortran]

I was surprised to find out that a temporary copy is made (inferred from the stack overflow I get which I don't get if the last subscripts are constants) in this case since it's trivial to confirm that the sections do not overlap. The sollution is then to use a loop, but I find that "unattractive" compared to using the array section language feature.

Is it possible to avoid a copy in some other way?

And another, related question: if the subscript bounds are not literal constants but (scalar) variables I find that not even the loop approach works, i.e.

[fortran]

j = 3

DO i = 1,3

    a(:,:,i) = a(:,:,i+j)

END DO

[/fortran]

also produce a stack overflow. I fail to see what could possibly require the temporary copy in this case... Is it possible to circumvent that in this case?

I know that I can increase the stack size to fix the overflow issue, but performance is somewhat important and so i really would like to minimize unnecessary copying.

0 Kudos
31 Replies
Steven_L_Intel1
Employee
536 Views

Inside an ASSOCIATE construct, the associate-names don't have the POINTER attribute even if the selectors do. (Same for ALLOCATABLE).

0 Kudos
IanH
Honored Contributor II
536 Views

The lack of a temporary for the code in #21 is probably a consequence of a compiler bug.  While the associate things might not be pointers, they still have the TARGET attribute (F2008 8.1.3.3), so aliasing is still permitted and the compiler has to create temporaries or otherwise detect that there's no overlap between the left and right hand sides.  Here's a simple test that demonstrates the problem Steve was talking about in #18.

PROGRAM p
  IMPLICIT NONE
  
  INTEGER, POINTER :: a(:), b(:)
  
  ALLOCATE(a, SOURCE=[1,2,3,4,5])
  PRINT "(*(I0,:,','))", a
  b => a(5:1:-1)
  
  ASSOCIATE(a2 => a, b2 => b)
    b2 = a2
  END ASSOCIATE
  
  PRINT "(*(I0,:,','))", a
  
  DEALLOCATE(a)
END PROGRAM p


 

>ifort /check:all /warn:all /standard-semantics "2016-11-26 assoc.f90"
Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0 Build 20161005
Copyright (C) 1985-2016 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.00.24215.1
Copyright (C) Microsoft Corporation.  All rights reserved.

"-out:2016-11-26 assoc.exe"
-subsystem:console
"2016-11-26 assoc.obj"

>"2016-11-26 assoc.exe"
1,2,3,4,5
1,2,3,2,1

 

In terms of the first paragraph of #21, for the compiler (proper) to be able to realise that an object that references an object created by an allocate statement cannot be aliased to some other pointer later in the code, it requires the compiler to do some reasonably advanced reference tracking (for a Fortran context), and if the compiler ever loses visibility of where a reference comes from, that tracking is not possible.  I have no idea what the Intel compiler's capability is like in this area, and anyway, the actual code would be required to see how difficult the required reference tracking is - in #19 as shown the references come in via dummy arguments, so the compiler hasn't got a clue as to their history.

But if the programmer knows that aliasing is not possible, they can always make that obvious to the compiler by passing the relevant variables to a procedure where the dummy arguments do not have the POINTER or TARGET attributes, and then doing the operations in that procedure.

0 Kudos
Steven_L_Intel1
Employee
536 Views

The ASSOCIATE issue Ian notes in #23 has been escalated as issue DPD200416244. I have no new information on the original report for this thread.

0 Kudos
Andriy
Beginner
536 Views

 Intel(R) Visual Fortran Compiler 19.0.1.144 [Intel(R) 64] for Windows

Stack overflow on the following line:

arr_to(reordering(1:N)) = arr_from(1:N)

all three arrays are declared allocatable in their respective modules and included with the `use ...` statement.

Is temporary array expected in this case? Or is it something compiler does that it was not supposed to do?

Things were working fine with up to 18th version of the compiler.

0 Kudos
Steve_Lionel
Honored Contributor III
536 Views

Can you provide a small but complete example? A vector subscript on the left side might have duplicate values causing a temp to be needed unless the assignment was done sequentially. The temp might end up being faster if it could be vectorized. Without  buildable example, it's impossible to know for sure.

0 Kudos
jimdempseyatthecove
Honored Contributor III
536 Views

Andy,

For post #25, are you compiling with /heap-arrays?

Note, the [:size] optional argument appears to be non-functional, so omit.

Also note, that if you desire temporary objects on stack for all other source files, that in MS Visual Studio, you can select the /heap-arrays as a compiler option for the single source file causing the problem. And leave all other source files with default options.

Jim Dempsey

0 Kudos
Andriy
Beginner
536 Views

I am sorry Steve and Jim I did not see your posts earlier. Admin approval of my messages takes too much time on this forum so I have stopped checking this thread.

 

 

I had reported issue in #25 through Online Service Center and it was escalated to compiler developers.

 

The problem can be illustrated with this simple piece of code: https://godbolt.org/z/utZ6JF Note, vectorization is turned off with the compiler command line switch.

When you look at the disassembly code, line 189 corresponds to the assignment on line 12 of the source code. So, I figured register R15 holds pointer to the first element of ORDER array. Then, on lines 205-206 this array is copied somewhere.

1. This somewhere is allocated on stack and if the array size is large enough the stack allocation will crash the application. I don't want to use /heap-arrays switch because why would I want to hide incorrect code generation?

2. In addition to the potentially unrecoverable crash, the issue is detrimental to the performance. Basically, whenever you have such indirect assignment, you may expect to spend almost twice as much time doing the job because of additional array copy. In this case, we are lucky to have stack overflow in addition to the array copy so we could analyze the problem and postpone migration to 2019 compiler. In fact, I believe the problem first appeared in one of the updates for Intel 2018 compiler. Because of that, we have to stick to 2017 version.

 

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
536 Views

I have issue with your program listed on godbolt:

line 6: N = 100000
line 7:
line 8: allocate(order(N),...
line 9:
line 10: N=order(987) ! *** N now has trash (uninitialized data from allocation)

IOW the trash data may result in N receiving a value larger than size(order) and thus contribute to crash at line 16

line 16: arr_to(order(1:N)) = arr_from(1:N)

Jim Dempsey

0 Kudos
Andriy
Beginner
536 Views

That line has nothing to do with allocation of temporary arrays and call to _intel_fast_memcpy on array ORDER. I need that assignement for illustration only or otherwise compiler optimize N out of the assembly.

 

Here is the same example but without undefined behavior.  https://godbolt.org/z/l0Rl8R

_intel_fast_memcpy is on line 217 and R15 register points to array ORDER.

0 Kudos
jimdempseyatthecove
Honored Contributor III
536 Views

Change the implied loop to explicit loop:

    do i=1,N
        arr_to(order(i)) = arr_from(i)
    end do
; loop to move two real(8)'s
..B2.13:                        # Preds ..B2.11 ..B2.13
; obtain next two indexes into r8 and r10
        movsxd    r8, DWORD PTR [rbp+r15*8]                     #45.28
        movsxd    r10, DWORD PTR [4+rbp+r15*8]                  #45.28
; increment the # pairs counter
        inc       r15                                           #44.5
; obtain the two arr_from values
        mov       rsi, QWORD PTR [rcx+r14]                      #45.9
        mov       r9, QWORD PTR [8+rcx+r14]                     #45.9
; advance the next arr_from index
        add       rcx, 16                                       #44.5
; insert the two arr_to values (one using r8 and index, the other r10)
        mov       QWORD PTR [-8+r13+r8*8], rsi                  #45.9
        mov       QWORD PTR [-8+r13+r10*8], r9                  #45.9
; check for move by two completion
        cmp       r15, rdx                                      #44.5
        jb        ..B2.13       # Prob 63%                      #44.5 ;; not done moving
; move by two is done
; check for move odd number (iow remainder)
        lea       esi, DWORD PTR [1+r15+r15]                    #45.9
..B2.15:                        # Preds ..B2.14 ..B2.11
        lea       edx, DWORD PTR [-1+rsi]                       #44.5
        cmp       edx, eax                                      #44.5
        jae       ..B2.17       # Prob 2%                       #44.5 ;; no remainder
; have remainder
; produce index into order
        movsxd    rsi, esi                                      #45.28
; fetch order(i)
        movsxd    rdx, DWORD PTR [-4+rbp+rsi*4]                 #45.28
; fetch arr_from(i)
        mov       rax, QWORD PTR [-8+r14+rsi*8]                 #45.9
; store last arr_to
        mov       QWORD PTR [-8+r13+rdx*8], rax                 #45.9
..B2.17:                        # Preds ..B2.15 ..B2.10 ..B2.16

No stack temporary

Jim Dempsey

0 Kudos
Andriy
Beginner
536 Views

jimdempseyatthecove (Blackbelt) wrote:

Change the implied loop to explicit loop:

    do i=1,N
        arr_to(order(i)) = arr_from(i)
    end do

 

No stack temporary

 

Jim Dempsey

I know how to fix the crash. My intention is to help fellow developers who may stumble upon such behavior wrongfully assuming that compiler will generate equivalent code for implied and explicit loops.

0 Kudos
Reply