Solved: Re: Array temporary generated for contiguous slice

hakostra1 · 12-07-2022

I discovered the following behavior in ifort and ifx which I found strange. To me it seems like a missed optimization opportunity. The behavior is identical in ifort and ifx.

Given the following example:

SUBROUTINE stupid(kk, jj, ii, arr)
    IMPLICIT NONE

    INTEGER :: kk, jj, ii
    REAL :: arr(kk, jj, ii)

    WRITE(*, *) arr(1, 1, 1)
END SUBROUTINE stupid


PROGRAM main
    IMPLICIT NONE

    REAL, TARGET :: arr(32*32*32)
    REAL, POINTER, CONTIGUOUS :: arr_p(:)
    INTEGER :: l, kk, jj, ii

    arr = 0

    arr_p => arr


    kk = 16
    jj = 16
    ii = 16
    l = 1
    arr_p => arr

    WRITE(*,*) SIZE(arr_p((l-1)*kk*jj*ii+1:l*kk*jj*ii))
    WRITE(*,*) kk*jj*ii

    CALL stupid(kk, jj, ii, arr_p((l-1)*kk*jj*ii+1:l*kk*jj*ii))
END PROGRAM main

With both recent 'ifx' and 'ifort' (on Linux), I get the following when I compile with '-check all':

forrtl: warning (406): fort: (1): In call to STUPID, an array temporary was created for argument #4

Image              PC                Routine            Line        Source             
output.s           000000000040591C  main                       32  example.f90
output.s           000000000040515D  Unknown               Unknown  Unknown
libc-2.31.so       00007FF451E6F083  __libc_start_main     Unknown  Unknown
output.s           000000000040507E  Unknown               Unknown  Unknown

So it seems an array temporary is generated. Strictly speaking this is not necessary, since the slice is unit stride and therefore perfectly contiguous. GFortran manages to compile this without generating any temporary array.

If the call to the 'stupid' routine is simplified just slightly (with 'ip' declared as an INTEGER), no temporary is generated:

ip = (l-1)*kk*jj*ii+1
CALL stupid(kk, jj, ii, arr_p(ip:l*kk*jj*ii))

I have experimented a bit with variations, and it seems to depend on what is at the start of the slice (i.e. the part before the colon). If the start of the slice is just a plain variable (like 'ip'), no temporary is generated. A simple addition before the colon, like 'ip + 1', also works, and so does multiplication. However, if there is a 'complicated' expression with parentheses, like (l-1), the temporary is generated. The same logic does not apply to the part after the colon: there you can apparently have complicated expressions with parentheses without that influencing whether the compiler generates a temporary.

For example, the following does not seem to generate a temporary:

l = 0
CALL stupid(kk, jj, ii, arr_p(l*kk*jj*ii+1:(l+1)*kk*jj*ii))

The only difference from the original example is that the parenthesized expression now appears after the : instead of before it, and that works fine.
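
The pattern described above can be collected into one small test program. This is only a sketch: the program name slice_variants and the internal routine show are made up for illustration, and the comments record the behavior reported in this post for the affected ifort/ifx versions with '-check all', not something guaranteed for other compilers or releases.

```fortran
! Sketch only: comments describe the behavior reported in this thread
! for the affected ifort/ifx versions, not guaranteed behavior.
PROGRAM slice_variants
    IMPLICIT NONE

    REAL, TARGET :: arr(32*32*32)
    REAL, POINTER, CONTIGUOUS :: arr_p(:)
    INTEGER :: l, kk, jj, ii, ip

    arr = 0
    arr_p => arr

    kk = 16
    jj = 16
    ii = 16
    l = 1
    ip = (l-1)*kk*jj*ii + 1

    ! Plain variable as lower bound: no temporary reported
    CALL show(kk, jj, ii, arr_p(ip:l*kk*jj*ii))

    ! Parentheses only in the upper bound: no temporary reported
    CALL show(kk, jj, ii, arr_p(ip:(l-1)*kk*jj*ii + kk*jj*ii))

    ! Parenthesized expression in the lower bound: temporary reported
    CALL show(kk, jj, ii, arr_p((l-1)*kk*jj*ii+1:l*kk*jj*ii))

CONTAINS

    ! Hypothetical stand-in for the 'stupid' routine of the original example
    SUBROUTINE show(kk, jj, ii, a)
        INTEGER :: kk, jj, ii
        REAL :: a(kk, jj, ii)
        WRITE(*, *) a(1, 1, 1)
    END SUBROUTINE show

END PROGRAM slice_variants
```

All three calls pass the same 16*16*16 elements, so any difference in the run-time warnings comes only from how the lower bound is written.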

See the compiler explorer with a side-by-side GFortran and Intel comparison: https://godbolt.org/z/rv5c7WKo5

andrew_4619 · 12-07-2022

I compiled that on Windows and didn't get the warning about a temp (Intel® Fortran Compiler Classic 2021.7.0 [Intel(R) 64]), but I did get:

warning #8889: Explicit interface or EXTERNAL declaration is required. [STUPID]

If there is an explicit interface (i.e. the compiler knows more about STUPID), does the temp issue go away?

hakostra1 · 12-07-2022

No, putting it in a module does not help:

MODULE stupid_mod
IMPLICIT NONE
CONTAINS
SUBROUTINE stupid(kk, jj, ii, arr)
    IMPLICIT NONE

    INTEGER :: kk, jj, ii
    REAL :: arr(kk, jj, ii)

    WRITE(*, *) arr(1, 1, 1)
END SUBROUTINE stupid
END MODULE stupid_mod

PROGRAM main
    USE stupid_mod
    IMPLICIT NONE

    REAL, TARGET :: arr(32*32*32)
    REAL, POINTER, CONTIGUOUS :: arr_p(:)
    INTEGER :: l, kk, jj, ii

    arr = 0

    arr_p => arr


    kk = 16
    jj = 16
    ii = 16
    l = 1
    arr_p => arr

    WRITE(*,*) SIZE(arr_p((l-1)*kk*jj*ii+1:l*kk*jj*ii))
    WRITE(*,*) kk*jj*ii

    CALL stupid(kk, jj, ii, arr_p((l-1)*kk*jj*ii+1:l*kk*jj*ii))
END PROGRAM main

This gives the same warning.

andrew_4619 · 12-07-2022

In the past I have seen many cases where slices cause temps to be created; the compiler does keep improving in this respect.

I think your use case is not so clear. The dependencies on l, ii, jj and kk might be what defeats the general-case rules the compiler applies, since it needs to unpick those; maybe that is deferred to run time.

I realise this is a demo/test program, but why use the arr_p pointer at all? And why specify the upper bound of the slice? With this design you need some in-code bounds checking anyway.

hakostra1 · 12-07-2022

Well, consider the following, further simplified example:

MODULE stupid_mod
IMPLICIT NONE
CONTAINS
SUBROUTINE stupid(n, arr)

    INTEGER :: n
    REAL :: arr(n)

    WRITE(*, *) arr
END SUBROUTINE stupid
END MODULE stupid_mod


PROGRAM main
    USE stupid_mod
    IMPLICIT NONE

    REAL, TARGET :: arr(1000)
    REAL, POINTER, CONTIGUOUS :: arr_p(:)
    INTEGER :: ip, n

    arr = 0

    arr_p => arr

    ip = 1
    n = 10
    CALL stupid(n, arr_p((ip):ip+n-1))
END PROGRAM main

Compiler explorer link: https://godbolt.org/z/dMGYa4vxT

This triggers the generation of a temporary.

The funny thing is that it is the parentheses on the left-hand side of the : that trigger this; the following does not generate a temporary:

CALL stupid(n, arr_p(ip:ip+n-1))

So putting the variable "ip" in parentheses, like "(ip)", generates a temporary, while a plain "ip" does not. On the right-hand side this has no effect on the behavior, i.e. I can add as many parentheses as I wish...

Ron_Green · 12-07-2022

This is related but a little off topic. Some years back we collected and documented how array passing methods affect optimization and vectorization. It includes discussions of when temps are created. It is a bit tangential to this thread, but it gives insight into the compiler.

https://www.intel.com/content/www/us/en/developer/articles/technical/fortran-array-data-and-arguments-and-vectorization.html

andrew_4619 · 12-07-2022

I think the compiler is not clever enough: it looks at the slice defined by variable expressions, gives up, assumes it is indirect and makes the temp. I guess working sub-optimally is better than risking not working at all. Maybe a better design would be to simplify, to maximise the possibility of avoiding a temp:

MODULE stupid_mod
IMPLICIT NONE
CONTAINS
SUBROUTINE stupid(n, arr)
    INTEGER :: n
    REAL :: arr(:)
    WRITE(*, *) arr(1:n)
END SUBROUTINE stupid
END MODULE stupid_mod

PROGRAM main
    USE stupid_mod
    IMPLICIT NONE
    REAL, TARGET :: arr(1000)
    REAL, POINTER, CONTIGUOUS :: arr_p(:)
    INTEGER :: ip, n
    arr = 0
    arr_p => arr
    ip = 1
    n = 10
    CALL stupid( n, arr_p(ip:) )
END PROGRAM main

FortranFan · 12-07-2022

@hakostra1 ,

Until you somehow manage to convince the Intel team with your gfortran example not to create array temporaries here, you may want to consider options that make things easier on the compiler and on the users of your code (which may primarily be you yourself). Among others, the following is one you can think about:

PROGRAM main
    IMPLICIT NONE

    REAL, TARGET :: arr(32*32*32)
    REAL, POINTER, CONTIGUOUS :: arr_p(:)
    REAL, POINTER, CONTIGUOUS :: arr_p_slice(:) !<-- use this for your slice?
    INTEGER :: l, kk, jj, ii

    arr = 0

    arr_p => arr !<-- perhaps use this object for the whole object reference?

    kk = 16
    jj = 16
    ii = 16
    l = 1

    arr_p_slice => arr((l-1)*kk*jj*ii+1:l*kk*jj*ii)
    WRITE(*,*) SIZE(arr((l-1)*kk*jj*ii+1:l*kk*jj*ii))
    WRITE(*,*) kk*jj*ii

    CALL stupid(kk, jj, ii, arr_p_slice)
END PROGRAM main
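
A related variation, offered only as a sketch: if the goal is a 3-D view of one slice, a bounds-remapping pointer assignment (Fortran 2003 and later) can attach a rank-3 pointer to the rank-1 section, so the slice does not have to be passed as an actual argument at all. The names tile_view, tile, lo and hi are hypothetical, and whether this suits the real code is an assumption.

```fortran
PROGRAM tile_view
    IMPLICIT NONE

    REAL, TARGET :: arr(32*32*32)
    REAL, POINTER, CONTIGUOUS :: arr_p(:)
    REAL, POINTER :: tile(:, :, :)   ! hypothetical 3-D view of one slice
    INTEGER :: l, kk, jj, ii, lo, hi

    arr = 0
    arr_p => arr

    kk = 16
    jj = 16
    ii = 16
    l = 1

    ! Bounds of the slice, computed up front as plain variables
    lo = (l-1)*kk*jj*ii + 1
    hi = l*kk*jj*ii

    ! Bounds-remapping pointer assignment: rank-1 target, rank-3 pointer
    tile(1:kk, 1:jj, 1:ii) => arr_p(lo:hi)

    WRITE(*, *) SHAPE(tile)
    WRITE(*, *) tile(1, 1, 1)
END PROGRAM tile_view
```

The pointer refers to the original storage, so writes through tile are visible in arr.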

hakostra1 · 12-07-2022

Thanks for the comments, everyone. It's not about making it work, because I have already found several ways to trick the compiler into generating code that does what I want without making a temporary (i.e. avoiding parentheses).

I wrote the post here because I was puzzled: a 1-D unit-stride slice of a contiguous rank-1 array is always guaranteed to be contiguous, so no temporary should ever be needed. Please correct me if I'm wrong here...

GFortran seems to get this right; I have not found any situation where it generates a temporary in this case. However, as soon as you use non-unit strides or slice arrays of rank higher than 1, temporaries are generated as required. For the Intel compiler(s), this just seems like a missed opportunity...
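
One way to check the contiguity claim from inside the program is the Fortran 2008 intrinsic IS_CONTIGUOUS. The following is a minimal sketch based on the simplified example; the program name check_contiguous is made up, and the stride-2 section is added only for comparison.

```fortran
PROGRAM check_contiguous
    IMPLICIT NONE

    REAL, TARGET :: arr(1000)
    REAL, POINTER, CONTIGUOUS :: arr_p(:)
    INTEGER :: ip, n

    arr = 0
    arr_p => arr
    ip = 1
    n = 10

    ! Unit-stride section of a contiguous rank-1 pointer,
    ! written with the parenthesized lower bound from the example
    WRITE(*, *) IS_CONTIGUOUS(arr_p((ip):ip+n-1))

    ! Stride-2 section of the same pointer, for comparison
    WRITE(*, *) IS_CONTIGUOUS(arr_p(ip:ip+2*n-1:2))
END PROGRAM check_contiguous
```

By the standard's definition the first section is contiguous and the second is not, so a conforming compiler should report true and then false, regardless of whether it creates a temporary when the section is passed as an argument.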

jimdempseyatthecove · 12-08-2022

hakostra1, I think you did an excellent job of identifying an optimization issue.

Side comment:

It appears that the linear (1D) array arr is being partitioned into 3D tiles. This being the case, as long as all code uses the same values for ii, jj, kk at all times during the run, the slicing will be correct. Any change to any of the size values will either require a non-unit stride .OR. cannot be described using a non-unit stride (and thus requires a temporary).

Jim Dempsey

Ron_Green · 12-08-2022

I agree that there is a chance to improve the compiler for this unneeded arg temp creation. I'll open a bug report on this.

I simply combined the 2 call types, shown below, just to prove to the devs that the temp is created only on the 2nd call:

MODULE stupid_mod
IMPLICIT NONE
CONTAINS
SUBROUTINE stupid(n, arr)

    INTEGER :: n
    REAL :: arr(n)

    WRITE(*, *) arr
END SUBROUTINE stupid
END MODULE stupid_mod


PROGRAM main
    USE stupid_mod
    IMPLICIT NONE

    REAL, TARGET :: arr(1000)
    REAL, POINTER, CONTIGUOUS :: arr_p(:)
    INTEGER :: ip, n

    arr = 0

    arr_p => arr

    ip = 1
    n = 10
    CALL stupid(n, arr_p(ip:ip+n-1))
    
    CALL stupid(n, arr_p((ip):ip+n-1))
END PROGRAM main

Ron_Green · 12-08-2022

bug ID CMPLRLLVM-42546

Ron_Green · 11-21-2023

This bug is fixed in the 2024.0 release.