Intel® Fortran Compiler

Array temporary generated for contiguous slice

hakostra1
New Contributor II

I discovered the following behavior in ifort and ifx which I found strange. To me it seems like a missed optimization opportunity. The behavior is identical in ifort and ifx.

Given the following example:

SUBROUTINE stupid(kk, jj, ii, arr)
    IMPLICIT NONE

    INTEGER :: kk, jj, ii
    REAL :: arr(kk, jj, ii)

    WRITE(*, *) arr(1, 1, 1)
END SUBROUTINE stupid


PROGRAM main
    IMPLICIT NONE

    REAL, TARGET :: arr(32*32*32)
    REAL, POINTER, CONTIGUOUS :: arr_p(:)
    INTEGER :: l, kk, jj, ii

    arr = 0

    arr_p => arr


    kk = 16
    jj = 16
    ii = 16
    l = 1
    arr_p => arr

    WRITE(*,*) SIZE(arr_p((l-1)*kk*jj*ii+1:l*kk*jj*ii))
    WRITE(*,*) kk*jj*ii

    CALL stupid(kk, jj, ii, arr_p((l-1)*kk*jj*ii+1:l*kk*jj*ii))
END PROGRAM main

With both recent 'ifx' and 'ifort' (on Linux), I get the following when I compile with '-check all':

forrtl: warning (406): fort: (1): In call to STUPID, an array temporary was created for argument #4

Image              PC                Routine            Line        Source             
output.s           000000000040591C  main                       32  example.f90
output.s           000000000040515D  Unknown               Unknown  Unknown
libc-2.31.so       00007FF451E6F083  __libc_start_main     Unknown  Unknown
output.s           000000000040507E  Unknown               Unknown  Unknown

So it seems that an array temporary is generated. Strictly speaking this is not necessary, since the slice is perfectly contiguous (it has unit stride). GFortran manages to compile this without generating any temporary array.

If the call to the 'stupid' routine is simplified just slightly, it seems to be fine without any temporaries:

ip = (l-1)*kk*jj*ii+1
CALL stupid(kk, jj, ii, arr_p(ip:l*kk*jj*ii))

I have experimented a bit with variations, and it seems to depend on what is at the start of the slice (i.e. the part before the colon). If the start of the slice is just a plain variable (like 'ip'), no temporary is generated. A simple addition such as 'ip + 1' before the colon also works, and so does multiplication. However, if there is a 'complicated' expression with parentheses, like (l-1), the temporary is generated. The same logic does not apply to the part after the colon: there you can have complicated expressions with () without that influencing whether the compiler generates a temporary.

For example, the following does not seem to generate a temporary:

l = 0
CALL stupid(kk, jj, ii, arr_p(l*kk*jj*ii+1:(l+1)*kk*jj*ii))

The only difference from the original example is that the parenthesized expression now appears after the : instead of before it, and that works fine.

See the compiler explorer with a side-by-side GFortran and Intel comparison: https://godbolt.org/z/rv5c7WKo5

1 Solution
Ron_Green
Moderator

Bug ID CMPLRLLVM-42546

12 Replies
andrew_4619
Honored Contributor III

I compiled that on Windows and didn't get the warning about a temp (Intel® Fortran Compiler Classic 2021.7.0 [Intel(R) 64]), but I did get:

warning #8889: Explicit interface or EXTERNAL declaration is required. [STUPID]

If there is an explicit interface (i.e. the compiler knows more about STUPID), does the temp issue go away?

hakostra1
New Contributor II

No, putting it in a module does not help:

MODULE stupid_mod
IMPLICIT NONE
CONTAINS
SUBROUTINE stupid(kk, jj, ii, arr)
    IMPLICIT NONE

    INTEGER :: kk, jj, ii
    REAL :: arr(kk, jj, ii)

    WRITE(*, *) arr(1, 1, 1)
END SUBROUTINE stupid
END MODULE stupid_mod

PROGRAM main
    USE stupid_mod
    IMPLICIT NONE

    REAL, TARGET :: arr(32*32*32)
    REAL, POINTER, CONTIGUOUS :: arr_p(:)
    INTEGER :: l, kk, jj, ii

    arr = 0

    arr_p => arr


    kk = 16
    jj = 16
    ii = 16
    l = 1
    arr_p => arr

    WRITE(*,*) SIZE(arr_p((l-1)*kk*jj*ii+1:l*kk*jj*ii))
    WRITE(*,*) kk*jj*ii

    CALL stupid(kk, jj, ii, arr_p((l-1)*kk*jj*ii+1:l*kk*jj*ii))
END PROGRAM main

This gives the same warning.

andrew_4619
Honored Contributor III

In the past I have seen many cases where slices cause the creation of temps; the compiler does keep improving in this respect.

I think your use case is not so clear-cut: the dependence on l, ii, jj, kk might be what defeats the general-case rules the compiler applies, since it needs to unpick those expressions. Maybe that is deferred to run time.

I realise this is a demo/test program, but why use the arr_p pointer at all? And why specify the upper bound of the slice? With this design you need some in-code bounds checking anyway.

hakostra1
New Contributor II

Well, consider the following, further simplified example:

MODULE stupid_mod
IMPLICIT NONE
CONTAINS
SUBROUTINE stupid(n, arr)

    INTEGER :: n
    REAL :: arr(n)

    WRITE(*, *) arr
END SUBROUTINE stupid
END MODULE stupid_mod


PROGRAM main
    USE stupid_mod
    IMPLICIT NONE

    REAL, TARGET :: arr(1000)
    REAL, POINTER, CONTIGUOUS :: arr_p(:)
    INTEGER :: ip, n

    arr = 0

    arr_p => arr

    ip = 1
    n = 10
    CALL stupid(n, arr_p((ip):ip+n-1))
END PROGRAM main

Compiler explorer link: https://godbolt.org/z/dMGYa4vxT

This triggers the generation of a temporary.

The funny thing is that it is the parentheses on the left-hand side of the : that trigger this. The following does not generate a temporary:

CALL stupid(n, arr_p(ip:ip+n-1))

So putting the variable "ip" in parentheses, like "(ip)", generates a temporary, while plain "ip" does not. On the right-hand side of the : this has no effect on the behavior, i.e. I can add as many parentheses as I wish...

Ron_Green
Moderator

This is related but a little off topic. Some years back we collected and documented how array-passing methods affect optimization and vectorization. It includes discussions of when temps are created. It is a bit tangential to this thread, but it gives some insight into the compiler.

https://www.intel.com/content/www/us/en/developer/articles/technical/fortran-array-data-and-arguments-and-vectorization.html

andrew_4619
Honored Contributor III

I think the compiler is not clever enough: it looks at the slice defined by variable expressions, gives up, assumes it is indirect and makes the temp. I guess working sub-optimally is better than risking not working at all. Maybe a better design would be to simplify, to maximise the chance of avoiding a temp:

MODULE stupid_mod
IMPLICIT NONE
CONTAINS
SUBROUTINE stupid(n, arr)
    INTEGER :: n
    REAL :: arr(:)
    WRITE(*, *) arr(1:n)
END SUBROUTINE stupid
END MODULE stupid_mod

PROGRAM main
    USE stupid_mod
    IMPLICIT NONE
    REAL, TARGET :: arr(1000)
    REAL, POINTER, CONTIGUOUS :: arr_p(:)
    INTEGER :: ip, n
    arr = 0
    arr_p => arr
    ip = 1
    n = 10
    CALL stupid( n, arr_p(ip:) )
END PROGRAM main

 

 

FortranFan
Honored Contributor III

@hakostra1 ,

Until you manage to convince the Intel team, with your gfortran example, not to create array temporaries here, you may want to consider options that make things easier for the compiler and for the users of your code (which may primarily be you yourself). Among others, the following is one you can think about:

PROGRAM main
    IMPLICIT NONE

    REAL, TARGET :: arr(32*32*32)
    REAL, POINTER, CONTIGUOUS :: arr_p(:)
    REAL, POINTER, CONTIGUOUS :: arr_p_slice(:) !<-- use this for your slice?
    INTEGER :: l, kk, jj, ii

    arr = 0

    arr_p => arr !<-- perhaps use this object for the whole object reference?

    kk = 16
    jj = 16
    ii = 16
    l = 1

    arr_p_slice => arr((l-1)*kk*jj*ii+1:l*kk*jj*ii)
    WRITE(*,*) SIZE(arr((l-1)*kk*jj*ii+1:l*kk*jj*ii))
    WRITE(*,*) kk*jj*ii

    CALL stupid(kk, jj, ii, arr_p_slice)
END PROGRAM main

hakostra1
New Contributor II

Thanks for the comments, everyone. It's not about making it work, because I have already found several ways to trick the compiler into generating code that does what I want without making a temporary (i.e. avoiding parentheses).

I wrote the post because I was puzzled: a 1-D unit-stride slice of a contiguous rank-1 array is always guaranteed to be contiguous, so no temporary should ever be needed. Please correct me if I'm wrong here...
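One way to test that claim independently of the argument-passing machinery is the Fortran 2008 intrinsic IS_CONTIGUOUS. The following is only a minimal sketch that reuses the declarations from the simplified example earlier in the thread; the program name check_contiguous is made up for illustration, and the result says nothing about whether a particular compiler version still creates a temporary at the call site.

```fortran
PROGRAM check_contiguous
    IMPLICIT NONE

    REAL, TARGET :: arr(1000)
    REAL, POINTER, CONTIGUOUS :: arr_p(:)
    INTEGER :: ip, n

    arr = 0
    arr_p => arr
    ip = 1
    n = 10

    ! Unit-stride sections of a contiguous rank-1 array, written
    ! without and with parentheses around the lower bound.
    WRITE(*, *) IS_CONTIGUOUS(arr_p(ip:ip+n-1))
    WRITE(*, *) IS_CONTIGUOUS(arr_p((ip):ip+n-1))

    ! A strided section of the same array, for contrast.
    WRITE(*, *) IS_CONTIGUOUS(arr_p(ip:ip+n-1:2))
END PROGRAM check_contiguous
```

A standard-conforming compiler should report true for both unit-stride sections and false for the strided one, since the parentheses around the lower bound do not change which elements are selected.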

GFortran seems to get this right; I have not found any situation where it generates a temporary in this case. However, as soon as you use non-unit strides or slice arrays of rank higher than 1, temporaries are generated as required. For the Intel compiler(s), this just seems like a missed opportunity...

jimdempseyatthecove
Honored Contributor III

hakostra1, I think you did an excellent job of identifying an optimization issue.

 

Side comment:

It appears that the linear (1D) array arr is being partitioned into 3D tiles. That being the case, as long as all code uses the same values for ii, jj, kk at all times during the run, the slicing will be correct. Any change to any of the size values will either require a non-unit stride .OR. cannot be described using a non-unit stride (and thus require a temporary).
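The tiling arithmetic can also be written with named bounds and a pointer bounds-remapping assignment, which the standard allows when the target is rank one. This is a sketch under assumptions, not something tested in this thread: the names tile_mod, show_corner, tile, lo and hi are hypothetical, and whether it avoids the temporary on the affected compiler versions has not been verified here.

```fortran
MODULE tile_mod
    IMPLICIT NONE
CONTAINS
    SUBROUTINE show_corner(kk, jj, ii, arr)
        INTEGER, INTENT(IN) :: kk, jj, ii
        REAL, INTENT(IN) :: arr(kk, jj, ii)

        WRITE(*, *) arr(1, 1, 1)
    END SUBROUTINE show_corner
END MODULE tile_mod

PROGRAM tiles
    USE tile_mod
    IMPLICIT NONE

    REAL, TARGET :: arr(32*32*32)
    REAL, POINTER, CONTIGUOUS :: arr_p(:)
    REAL, POINTER, CONTIGUOUS :: tile(:, :, :)   ! hypothetical 3-D view
    INTEGER :: l, kk, jj, ii, lo, hi

    arr = 0
    arr_p => arr

    kk = 16
    jj = 16
    ii = 16
    l = 2

    ! First and last element of tile number l in the flat array.
    lo = (l-1)*kk*jj*ii + 1
    hi = l*kk*jj*ii
    arr(lo) = 42.0

    ! Bounds remapping: view the rank-1 section as a kk x jj x ii tile.
    tile(1:kk, 1:jj, 1:ii) => arr_p(lo:hi)

    WRITE(*, *) SHAPE(tile)
    CALL show_corner(kk, jj, ii, tile)
END PROGRAM tiles
```

Computing lo and hi once keeps every tile tied to the same kk, jj, ii values, which is the consistency condition described above.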

 

Jim Dempsey

Ron_Green
Moderator

I agree that there is a chance to improve the compiler here by removing this unneeded argument temp. I'll open a bug report on this.

I simply combined the two call types, shown below, just to prove to the devs that the temp is created only on the second call.

 

MODULE stupid_mod
IMPLICIT NONE
CONTAINS
SUBROUTINE stupid(n, arr)

    INTEGER :: n
    REAL :: arr(n)

    WRITE(*, *) arr
END SUBROUTINE stupid
END MODULE stupid_mod


PROGRAM main
    USE stupid_mod
    IMPLICIT NONE

    REAL, TARGET :: arr(1000)
    REAL, POINTER, CONTIGUOUS :: arr_p(:)
    INTEGER :: ip, n

    arr = 0

    arr_p => arr

    ip = 1
    n = 10
    CALL stupid(n, arr_p(ip:ip+n-1))
    
    CALL stupid(n, arr_p((ip):ip+n-1))
END PROGRAM main

Ron_Green
Moderator

Bug ID CMPLRLLVM-42546

Ron_Green
Moderator

This bug is fixed in the 2024.0 release.