temporary array creation, optimisation and parameter keyword

Martin_Stein · ‎11-05-2010

I'm working with two-dimensional real arrays of the form A(index_func,index_point), which store function values for a set of points. A typical expression in the code is A(index_velocity(1:3),index_point), which retrieves the velocity vector for index_point. Clearly, without any information about index_velocity(1:3), the compiler cannot apply specific optimisations or avoid creating temporary arrays in subroutine calls. However, if I declare, for example,

integer, parameter :: index_velocity(1:3) = (/6,7,8/)

I would expect the compiler to replace A(index_velocity(1:3),index_point) with A(6:8,index_point) so that it can apply further optimisations.
However, ifort (version 10) does not seem to apply any such optimisation; at least, I can see that it still creates temporary arrays. Is there a reason why the compiler does not apply such an obvious optimisation step (for example, am I misinterpreting the parameter keyword)?
Any help is appreciated!

Steven_L_Intel1 · ‎11-05-2010

The compiler can do value propagation from array constants, but you've made it a bit harder by using subscripts in the reference. Also, what you have is an array with vector subscripts which is treated as an expression by the language and the compiler generally creates temps for those, depending on the context.
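The difference between a vector subscript and an ordinary section can be illustrated with a minimal sketch. Only index_velocity = (/6,7,8/) comes from the question; the program name, array sizes and fill values are assumptions made for illustration. The sketch shows that the vector-subscripted reference, the literal triplet 6:8 and a triplet whose bounds are taken from the named constant all select the same elements. It says nothing about which form a particular compiler optimises.

```fortran
program vector_vs_triplet
   implicit none
   ! Named constant from the question; everything else is hypothetical.
   integer, parameter :: index_velocity(3) = (/6, 7, 8/)
   integer, parameter :: index_point = 2
   real :: A(10, 4)
   integer :: i, j

   ! Fill A with distinct values so that a wrong element would be noticed.
   do j = 1, 4
      do i = 1, 10
         A(i, j) = real(10*j + i)
      end do
   end do

   ! Vector subscript: the language treats this reference as an expression.
   print *, A(index_velocity, index_point)
   ! Subscript triplet: an ordinary array section with the same elements.
   print *, A(6:8, index_point)
   ! Triplet whose bounds are taken from the named constant.
   print *, A(index_velocity(1):index_velocity(3), index_point)

   ! Self-check: all three forms must select identical values.
   if (any(A(index_velocity, index_point) /= A(6:8, index_point))) &
      error stop 'vector subscript and triplet differ'
   if (any(A(index_velocity(1):index_velocity(3), index_point) /= A(6:8, index_point))) &
      error stop 'named-constant triplet and literal triplet differ'
end program vector_vs_triplet
```

The third form keeps the readable name in the source while giving the compiler a triplet instead of a vector subscript. It is only valid under the assumption that the indices are consecutive and ascending.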

Can you show a small but complete program that demonstrates the issue? Also, you might try a newer compiler - version 10 is rather old at this point.

Martin_Stein · ‎11-11-2010

Below is a simple, somewhat dumb example. I would expect the compiler to handle such cases by replacing the ind_velocity array subscript with the defined constant values (optimisation flags are -xO -O3). Instead, it refuses to compile because of the array subscript in the subroutine call in conjunction with the intent(inout) attribute of the dummy argument v of the subroutine add_vector. If I replace intent(inout) with intent(in) and the addition line with a write statement, the module compiles. However, with the 'check all' compiler option I confirmed that a temporary array is created.

[bash]module test

implicit none
private

public increase_velocity, ind_velocity

integer(4), dimension(1:3), parameter :: ind_velocity = (/4,5,6/)


contains


   subroutine add_vector(v, w)
      real(4), dimension(1:3), intent(inout)&
            :: v
      real(4), dimension(1:3), intent(in)&
            :: w
      v(1:3) = v(1:3) + w(1:3)
   end subroutine add_vector


   subroutine increase_velocity(A, v_inc)
      real(4), dimension(1:10,1:10), intent(inout)&
            :: A
      real(4), dimension(1:3), intent(in)&
            :: v_inc
      integer(4)&
            :: i
      do i = 1,10
         call add_vector(A(ind_velocity(1:3),i), v_inc)
      end do

   end subroutine increase_velocity

end module test
[/bash]

Using the line

[bash]w(1:3) = v(1:3) + w(1:3)
[/bash]

and swapping the intent(inout) and intent(in) of the two arguments of increase_velocity, I get the following machine code, which shows that the compiler did both inlining and loop unrolling. It also clearly shows that no advantage is taken of the predefined ind_velocity values: they are loaded from memory into registers and used in address calculations. In fact, substituting the predefined values by hand reduces the code almost exactly as I would have expected: address offsets are computed at compile time, and the rax, rdx and rcx registers are not used. (On a side note, I strongly expected the movss and addss instructions to be replaced by their packed versions, but they were not. Are those instructions not available in SSE (-xO option), or what might be the reason?)

[bash]# parameter 1(a): %rdi
# parameter 2(v_inc): %rsi
        movq      %rsi, %rbx                                    #24.15
        movslq    triangle_mp_ind_velocity_(%rip), %rax         #32.28
        movslq    4+triangle_mp_ind_velocity_(%rip), %rdx       #32.28
        movslq    8+triangle_mp_ind_velocity_(%rip), %rcx       #32.28
        movss     -4(%rdi,%rax,4), %xmm0                        #32.26
        movss     -4(%rdi,%rdx,4), %xmm1                        #32.26
        movss     -4(%rdi,%rcx,4), %xmm2                        #32.26
        addss     (%rbx), %xmm0                                 #32.15
        addss     4(%rbx), %xmm1                                #32.15
        addss     8(%rbx), %xmm2                                #32.15
        addss     36(%rdi,%rax,4), %xmm0                        #32.15
        addss     36(%rdi,%rdx,4), %xmm1                        #32.15
        addss     36(%rdi,%rcx,4), %xmm2                        #32.15
        addss     76(%rdi,%rax,4), %xmm0                        #32.15
        addss     76(%rdi,%rdx,4), %xmm1                        #32.15
        addss     76(%rdi,%rcx,4), %xmm2                        #32.15
 ...

[/bash]

Regarding the use of the old ifort 10 compiler: I plan to evaluate several compilers for a compiler update, so that we might be able to use ipo (which does not work for our code due to some bug) and the upcoming ISA extensions.

mecej4 · ‎11-11-2010

Did you try compiling your example with a recent compiler? With IFort 11.1 or 12, and with GFortran 4.5, I see a syntax error for line 31, objecting to passing an array section with a vector subscript when the formal argument has an intent other than IN. In other words, an actual argument must be "definable" if the formal argument has intent INOUT or OUT.
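One way to make the actual argument definable is sketched below, under the assumption that the three indices are consecutive and ascending, as in the posted example. The vector subscript is replaced by a subscript triplet built from the first and last elements of ind_velocity. The module name velocity_update and the driver program are hypothetical; add_vector, increase_velocity and ind_velocity follow the posted code.

```fortran
module velocity_update
   implicit none
   private
   public :: increase_velocity, ind_velocity

   integer, parameter :: ind_velocity(3) = (/4, 5, 6/)

contains

   subroutine add_vector(v, w)
      real, intent(inout) :: v(3)
      real, intent(in)    :: w(3)
      v = v + w
   end subroutine add_vector

   subroutine increase_velocity(A, v_inc)
      real, intent(inout) :: A(10, 10)
      real, intent(in)    :: v_inc(3)
      integer :: i
      do i = 1, 10
         ! Assumption: the indices are consecutive, so a triplet from the
         ! first to the last element names the same rows as the vector
         ! subscript. A triplet section is definable and may therefore be
         ! associated with an intent(inout) dummy argument.
         call add_vector(A(ind_velocity(1):ind_velocity(3), i), v_inc)
      end do
   end subroutine increase_velocity

end module velocity_update

program demo_velocity_update
   use velocity_update
   implicit none
   real :: A(10, 10)

   A = 0.0
   call increase_velocity(A, (/1.0, 2.0, 3.0/))

   ! Self-check: rows 4..6 were updated, rows 1..3 were left alone.
   if (any(A(4:6, 1) /= (/1.0, 2.0, 3.0/))) error stop 'unexpected result'
   if (any(A(1:3, :) /= 0.0)) error stop 'rows outside the section changed'
   print *, A(4:6, 1)
end program demo_velocity_update
```

This keeps the named constant in the call while avoiding the vector subscript. Whether a given compiler version then avoids a temporary is not something the sketch can show.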

Martin_Stein · ‎11-11-2010

As I wrote, it does not compile, because the compiler does not replace ind_velocity(1:3) with the assigned values. My main question is just that: why is this not done, even though it would be a very simple and obvious optimisation? (I understand why, in general (without a parameter declaration), a Fortran compiler should refuse to compile this.)

mecej4 · ‎11-11-2010

Sorry, I overlooked the lines where you said that the machine code listing was obtained with a modified version of the Fortran code.

Let's see. What would you like the compiler to do (assuming that the Fortran standard allows that action, whatever it might be), if the vector subscript was set to, say, (/ 3, 5, 7 /) instead of being given consecutive values, as you did in your example? Or do you expect a vector subscript with consecutive values to be treated as a special case, ripe for optimization?
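For the specific values (/ 3, 5, 7 /), the spacing is constant, so the same elements can also be written as a strided subscript triplet. The sketch below illustrates only that equivalence; it is not a statement about what any compiler does automatically. The names idx, step and double_in_place are hypothetical, and a vector with irregular spacing has no triplet equivalent.

```fortran
program strided_triplet
   implicit none
   integer, parameter :: idx(3) = (/3, 5, 7/)    ! values from the question above
   integer, parameter :: step = idx(2) - idx(1)  ! constant stride, here 2
   real :: A(10)
   integer :: i

   do i = 1, 10
      A(i) = real(i)
   end do

   ! The vector subscript and the strided triplet 3:7:2 select the same elements.
   if (any(A(idx) /= A(idx(1):idx(3):step))) error stop 'sections differ'

   ! Unlike the vector-subscripted form, the triplet section may be passed
   ! to a dummy argument with intent(inout).
   call double_in_place(A(idx(1):idx(3):step))
   if (any(A(idx) /= (/6.0, 10.0, 14.0/))) error stop 'unexpected values'
   print *, A

contains

   subroutine double_in_place(v)
      real, intent(inout) :: v(:)
      v = 2.0 * v
   end subroutine double_in_place

end program strided_triplet
```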

Martin_Stein · ‎11-12-2010

Well, I did expect just that, but maybe that's just too special. With ifort 10 it seems that the parameter keyword for arrays merely protects the array elements from being overwritten. For example, the following code

[bash]      A(1,1) = sum(A(ind_velocity(1:3),1)**2)
[/bash]

inserted in the subroutine increase_velocity produces machine code that computes the address locations at run time (via three additional movs for reading ind_velocity(1:3) and three address calculations within the movs that load the xmm registers). At least in such a case I would expect the compiler to replace the ind_velocity(1:3) expression and then run further optimisations (such as packing). In the case mentioned in my second post, I can understand that the Fortran standard might even forbid such calls because of the intent(inout) attribute.

I would also like to ask again why the compiler does not use packed instructions to move or multiply the values in the above or similar statements. I'm well aware that having three values does not help much, but even if I look at the machine code for

[bash]   A(1,1) = sum(A(1:4,1)**2)
[/bash]

only the scalar move and mul instructions are used. Has that been improved in newer compiler versions? I have tons of such code lines, and it feels like a waste of silicon to have all those fancy SSEx instructions and not see them used.

jimdempseyatthecove · ‎11-12-2010

Martin,

Will index_velocity always return /n,n+1,n+2/?

If so then consider

type Vec3
real :: v(3)
end type Vec3
...
type (Vec3) :: A(nVecs, nOther)

Note, for REAL(4) you might consider making Vec3 take 4 REALs (SSE vector size)

Then your index_velocity becomes a function returning a single value from the sequence 1, 2, 3, ... nVecs

Also, if Vec3 is used inside a user defined type containing SEQUENCE then you will have to add SEQUENCE to Vec3

type Vec3
SEQUENCE
real :: v(3)
end type Vec3
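A minimal sketch of this layout follows, assuming the velocity occupies one Vec3 slot per point. The module name vec_types, the sizes nVecs and nOther, the slot constant i_velocity and the driver program are hypothetical choices for illustration; only the Vec3 type and the A(nVecs, nOther) shape come from the suggestion above.

```fortran
module vec_types
   implicit none

   type Vec3
      real :: v(3)
   end type Vec3

contains

   subroutine add_vector(p, w)
      type(Vec3), intent(inout) :: p
      real, intent(in)          :: w(3)
      p%v = p%v + w
   end subroutine add_vector

end module vec_types

program demo_vec3
   use vec_types
   implicit none
   integer, parameter :: nVecs = 2, nOther = 10   ! hypothetical sizes
   integer, parameter :: i_velocity = 1           ! hypothetical slot for the velocity
   type(Vec3) :: A(nVecs, nOther)
   integer :: j, k

   do j = 1, nOther
      do k = 1, nVecs
         A(k, j)%v = 0.0
      end do
   end do

   ! A single array element is passed, so no vector subscript is involved
   ! and the actual argument is definable.
   do j = 1, nOther
      call add_vector(A(i_velocity, j), (/1.0, 2.0, 3.0/))
   end do

   if (any(A(i_velocity, 1)%v /= (/1.0, 2.0, 3.0/))) error stop 'unexpected result'
   if (any(A(2, 1)%v /= 0.0)) error stop 'other slot was modified'
   print *, A(i_velocity, 1)%v
end program demo_vec3
```

With this layout the three components of a vector are stored contiguously and addressed through a single scalar index, which is the point of the suggestion. Padding v to four elements, as mentioned for REAL(4), would only change the component declaration.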

Jim Dempsey

Martin_Stein · ‎11-12-2010

> Will index_velocity always return /n,n+1,n+2/?
Well, it will always be something like (/4,5,6/); it is defined as a constant! There are several reasons for using index_velocity instead of (/4,5,6/) or even preprocessor macros: 1. Readability 2. Flexibility (in case I want to change the order...) 3. Lots of existing code.
Originally index_velocity was not known at compile time. My expectation was that fixing the values at compile time (at least the most common ones, since some index_functions that can only be assigned at run time are still required) would give the compiler the opportunity to optimise quite a bit. However, I do not see any optimisations that make use of the parameter keyword, and that surprises me, even when I use something like A(1:4,i) as mentioned in my last post.

(BTW, I already thought about extending three-dimensional vectors to four dimensions, but as long as I do not see packed SSE instructions in my code, there is simply no reason to consider such a step. Once on this track, further questions come to mind: how should the dimension N in A(1:N,1:M) be chosen to optimise for aligned memory access in packed SSE instructions, and how should the index values of frequently used index vectors such as index_velocity be chosen? If I choose N=4k or N=8k or even N=16k (with the upcoming ymm registers) for some integer k, is the compiler/compiled code able to recognise the alignment efficiently?)