a confusion about the assembly code after vectorization

Deleted_U_Intel · ‎10-13-2004

I vectorized the program below and looked at the generated assembly code. However, I find that the asm code spends a long sequence of instructions just to add 2000 to the array variable's address when incrementing j.

program test2
  real frct(5,100,100)
  integer m,i,j,k

  do j=1, 44
    do i=1, 47
      do m=1, 5
        frct(m,i,j)=0.0d+0
      enddo
    enddo
  enddo
end

The sequence below seems to exist only to set %ecx=2000 and %edi=2000+928.

Why not compute them directly?

lea (%eax,%eax), %ecx #8.20
addl %ecx, %ecx #8.20
subl %eax, %ecx #8.20
lea (%eax,%ecx,8), %ebp #8.20
lea (%ebp,%ebp,4), %ebp #8.20
addl %ebp, %ebp #8.20
lea (%ebp,%ebp), %edi #8.20
addl %edi, %edi #8.20
lea (%edi,%edi), %ecx #8.20
lea 928(%edi,%edi), %edi #8.20
.align 4,0x90
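
For reference, the quoted sequence can be traced one instruction at a time. The sketch below is only an illustration: it assumes %eax holds some integer a on entry (the listing does not show what it contains, though the j index is a plausible guess), and the names addr_trace and trace_setup are invented here.

```fortran
module addr_trace
  implicit none
contains
  ! Mirrors the quoted lea/add/sub sequence, one statement per instruction.
  ! a stands for the value of %eax on entry (an assumption, see above).
  subroutine trace_setup(a, ecx, edi)
    integer, intent(in)  :: a
    integer, intent(out) :: ecx, edi
    integer :: ebp
    ecx = a + a            ! lea (%eax,%eax), %ecx       ->    2*a
    ecx = ecx + ecx        ! addl %ecx, %ecx             ->    4*a
    ecx = ecx - a          ! subl %eax, %ecx             ->    3*a
    ebp = a + 8*ecx        ! lea (%eax,%ecx,8), %ebp     ->   25*a
    ebp = ebp + 4*ebp      ! lea (%ebp,%ebp,4), %ebp     ->  125*a
    ebp = ebp + ebp        ! addl %ebp, %ebp             ->  250*a
    edi = ebp + ebp        ! lea (%ebp,%ebp), %edi       ->  500*a
    edi = edi + edi        ! addl %edi, %edi             -> 1000*a
    ecx = edi + edi        ! lea (%edi,%edi), %ecx       -> 2000*a
    edi = edi + edi + 928  ! lea 928(%edi,%edi), %edi    -> 2000*a + 928
  end subroutine trace_setup
end module addr_trace
```

With a = 1 this gives ecx = 2000 and edi = 2928. In other words, the sequence multiplies %eax by 2000 using shifts and adds, rather than loading a constant.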

Intel_C_Intel · ‎10-13-2004

Dear Customer,

If the code initialized the full array (with trip counts 100, 100, and 5), then the triple nest would fully collapse into a single loop that iterates 100x100x5 times, as illustrated below, where frct_coll1d forms a 1-dim overlay of the original array:

do jim=1, 50000
  frct_coll1d(jim) = 0
enddo

This would vectorize into the following efficient SIMD code:

xorl %eax, %eax
pxor %xmm0, %xmm0
L:
movaps %xmm0, test2_$FRCT(%eax)
movaps %xmm0, 16+test2_$FRCT(%eax)
movaps %xmm0, 32+test2_$FRCT(%eax)
movaps %xmm0, 48+test2_$FRCT(%eax)
addl $64, %eax
cmpl $200000, %eax
jb L
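
The 1-dim overlay can be mimicked in source code. The sketch below is an assumption-laden illustration, not what the compiler does internally: it relies on Fortran sequence association to view the rank-3 array through a rank-1 dummy argument, and the names collapse_full, zero_flat, and zero_nest are made up. With 4-byte default reals, the 50000 elements span 200000 bytes, which matches the loop bound in the listing.

```fortran
module collapse_full
  implicit none
contains
  ! Zero n contiguous reals through a rank-1 view (the "1-dim overlay").
  subroutine zero_flat(flat, n)
    integer, intent(in) :: n
    real, intent(out)   :: flat(n)
    integer :: jim
    do jim = 1, n
      flat(jim) = 0.0
    enddo
  end subroutine zero_flat

  ! The same full initialization written as the original triple nest.
  subroutine zero_nest(frct)
    real, intent(out) :: frct(5,100,100)
    integer :: m, i, j
    do j = 1, 100
      do i = 1, 100
        do m = 1, 5
          frct(m,i,j) = 0.0
        enddo
      enddo
    enddo
  end subroutine zero_nest
end module collapse_full
```

Calling zero_flat(frct, size(frct)) on a real frct(5,100,100) touches exactly the same storage as zero_nest(frct), because the full nest walks the array in memory order with no gaps.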

The partial initialization done in your code only allows collapsing the m- and i-loops, yielding an innermost loop with 5x47 = 235 iterations. Since 235 is not a multiple of the vector length of 4 (packed floats), the last three iterations of this collapsed loop must be done sequentially, as illustrated below, where frct_coll2d forms a 2-dim overlay:

do j=1, 44
  do im=1, 232, 4
    frct_coll2d(im:im+3:1, j) = 0
  enddo
  frct_coll2d(233, j) = 0
  frct_coll2d(234, j) = 0
  frct_coll2d(235, j) = 0
enddo

The intermediate representation of this overlay seems to disable some back-end optimizations, which eventually yields suboptimal code for the address setup.
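
The split into a vector body and a scalar remainder can be written out as a small sketch. It assumes the 2-dim overlay has shape (500,100), that is, the m- and i-dimensions of frct(5,100,100) fused into one column of 5x100 elements; the names collapse_partial, vl, nim, and zero_partial are hypothetical.

```fortran
module collapse_partial
  implicit none
  integer, parameter :: vl  = 4      ! packed single-precision lanes per SSE register
  integer, parameter :: nim = 5*47   ! collapsed m/i trip count = 235
contains
  subroutine zero_partial(coll)
    real, intent(inout) :: coll(500,100)   ! 2-dim overlay of frct(5,100,100)
    integer :: j, im, nvec
    nvec = (nim/vl)*vl                     ! 232 elements covered by full vectors
    do j = 1, 44
      do im = 1, nvec, vl
        coll(im:im+vl-1, j) = 0.0          ! one 4-wide store
      enddo
      do im = nvec+1, nim
        coll(im, j) = 0.0                  ! scalar remainder: 233, 234, 235
      enddo
    enddo
  end subroutine zero_partial
end module collapse_partial
```

Passing a real frct(5,100,100) to zero_partial clears frct(1:5,1:47,1:44) and leaves the rest of the array untouched, the same result as the original triple nest.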

Please allow us to investigate this issue further. In the meantime, many thanks for bringing this to our attention!

Aart Bik
http://www.aartbik.com

Message Edited by abik on 10-14-2004 01:59 PM

Intel_C_Intel · ‎10-15-2004

Dear Customer,

Please call me Aart.

Your second example illustrates the same problem, namely the lack of strength reduction on parts of address computations (luckily only at higher nesting levels). I spoke with a code-generation expert and we were able to improve the interaction between the vectorizer and subsequent optimization phases. For your initial example, full strength reduction now occurs on address computations, as can be seen below.

..B1.3:
movl %eax, %esi
lea 928(%eax), %ebx
..B1.4:
movaps %xmm0, test2_$FRCT(%esi)
movaps %xmm0, 16+test2_$FRCT(%esi)
addl $32, %esi
cmpl %esi, %ebx
ja ..B1.4
..B1.5:
movl %edi, 928+test2_$FRCT(%eax)
movl %edx, 932+test2_$FRCT(%eax)
movl %ecx, 936+test2_$FRCT(%eax)
addl $2000, %eax
cmpl $88000, %eax
jb ..B1.3
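
The effect of strength reduction on the j-offset can be sketched in source form. This is an illustration under assumptions: %eax is taken to start at 0 before ..B1.3 (the listing does not show its initialization), reals are 4 bytes so that consecutive j planes lie 5x100x4 = 2000 bytes apart, and the names strength_reduction, offsets_mul, and offsets_add are invented.

```fortran
module strength_reduction
  implicit none
  integer, parameter :: stride = 5*100*4   ! bytes between consecutive j planes = 2000
contains
  ! Byte offset recomputed with a multiplication on every outer iteration.
  subroutine offsets_mul(off)
    integer, intent(out) :: off(44)
    integer :: j
    do j = 1, 44
      off(j) = (j-1)*stride
    enddo
  end subroutine offsets_mul

  ! Same offsets from an induction variable: one add per outer iteration.
  subroutine offsets_add(off)
    integer, intent(out) :: off(44)
    integer :: j, eax
    eax = 0                     ! assumed starting value of %eax
    do j = 1, 44
      off(j) = eax
      eax = eax + stride        ! addl $2000, %eax
    enddo
  end subroutine offsets_add
end module strength_reduction
```

Both routines produce the offsets 0, 2000, ..., 86000; the induction version stops once the running value reaches 88000 = 44x2000, which corresponds to the cmpl $88000 bound in the listing.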

Likewise, the multiplication in your second example is now replaced by an induction sequence. For these particular loops, I did not observe much performance difference, but I believe the improvement may be profitable in many other instances, and I thank you for reporting the problem!

Aart Bik
http://www.aartbik.com