real frct(5,100,100)
integer m,i,j,k
do j=1, 44
  do i=1, 47
    do m=1, 5
      frct(m,i,j)=0.0d+0
    enddo
  enddo
enddo
end
addl %ecx, %ecx #8.20
subl %eax, %ecx #8.20
lea (%eax,%ecx,8), %ebp #8.20
lea (%ebp,%ebp,4), %ebp #8.20
addl %ebp, %ebp #8.20
lea (%ebp,%ebp), %edi #8.20
addl %edi, %edi #8.20
lea (%edi,%edi), %ecx #8.20
lea 928(%edi,%edi), %edi #8.20
.align 4,0x90
Dear Customer,
If the subroutine initialized the full array (with trip counts 100, 100, and 5), the triple nest would fully collapse into a single loop that iterates 100x100x5 = 50000 times, as illustrated below, where frct_coll1d forms a 1-dim overlay of the original array:
do jim=1, 50000
frct_coll1d(jim) = 0
enddo
This would vectorize into the following efficient SIMD code:
xorl %eax, %eax
pxor %xmm0, %xmm0
L:
movaps %xmm0, test2_$FRCT(%eax)
movaps %xmm0, 16+test2_$FRCT(%eax)
movaps %xmm0, 32+test2_$FRCT(%eax)
movaps %xmm0, 48+test2_$FRCT(%eax)
addl $64, %eax
cmpl $200000, %eax
jb L
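For anyone who wants to try the collapsed form at source level, here is a minimal sketch of the 1-dim overlay. It assumes that an EQUIVALENCE statement is an acceptable way to mimic the overlay; the compiler builds its overlay internally, so the program name and the final check are illustrative only.

```fortran
program check_full_collapse
  implicit none
  real :: frct(5,100,100)        ! original 3-D array
  real :: frct_coll1d(50000)     ! 1-D overlay: 50000 = 5*100*100
  integer :: jim
  ! Assumption: EQUIVALENCE stands in for the compiler's internal overlay.
  equivalence (frct, frct_coll1d)

  frct = 1.0                     ! fill with a nonzero marker first

  ! The collapsed loop: one pass over all 50000 elements.
  do jim = 1, 50000
    frct_coll1d(jim) = 0.0
  end do

  ! Every element of the 3-D view should now be zero.
  print *, 'all elements zeroed: ', all(frct == 0.0)
end program check_full_collapse
```

With 4-byte reals, 50000 elements occupy 200000 bytes, which matches the loop bound in the listing above.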
The partial initialization done in your code only allows collapsing the m- and i-loops, yielding an innermost loop with 5x47 = 235 iterations. For a vector length of 4 (packed floats), 232 of these iterations are vectorized and the remaining 3 must be done sequentially, as illustrated below, where frct_coll2d forms a 2-dim overlay:
do j=1, 44
  do im=1, 232, 4
    frct_coll2d(im:im+3:1, j) = 0
  enddo
  frct_coll2d(233, j) = 0
  frct_coll2d(234, j) = 0
  frct_coll2d(235, j) = 0
enddo
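The equivalence between the original nest and this partially collapsed form can be checked with a small sketch. It assumes the overlay has shape (500,100), that is, the m and i dimensions merged into one of extent 5x100, and it again uses EQUIVALENCE as a stand-in for the compiler's internal overlay. The array name ref and the program name are hypothetical.

```fortran
program check_partial_collapse
  implicit none
  real :: ref(5,100,100)          ! written by the original triple nest
  real :: frct(5,100,100)         ! written through the overlay
  real :: frct_coll2d(500,100)    ! assumed 2-D overlay: 500 = 5*100
  integer :: m, i, j, im
  equivalence (frct, frct_coll2d)

  ref  = 1.0
  frct = 1.0

  ! Original nest: 44 x 47 x 5 assignments.
  do j = 1, 44
    do i = 1, 47
      do m = 1, 5
        ref(m,i,j) = 0.0
      end do
    end do
  end do

  ! Collapsed form: 58 groups of 4 elements plus 3 leftover elements per j.
  do j = 1, 44
    do im = 1, 232, 4
      frct_coll2d(im:im+3, j) = 0.0
    end do
    frct_coll2d(233, j) = 0.0
    frct_coll2d(234, j) = 0.0
    frct_coll2d(235, j) = 0.0
  end do

  ! Both versions should have zeroed exactly the same elements.
  print *, 'same elements zeroed: ', all(ref == frct)
end program check_partial_collapse
```

Element frct(m,i,j) corresponds to frct_coll2d((i-1)*5+m, j), so i=1..47 and m=1..5 cover im=1..235 without gaps.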
The intermediate representation of this overlay seems to disable some back-end optimizations, which eventually yields suboptimal code for the address setup.
Please allow us to further investigate this issue. In the meantime, many thanks for bringing this to our attention!
Aart Bik
http://www.aartbik.com
Message Edited by abik on 10-14-2004 01:59 PM
Dear Customer,
Please call me Aart.
Your second example illustrates the same problem, namely the lack of strength reduction on parts of address computations (luckily only at higher nesting levels). I spoke with a code-generation expert and we were able to improve the interaction between the vectorizer and subsequent optimization phases. For your initial example, full strength reduction now occurs on address computations, as can be seen below.
..B1.3:
movl %eax, %esi
lea 928(%eax), %ebx
..B1.4:
movaps %xmm0, test2_$FRCT(%esi)
movaps %xmm0, 16+test2_$FRCT(%esi)
addl $32, %esi
cmpl %esi, %ebx
ja ..B1.4
..B1.5:
movl %edi, 928+test2_$FRCT(%eax)
movl %edx, 932+test2_$FRCT(%eax)
movl %ecx, 936+test2_$FRCT(%eax)
addl $2000, %eax
cmpl $88000, %eax
jb ..B1.3
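The constants in this listing can be related back to the source loop, assuming 4-byte reals as in the earlier listings: 928 = 232x4 bytes is the vectorized part of one j-iteration, 2000 = 500x4 bytes is the distance between consecutive j-slices, and 88000 = 44x2000 is the loop bound. The sketch below is only a source-level illustration of what strength reduction means here: the offset obtained by multiplication is compared with the one obtained by repeated addition. All names are hypothetical.

```fortran
program check_strength_reduction
  implicit none
  integer, parameter :: stride = 2000   ! bytes per j-slice: 5*100 reals of 4 bytes
  integer :: j, off_mul, off_ind
  logical :: same

  same = .true.
  off_ind = 0                           ! induction variable, starts at offset 0
  do j = 1, 44
    off_mul = (j - 1) * stride          ! offset recomputed with a multiplication
    if (off_mul /= off_ind) same = .false.
    off_ind = off_ind + stride          ! strength-reduced form: one addition per iteration
  end do

  print *, 'offsets agree: ', same
  print *, 'final offset:  ', off_ind   ! compare with the bound in the cmpl instruction
end program check_strength_reduction
```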
Likewise, the multiplication in your second example is now replaced by an induction sequence. For these particular loops, I did not observe much performance difference, but I believe the improvement may be profitable in many other instances, and I thank you for reporting the problem!
Aart Bik
http://www.aartbik.com
