How to decrease very poor CPI (above 7)?

vasko_anton · ‎01-09-2006

I am doing 3D Gaussian filtering (filter size 5) using SSE. Filtering in the X or Y direction gives a CPI of about 1.2, but filtering in the Z direction gives a CPI above 7. I assumed the cause was cache misses due to the larger addressing stride between filtered values when filtering in the Z direction, so I interleaved the calculation with prefetching, but it did not help. Can prefetching really help in this case? If so, how far ahead should I prefetch (is it sufficient to prefetch one or two loop iterations ahead)?
Please tell me how I can improve this terrible result.

TimP · ‎01-09-2006

I will guess that the reason for bad cache behavior in Z direction is a stride too large for hardware prefetch to work (assuming you use a platform with hardware prefetch, such as P4).
If software prefetching is to be useful, it may have to be much further ahead, using non-temporal hints, if you have a CPU where that makes a difference. You have to consider how many CPU cycles are required to resolve a miss, and how many loop iterations correspond to that.
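As a rough illustration of that calculation, the sketch below converts a miss latency into a prefetch distance measured in loop iterations. The function name is hypothetical, and the numbers in the usage note are assumptions for illustration, not measurements from this thread.

```c
#include <assert.h>

/* Hypothetical helper: how many loop iterations ahead a prefetch must be
   issued so the line can arrive before it is used (rounded up).
   Both inputs are assumed to come from measurement on the target CPU. */
unsigned prefetch_distance_iters(unsigned miss_latency_cycles,
                                 unsigned cycles_per_iter)
{
    assert(cycles_per_iter > 0);
    return (miss_latency_cycles + cycles_per_iter - 1) / cycles_per_iter;
}
```

For example, with an assumed 300-cycle miss and an assumed 20 cycles per iteration, `prefetch_distance_iters(300, 20)` gives 15 iterations, far more than one or two.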
Since you are posting in the VTune forum, we might ask for more detail on what VTune says about the influence of L1, L2, and DTLB misses.
If you can organize your Z filtering so that several X values are filtered in the same inner loop, you may be able to cut down the number of misses significantly. This could be a useful form of cache blocking.
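A minimal scalar sketch of that idea follows, assuming the volume is stored with X as the fastest-varying index and that border planes are handled elsewhere. The function name, parameter names, and layout are assumptions for illustration; a real version would use SSE in the inner loop.

```c
#include <assert.h>
#include <stddef.h>

/* 5-tap symmetric filter along Z for a volume stored x-fastest:
   index = (z * ny + y) * nx + x. Writes output planes 2 .. nz-3 only;
   border planes are left untouched. c0 is the outermost coefficient,
   c1 the next one in, c2 the centre. */
void filter_z_rows(const float *in, float *out,
                   size_t nx, size_t ny, size_t nz,
                   float c0, float c1, float c2)
{
    assert(nz >= 5);
    const size_t plane = nx * ny;   /* Z stride in elements */
    for (size_t z = 2; z + 2 < nz; ++z) {
        for (size_t y = 0; y < ny; ++y) {
            const float *p = in + (z - 2) * plane + y * nx;
            float *q = out + z * plane + y * nx;
            /* Inner loop walks X: each of the five source rows is read
               contiguously, so one cache line and one page translation
               serve many consecutive outputs. */
            for (size_t x = 0; x < nx; ++x) {
                q[x] = c0 * (p[x] + p[x + 4 * plane])
                     + c1 * (p[x + plane] + p[x + 3 * plane])
                     + c2 * p[x + 2 * plane];
            }
        }
    }
}
```

Whether this reduces misses in practice depends on the row length and on how many concurrent streams the hardware prefetcher can follow, so it is worth checking with VTune.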
In database applications on HyperThreaded CPUs, the standard technique for mitigating TLB misses is to thread the application, so that one thread can progress while the other is stalled on a TLB miss. If you think this is not a clean way to operate, I will not argue against you.

vasko_anton · ‎01-09-2006

The code (snippet) responsible for the poor CPI and cache misses:

; edx=stride
; eax=3*edx
; xmm0-xmm2 are coefficients
align 16
.Label
movaps xmm3,[esi]
movaps xmm4,[esi + edx]
movaps xmm5,[esi +2* edx]
movaps xmm6,[esi + eax]
movaps xmm7,[esi +4* edx]

addps xmm4,xmm6 ; gaussian is symmetric: rows 1 and 3 share a coefficient
addps xmm3,xmm7 ; rows 0 and 4 share a coefficient

mulps xmm5,xmm2
mulps xmm4,xmm1
mulps xmm3,xmm0

prefetchnta [esi]
prefetchnta [esi + edx+16]
prefetchnta [esi +2* edx+16]
prefetchnta [esi + eax+16]
prefetchnta [esi +4* edx+16]

add edi,16
add esi,16

addps xmm5,xmm4
addps xmm5,xmm3

sub ecx,16
movaps [edi-16],xmm5
jnz .Label

VTune reports about 57% L2 cache read misses and 32% DTLB Walks(TI). Compared with filtering in the X or Y direction, this is very high. I tried changing the constant 16 in the prefetch instructions, but it did not help.