Please tell me how I can reduce these terrible miss rates.

For software prefetching to be useful, the prefetches may have to be issued much further ahead of the loads, with non-temporal hints if you have a CPU where they make a difference. Consider how many CPU cycles it takes to resolve a miss, and how many loop iterations that corresponds to.
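To make the "how many iterations" arithmetic concrete, here is a minimal C sketch. The helper names and every number fed to them are hypothetical; the miss latency and the cycles per loop iteration have to be measured on the actual CPU, they do not come from this thread.

```c
#include <assert.h>

/* Hypothetical helper: number of loop iterations a prefetch must lead its
   load by, so that the line has arrived when the load executes.
   Both arguments are assumptions to be measured, not known constants. */
static unsigned prefetch_distance_iters(unsigned miss_latency_cycles,
                                        unsigned cycles_per_iter)
{
    assert(cycles_per_iter > 0);
    /* Round up: a partial iteration of lead is not enough. */
    return (miss_latency_cycles + cycles_per_iter - 1) / cycles_per_iter;
}

/* Same distance expressed as a byte offset for the prefetch address,
   given how many bytes the pointer advances per iteration. */
static unsigned prefetch_distance_bytes(unsigned miss_latency_cycles,
                                        unsigned cycles_per_iter,
                                        unsigned bytes_per_iter)
{
    return prefetch_distance_iters(miss_latency_cycles, cycles_per_iter)
           * bytes_per_iter;
}
```

For example, if a miss took 300 cycles and one iteration took 12 cycles while advancing 16 bytes (all three are placeholder values), prefetch_distance_bytes(300, 12, 16) gives 400, i.e. 25 iterations ahead rather than one.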

Since you are posting in the VTune forum, we might ask for more detail on what VTune says about the influence of L1, L2, and DTLB misses.

If you can organize your Z filtering so that several X values are filtered in the same inner loop, you may be able to cut down the number of misses significantly. This could be a useful form of cache blocking.
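One possible shape for that blocking, as a scalar C sketch rather than SSE assembly. It assumes the volume is stored plane by plane, with `plane` floats per Z slice (so `plane` plays the role of the stride), a 5-tap symmetric kernel, and output only where all five taps exist. The function names and the `tile` parameter are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { TAPS = 5 };

/* One output sample of a symmetric 5-tap filter along Z.
   p points at tap 0; successive taps are `plane` floats apart. */
static inline float tap5(const float *p, size_t plane, const float c[3])
{
    return c[0] * (p[0] + p[4 * plane])
         + c[1] * (p[plane] + p[3 * plane])
         + c[2] * p[2 * plane];
}

/* Unblocked order: each output plane streams through five whole input
   planes, so a large plane is evicted before the next z reuses it. */
static void filter_z_naive(const float *in, float *out,
                           size_t plane, size_t nz, const float c[3])
{
    for (size_t z = 0; z + TAPS <= nz; ++z)
        for (size_t i = 0; i < plane; ++i)
            out[z * plane + i] = tap5(in + z * plane + i, plane, c);
}

/* Blocked order: take a tile of `tile` neighbouring samples and walk it
   through all z before moving on. Four of the five input rows needed at
   z+1 were just touched at z, so a small enough tile stays in cache. */
static void filter_z_blocked(const float *in, float *out,
                             size_t plane, size_t nz, const float c[3],
                             size_t tile)
{
    assert(tile > 0);
    for (size_t i0 = 0; i0 < plane; i0 += tile) {
        size_t i1 = (i0 + tile < plane) ? i0 + tile : plane;
        for (size_t z = 0; z + TAPS <= nz; ++z)
            for (size_t i = i0; i < i1; ++i)
                out[z * plane + i] = tap5(in + z * plane + i, plane, c);
    }
}
```

Both functions compute the same values; only the traversal order differs. The tile size would have to be tuned so that five tile-wide rows fit comfortably in L1 or L2, and keeping the tile within a few pages should also reduce the number of DTLB entries in use at once.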

In database applications on HyperThreaded CPUs, the standard technique for mitigating TLB misses is to thread the application, so that one thread can make progress while the other is stalled on a TLB miss. If you think this is not a clean way to operate, I will not argue with you.

; edx = stride in bytes
; eax = 3*edx
; xmm0-xmm2 are coefficients
align 16
.Label:
movaps xmm3,[esi]              ; tap 0
movaps xmm4,[esi + edx]        ; tap 1
movaps xmm5,[esi + 2*edx]      ; tap 2 (centre)
movaps xmm6,[esi + eax]        ; tap 3
movaps xmm7,[esi + 4*edx]      ; tap 4
addps xmm4,xmm6                ; gaussian is symmetric: taps 1 and 3 share xmm1
addps xmm3,xmm7                ; taps 0 and 4 share xmm0
mulps xmm5,xmm2
mulps xmm4,xmm1
mulps xmm3,xmm0
prefetchnta [esi]
prefetchnta [esi + edx + 16]
prefetchnta [esi + 2*edx + 16]
prefetchnta [esi + eax + 16]
prefetchnta [esi + 4*edx + 16]
add edi,16
add esi,16
addps xmm5,xmm4
addps xmm5,xmm3
sub ecx,16                     ; 16 bytes (4 floats) done; sets ZF for jnz
movaps [edi-16],xmm5           ; movaps leaves the flags untouched
jnz .Label

VTune reports about 57% L2 cache read misses and 32% DTLB Walks(TI). Compared with filtering in the X or Y direction, this is very large. I tried changing the constant 16 in the prefetch addresses, but it did not help.

