Performance problem probably created by Aliasing Conflicts

johnlinz · ‎08-22-2003

I implemented an image processing algorithm using the new SSE2 instructions (on an IA-32 Pentium 4). For the calculations I'm mainly using the XMM registers, but for one step I have to use a general-purpose register. I found that the execution time decreases from 38 ms to 25 ms if I add a shift instruction, e.g. shl edx,0.

With the help of VTune I figured out that there are fewer "64k Aliasing Conflicts" if the shl edx,0 instruction is added. But there are still a lot of Aliasing Conflicts.

What's the reason for this?
Does anybody have an idea?
How can I reduce the Aliasing Conflicts?

This is a section of the source code:
...
PADDW xmm5,xmm6
PADDW xmm1,xmm2
PADDW xmm4,xmm5

MOVD esi,xmm4 ;copy the low 32 bits of xmm4 into esi
shl esi,16 ;set bits 16-31 to zero
shr esi,16

shl edx,0 ; no modification

sub esi,edx
shr esi,5 ; divide esi by 32
MOV [edi],esi
inc edi
...

Thanks a lot!

Johannes

TimP · ‎08-22-2003

64k aliasing conflicts are cache evictions associated with mapping conflicts between multiple data regions. VTune will show "a lot" of them, no matter what. Changes of less than a factor of 2 in the number reported by VTune may not be meaningful. The usual prescription for alleviating them is to offset the relative placement of data regions which need to share cache. You want to avoid needing cache lines which are exactly 64k apart. You have given no information to indicate whether this may be relevant. 64k aliasing has been recognized as a basic deficiency which should be remedied in the latest round of steppings, so the need to deal with it should be limited to chips produced in the past.

johnlinz · ‎08-22-2003

I changed this code section:
"shl esi,16
shr esi,16
shl edx,0 "
to
"and esi,0xffff"
And I got the same performance results.
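
For reference, here is a small C sketch of why the two forms should give the same value. It assumes the single-instruction replacement is an AND with 0xffff; the function names are made up for illustration and are not part of the original code.

```c
#include <stdint.h>

/* Two-shift form from the listing: shl esi,16 followed by shr esi,16.
   The left shift discards bits 16-31, the logical right shift moves
   the surviving low half back into place. */
static uint32_t low16_by_shifts(uint32_t x) {
    x <<= 16;
    x >>= 16;
    return x;
}

/* Single-mask form (assumed to be what replaced the shifts). */
static uint32_t low16_by_mask(uint32_t x) {
    return x & 0xFFFFu;
}
```

Both keep bits 0-15 and clear bits 16-31, so any timing difference between the variants would come from instruction scheduling rather than from the computed result.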

With VTune I got this value:
((64k Aliasing Conflicts*12) / Clockticks)*100 = 18,55

How can I offset the relative placement of data regions?

bnshah · ‎09-05-2003

To fix 64k aliasing conflicts you need to step through your code with a debugger and look at the different memory addresses you are loading from and storing to. If you see addresses that are offset by a multiple of 64k (to be exact, with address bits 7-15 the same, since we really care about cache lines here, not individual addresses) then you have a potential aliasing problem. However, from the code you provided it does not look like this is the problem. It looks like this code runs for too short a time to capture any useful data with VTune. You should make your loop run for at least 10 s, then see what the performance impact of your code changes is. The performance difference (38 ms vs 25 ms) may just be noise.
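
The address check described above can be sketched in C as follows. This is only an illustration: the function name and constants are hypothetical, and the 128-byte granularity is an assumption chosen to match "address bits 7-15"; adjust LINE_SIZE if your chip uses a different line size.

```c
#include <stdint.h>
#include <stdbool.h>

#define ALIAS_SPAN 65536u  /* 64 KB: only address bits 0-15 are compared */
#define LINE_SIZE  128u    /* assumed granularity, so bits 0-6 are ignored */

/* Returns true when two addresses fall in different lines but agree in
   address bits 7-15, i.e. they are candidates for a 64k aliasing conflict. */
static bool may_alias_64k(uintptr_t a, uintptr_t b) {
    uintptr_t slot_a = (a % ALIAS_SPAN) / LINE_SIZE;
    uintptr_t slot_b = (b % ALIAS_SPAN) / LINE_SIZE;
    bool same_line = (a / LINE_SIZE) == (b / LINE_SIZE);
    return slot_a == slot_b && !same_line;
}
```

You could call it with the load and store addresses you read off in the debugger, e.g. the source and destination pointers of the inner loop.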

However, if it really is a 64k aliasing problem, you can fix it by padding your data structures so the access patterns aren't multiples of 64k apart.
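
One possible way to do the padding, sketched in C under stated assumptions: two buffers are carved out of a single allocation, and the second one is pushed forward whenever its start would land close to a multiple of 64k from the first. The helper name and the 4096-byte pad are hypothetical choices, not values taken from Intel documentation.

```c
#include <stddef.h>

#define ALIAS_SPAN 65536u  /* 64 KB */
#define PAD_BYTES  4096u   /* hypothetical minimum distance from a 64k multiple */

/* Given the smallest offset at which the second buffer could start
   (typically the size of the first buffer), return an offset that is
   at least PAD_BYTES away from every multiple of 64k. */
static size_t staggered_offset(size_t min_offset) {
    size_t rem = min_offset % ALIAS_SPAN;
    if (rem < PAD_BYTES)                      /* just past a multiple */
        return min_offset + (PAD_BYTES - rem);
    if (rem > ALIAS_SPAN - PAD_BYTES)         /* just before a multiple */
        return min_offset + (ALIAS_SPAN - rem) + PAD_BYTES;
    return min_offset;                        /* already far enough away */
}
```

With this, you would allocate staggered_offset(size_a) + size_b bytes once and place the second buffer at base + staggered_offset(size_a). It only staggers the buffer starts, so it helps when both buffers are walked at the same index; other access patterns need their own analysis.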