Performance problem probably created by Aliasing Conflicts
I implemented an image processing algorithm using the new SSE2 instructions (on an IA-32/P4). For the calculation I'm mainly using the XMM registers, but for one step I have to use a general-purpose register. I found that the execution time drops from 38 ms to 25 ms if I add a shift instruction, e.g. shl edx,0.
With the help of VTune I figured out that there are fewer "64k Aliasing Conflicts" if the shl edx,0 instruction is added. But there are still a lot of Aliasing Conflicts.
What's the reason for this? Does anybody have an idea? How can I reduce the Aliasing Conflicts?
This is a section of the source code:
...
PADDW xmm5,xmm6
PADDW xmm1,xmm2
PADDW xmm4,xmm5
MOVD esi,xmm4 ; copy the low 32 bits of xmm4 into esi
shl esi,16 ; clear bits 16-31 (keep only the low 16 bits)
shr esi,16
shl edx,0 ; no modification
sub esi,edx
shr esi,5 ; divide esi by 32
MOV [edi],esi
inc edi
...
64k aliasing conflicts are cache evictions associated with mapping conflicts between multiple data regions. VTune will show "a lot" of them no matter what, and a change of less than a factor of 2 in the number it reports may not be meaningful. The usual prescription for alleviating them is to offset the relative placement of the data regions which need to share cache, so that you avoid needing cache lines which are exactly 64k apart. You have given no information to indicate whether this is relevant here. 64k aliasing has been recognized as a basic deficiency which should be remedied in the latest round of steppings, so the need to deal with it should be limited to chips produced in the past.
To fix 64k aliasing conflicts you need to step through your code with a debugger and look at the different memory addresses you are loading from and storing to. If you see addresses that are offset by a multiple of 64k (to be exact, addresses whose bits 7-15 are the same, since we really care about cache lines here, not individual addresses), then you have a potential aliasing problem. However, from the code you provided it does not look like this is the problem. It looks like this code runs for too short a time to capture any useful data with VTune. You should make your loop run for at least 10 seconds, then see what the performance impact of your code changes is. The performance difference (38 ms vs. 25 ms) may just be noise.
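The address check described above can also be written down as a small helper, so you can feed it the addresses you read off in the debugger. This is only a sketch in C (assuming the assembly is called from C/C++ code): the function name may_alias_64k and the two bit constants are made up for illustration, and the bit range 7-15 is simply taken from the description above, so adjust it if your cache-line size differs.

```c
#include <stdint.h>

/* Bit range compared for aliasing. Taken from the reply above (bits 7-15);
   this is an assumption, adjust ALIAS_LOW_BIT for a different line size. */
#define ALIAS_LOW_BIT  7u
#define ALIAS_HIGH_BIT 15u

/* Hypothetical helper: returns 1 if the two addresses lie in different
   64k regions but agree in bits ALIAS_LOW_BIT..ALIAS_HIGH_BIT, i.e. they
   are candidates for a 64k aliasing conflict; returns 0 otherwise. */
static int may_alias_64k(uintptr_t a, uintptr_t b)
{
    uintptr_t low_mask  = ((uintptr_t)1 << ALIAS_LOW_BIT) - 1;        /* bits 0-6  */
    uintptr_t span_mask = ((uintptr_t)1 << (ALIAS_HIGH_BIT + 1)) - 1; /* bits 0-15 */
    uintptr_t mask      = span_mask & ~low_mask;                      /* bits 7-15 */

    /* Addresses inside the same 64k region cannot be 64k apart. */
    if ((a >> (ALIAS_HIGH_BIT + 1)) == (b >> (ALIAS_HIGH_BIT + 1)))
        return 0;

    return ((a ^ b) & mask) == 0;
}
```

For example, may_alias_64k(0x10000, 0x20000) reports a candidate conflict, while may_alias_64k(0x10000, 0x20080) does not, because the second address differs in bit 7.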
However, if it really is a 64k aliasing problem, you can fix it by padding your data structures so that the access patterns aren't multiples of 64k apart.
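As a rough illustration of the padding idea, the sketch below computes the distance to leave between two buffers carved out of one allocation (say, a source and a destination image) so that the same index in both never lands exactly a multiple of 64k apart. The names padded_stride, ALIAS_SPAN and PAD_BYTES are hypothetical, and the 128-byte pad is an assumption chosen to match the bits 7-15 rule mentioned earlier in the thread; it is not something VTune or Intel prescribes.

```c
#include <stddef.h>

#define ALIAS_SPAN 65536u /* 64k: distance at which accesses alias          */
#define PAD_BYTES  128u   /* assumed pad size, one step of address bit 7    */

/* Hypothetical helper: byte distance from the start of the first buffer
   to the start of the second one. The buffer size is rounded up to a
   multiple of PAD_BYTES; if the result is a multiple of 64k, one extra
   pad is added so matching elements no longer alias. */
static size_t padded_stride(size_t buf_bytes)
{
    size_t stride = (buf_bytes + PAD_BYTES - 1) / PAD_BYTES * PAD_BYTES;

    if (stride % ALIAS_SPAN == 0)
        stride += PAD_BYTES;

    return stride;
}
```

You would then allocate padded_stride(n) + n bytes and place the second buffer at base + padded_stride(n). A 64k buffer gets a stride of 65664 bytes instead of 65536, while a 1000-byte buffer is only rounded up to 1024.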