store bandwidth issue

incoming4u · 05-13-2010

Hi,

I am trying to improve memory copy performance using SSE on a Xeon 5310 (1.6 GHz) with DDR2-667 RAM.
Here is the code I use to test the bandwidth for writing to RAM:

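rdtsc                      # read time-stamp counter into edx:eax
movl %eax,time1            # save low 32 bits of the start time
movl %edx,time1+4          # save high 32 bits of the start time

loop:
movdqa %xmm0,(%edi)        # eight aligned 16-byte stores = 128 bytes per iteration
movdqa %xmm1,16(%edi)
movdqa %xmm2,32(%edi)
movdqa %xmm3,48(%edi)
movdqa %xmm4,64(%edi)
movdqa %xmm5,80(%edi)
movdqa %xmm6,96(%edi)
movdqa %xmm7,112(%edi)
addl $128,%edi             # advance the destination pointer by 128 bytes
dec %ecx                   # ecx = number of 128-byte blocks left to write
jnz loop

rdtsc                      # read time-stamp counter again
movl %eax,time2            # save low 32 bits of the end time
movl %edx,time2+4          # save high 32 bits of the end time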

The problem is that when ecx is set between 0 and 31 (0 to 4 kB), the total cost is 1xxx clocks, but when ecx is set between 32 and 64 (4 kB to 8 kB), the cost rises to 6xxx clocks. It seems that every additional 4 kB block causes a jump of about 5xxx clocks.
I tried to prefetch 4 kB ahead before the loop by touching the memory, for instance:

movl %eax,4096(%edi)
movl %eax,8192(%edi)

But each prefetch costs 5xxx clocks, so it doesn't help. I also tried movntdq, but the result got worse.
According to the current results, the write bandwidth can't exceed 1 GB/s. The RAM I installed is DDR2-667, which I think has a theoretical bandwidth of 5 GB/s. Is this an OS issue or a CPU cache issue? BTW, the OS is Linux kernel 2.6.9-78.
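
As a rough sanity check on these figures, the arithmetic can be sketched in C. This is only a sketch under stated assumptions: the function names are made up for illustration, the peak figure assumes a single 64-bit (8-byte) memory channel, and the conversion from clocks to seconds assumes the time-stamp counter ticks at the nominal 1.6 GHz core clock.

```c
/* Theoretical peak of one 64-bit DDR channel, in bytes per second.
 * mega_transfers_per_sec is the DDR data rate, e.g. 667 for DDR2-667.
 * Assumption: a single channel with an 8-byte-wide data bus. */
double ddr_peak_bytes_per_sec(double mega_transfers_per_sec)
{
    const double bus_width_bytes = 8.0;
    return mega_transfers_per_sec * 1e6 * bus_width_bytes;
}

/* Bandwidth implied by writing `bytes` bytes in `cycles` TSC ticks.
 * Assumption: the TSC runs at tsc_hz (1.6e9 for the nominal clock here). */
double measured_bytes_per_sec(double bytes, double cycles, double tsc_hz)
{
    return bytes * tsc_hz / cycles;
}
```

For example, `ddr_peak_bytes_per_sec(667.0)` gives about 5.3 GB/s, which matches the "5GB/s" estimate. Taking a purely hypothetical 5000 clocks for one 4 kB block (the exact counts are only given as 5xxx), `measured_bytes_per_sec(4096.0, 5000.0, 1.6e9)` gives about 1.31 GB/s.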

Any ideas will be appreciated.
Thanks

Aubrey_W_ · 06-10-2010

I have moved this from the General Contest Questions forum, since it's more of a general programming question.

==
Aubrey W.
Intel Software Network Support