Hi,
I am trying to improve memory-copy performance using SSE on a Xeon 5310 (1.6 GHz) with DDR2-667 memory.
Here is the code I use to test RAM write bandwidth:
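rdtsc                      # start timestamp in edx:eax
movl %eax,time1
movl %edx,time1+4
loop:
movdqa %xmm0,(%edi)        # eight aligned 16-byte stores = 128 bytes per iteration
movdqa %xmm1,16(%edi)
movdqa %xmm2,32(%edi)
movdqa %xmm3,48(%edi)
movdqa %xmm4,64(%edi)
movdqa %xmm5,80(%edi)
movdqa %xmm6,96(%edi)
movdqa %xmm7,112(%edi)
addl $128,%edi             # advance to the next 128-byte block
dec %ecx                   # ecx counts the remaining 128-byte blocks
jnz loop
rdtsc                      # end timestamp in edx:eax
movl %eax,time2
movl %edx,time2+4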
The problem is that when ecx is between 0 and 31 (0 to 4 kB), the total cost is 1xxx clocks, but when ecx is between 32 and 64 (4 kB to 8 kB), the cost rises to 6xxx clocks. It seems that every additional 4 kB block causes another jump of about 5xxx clocks.
I tried to prefetch 4 kB ahead before the loop, for instance:
movl %eax,4096(%edi)
movl %eax,8192(%edi)
but each prefetch costs 5xxx clocks, so it doesn't help. I also tried movntdq, but that made it worse.
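The measurement described above can be sketched in C with SSE2 intrinsics, which makes it easier to compare a first write to a buffer against a write to pages that have already been touched. This is only a rough sketch, not the original test program: `time_sse_stores` and `measure` are hypothetical names, GCC or Clang on x86 is assumed, and the idea that the 4 kB steps come from first-touch page faults is an untested guess that the `pretouch` flag is meant to help check.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_store_si128, _mm_set1_epi8 */
#include <x86intrin.h>  /* __rdtsc (GCC/Clang on x86) */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Write `blocks` 128-byte blocks to `dst` (must be 16-byte aligned)
 * and return the elapsed time-stamp-counter ticks. */
uint64_t time_sse_stores(void *dst, size_t blocks)
{
    __m128i v = _mm_set1_epi8(0x5a);
    __m128i *p = (__m128i *)dst;
    uint64_t t0 = __rdtsc();
    for (size_t i = 0; i < blocks; i++) {
        for (int j = 0; j < 8; j++)       /* 8 x 16 bytes, like xmm0..xmm7 */
            _mm_store_si128(p + j, v);    /* aligned 16-byte store */
        p += 8;                           /* next 128-byte block */
    }
    /* Compiler barrier: do not drop or reorder the stores past this point. */
    __asm__ volatile("" : : "r"(dst) : "memory");
    return __rdtsc() - t0;
}

/* Hypothetical helper: time a write of `bytes` bytes (a multiple of 4096).
 * With pretouch != 0 the buffer is written once before timing, so the
 * timed pass does not include the first access to each page. */
uint64_t measure(size_t bytes, int pretouch)
{
    void *buf = aligned_alloc(4096, bytes);
    if (buf == NULL)
        return 0;
    if (pretouch)
        memset(buf, 0, bytes);
    uint64_t ticks = time_sse_stores(buf, bytes / 128);
    free(buf);
    return ticks;
}
```

Calling `measure(8192, 0)` and `measure(8192, 1)` a few times and comparing the tick counts should show whether the extra cost per 4 kB only appears on the first write to each page.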
According to these results, the write bandwidth can't exceed 1 GB/s. The RAM I installed is DDR2-667, which I think has a theoretical bandwidth of 5 GB/s. Is this an OS issue or a CPU cache issue? BTW, the OS is Linux kernel 2.6.9-78.
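For reference, the bandwidth figures above can be worked out as follows. DDR2-667 performs about 667 million transfers per second over a 64-bit (8-byte) channel, which gives roughly 5.3 GB/s per channel and matches the 5 GB/s estimate. The helper names below are hypothetical, and converting clocks to seconds assumes the time-stamp counter runs at the 1.6 GHz core frequency. No measured values are plugged in, since the clock counts are only given as 1xxx and 6xxx.

```c
#include <stdint.h>

/* Theoretical peak of one memory channel in bytes per second:
 * transfers per second times bytes per transfer. */
double peak_bandwidth(double transfers_per_sec, unsigned bus_bytes)
{
    return transfers_per_sec * bus_bytes;
}

/* Achieved bandwidth in bytes per second, given the bytes written,
 * the elapsed TSC ticks, and the TSC frequency in Hz (assumed to be
 * the 1.6 GHz core clock on this machine). */
double achieved_bandwidth(uint64_t bytes, uint64_t ticks, double tsc_hz)
{
    return (double)bytes * tsc_hz / (double)ticks;
}
```

For example, `peak_bandwidth(667e6, 8)` gives the per-channel peak, and `achieved_bandwidth(bytes, time2 - time1, 1.6e9)` converts a measured clock count into bytes per second for comparison.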
Any ideas would be appreciated.
Thanks
1 Reply
I have moved this from the General Contest Questions forum, since it's more of a general programming question.
==
Aubrey W.
Intel Software Network Support