It seems that the bug was a missing flush after the first 4 store operations. A flush is required for every cache line, which is currently 64 bytes but is subject to change.
Prefetching the destination makes little sense, and the performance remains the same after removing the prefetch on store operations (this was actually a build problem).
An lfence operation is useful to ensure serialization. The final code prefetches 2 cache lines, flushes the loads, and flushes each 64-byte write.
Another interesting thing is that the compiler uses the XMM0 register exclusively. The performance remains the same even for a hand-written assembly routine that uses registers XMM6 through XMM13.
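For anyone following along, here is a rough sketch in C intrinsics of the loop shape described above. It is not the actual code from the thread. The function name `copy_flush_sketch` and the `CACHE_LINE` constant are made up for illustration, and reading "flush" as `_mm_clflush` on the destination line is my assumption, as is the requirement that both buffers are 64-byte aligned and the length is a multiple of 64.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_prefetch */
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size; the post notes this value may change. */
#define CACHE_LINE 64u

/* Hypothetical helper: copies `bytes` from src to dst, one cache line per
 * iteration. Both pointers must be 64-byte aligned and `bytes` must be a
 * multiple of 64 (assumptions of this sketch, not stated in the post). */
void copy_flush_sketch(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst % CACHE_LINE) == 0);
    assert(((uintptr_t)src % CACHE_LINE) == 0);
    assert(bytes % CACHE_LINE == 0);

    const char *s = (const char *)src;
    char *d = (char *)dst;

    for (size_t off = 0; off < bytes; off += CACHE_LINE) {
        /* Prefetch the next 2 source cache lines (source only, since
         * prefetching the destination did not help). */
        if (off + CACHE_LINE < bytes)
            _mm_prefetch(s + off + CACHE_LINE, _MM_HINT_T0);
        if (off + 2 * CACHE_LINE < bytes)
            _mm_prefetch(s + off + 2 * CACHE_LINE, _MM_HINT_T0);

        /* Four 16-byte loads cover one 64-byte line. */
        __m128i x0 = _mm_load_si128((const __m128i *)(s + off));
        __m128i x1 = _mm_load_si128((const __m128i *)(s + off + 16));
        __m128i x2 = _mm_load_si128((const __m128i *)(s + off + 32));
        __m128i x3 = _mm_load_si128((const __m128i *)(s + off + 48));

        /* Serialize the loads before the stores are issued. */
        _mm_lfence();

        /* Four 16-byte stores, then one flush for the whole line. */
        _mm_store_si128((__m128i *)(d + off), x0);
        _mm_store_si128((__m128i *)(d + off + 16), x1);
        _mm_store_si128((__m128i *)(d + off + 32), x2);
        _mm_store_si128((__m128i *)(d + off + 48), x3);
        _mm_clflush(d + off);
    }

    /* Make the stores and flushes globally visible before returning. */
    _mm_mfence();
}
```

The point of the sketch is only the placement: one flush per 4 stores, which is the step that was missing in the buggy version. Whether the compiler keeps everything in XMM0 or spreads it over four registers should not matter much, per the observation above.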
This post was intended for the "SSE2 on EM64T and Win64" thread. The link is http://softwareforums.intel.com/ids/board/message?board.id=13&message.id=1260