Analyzers
Community support for Analyzers (Intel VTune™ Profiler, Intel Advisor, Intel Inspector)

How to solve it

caliniaru
Beginner
281 Views
It seems the bug was a missing flush after the first 4 store operations. A flush is required for every cache line written; a line is currently 64 bytes (four 16-byte SSE2 stores), but that size is subject to change.

Prefetching the destination makes little sense, and the performance remains the same after removing the prefetch on store operations (this was actually a build problem).

An lfence operation is useful to ensure serialization. The final code prefetches 2 cache lines, flushes the loads, and flushes each 64-byte write.
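
The original routine is not shown in this post, so the following is only a minimal sketch of the loop as described, using SSE2 intrinsics. It assumes that "flush" on the store side means `_mm_clflush`, and it models "flushes the loads" as an `_mm_lfence` after the four loads; both are interpretations, not the confirmed code. The name `copy_flush_per_line` and the `CACHE_LINE` constant are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in _mm_prefetch */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines, as in the post */

/* Hypothetical helper: copy n bytes from src to dst.
 * Both pointers must be 16-byte aligned and n a multiple of CACHE_LINE. */
static void copy_flush_per_line(void *dst, const void *src, size_t n)
{
    const char *s = (const char *)src;
    char *d = (char *)dst;

    assert(n % CACHE_LINE == 0);
    assert(((uintptr_t)s % 16) == 0 && ((uintptr_t)d % 16) == 0);

    for (size_t off = 0; off < n; off += CACHE_LINE) {
        /* Prefetch the next 2 source cache lines; the destination is
         * deliberately not prefetched. */
        if (off + CACHE_LINE < n)
            _mm_prefetch(s + off + CACHE_LINE, _MM_HINT_T0);
        if (off + 2 * CACHE_LINE < n)
            _mm_prefetch(s + off + 2 * CACHE_LINE, _MM_HINT_T0);

        /* Four 16-byte loads cover one 64-byte line. */
        __m128i r0 = _mm_load_si128((const __m128i *)(s + off));
        __m128i r1 = _mm_load_si128((const __m128i *)(s + off + 16));
        __m128i r2 = _mm_load_si128((const __m128i *)(s + off + 32));
        __m128i r3 = _mm_load_si128((const __m128i *)(s + off + 48));

        /* Assumed reading of "flushes the loads": order the loads
         * before continuing. */
        _mm_lfence();

        /* Four 16-byte stores fill one destination line. */
        _mm_store_si128((__m128i *)(d + off), r0);
        _mm_store_si128((__m128i *)(d + off + 16), r1);
        _mm_store_si128((__m128i *)(d + off + 32), r2);
        _mm_store_si128((__m128i *)(d + off + 48), r3);

        /* One flush per 64 bytes written (assumed to be clflush). */
        _mm_clflush(d + off);
    }
}
```

Calling `copy_flush_per_line(dst, src, 256)` on two 16-byte-aligned buffers copies four cache lines and issues four flushes, one per line. The flush granularity is hard-coded here; code meant to survive a change in line size would need to query it at run time.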

Another interesting thing is that the compiler uses the XMM0 register exclusively. The performance remains the same even for a hand-written assembly routine that uses registers XMM6 to XMM13.
1 Reply
caliniaru
Beginner
281 Views
This post was intended for the "SSE2 on EM64T and Win64" thread. The link is http://softwareforums.intel.com/ids/board/message?board.id=13&message.id=1260