How to solve it

caliniaru · ‎05-15-2006

It seems that the bug was a missing flush after the first 4 store operations. A flush is required for every cache line - which is 64 bytes, but subject to change.

Prefetching the destination makes little sense. And the performance remains the same after removing the prefetch on store operations (this was actually a build problem).

An lfence operation is useful in order to ensure serialization. The final code prefetches 2 cache lines, flushes the loads and flushes each 64 bytes writes.

Another interesting thing is that the compiler uses XMM0 register exclusively. The performance remains the same even for a written assembly routine which uses XMM6 to XMM13 registers.

caliniaru · ‎05-15-2006

.. this post was intended for the "SSE2 on EM64T and Win64" thread. The link is http://softwareforums.intel.com/ids/board/message?board.id=13&message.id=1260