
How to solve it

caliniaru
It seems the bug was a missing flush after the first 4 store operations. A flush is required for every cache line, which is currently 64 bytes but may change on other processors.
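Since the line size may change, one option is to query it at run time instead of hard-coding 64. The sketch below is only an illustration and is not taken from the original routine: it assumes a GCC or Clang toolchain that provides `<cpuid.h>`, and the function name `clflush_line_size` is hypothetical. It reads CPUID leaf 1, where EBX bits 15:8 report the CLFLUSH line size in 8-byte units when the CLFSH feature bit (EDX bit 19) is set.

```c
#include <cpuid.h>  /* GCC/Clang helper header (assumption: not MSVC) */

/* Hypothetical helper: returns the CLFLUSH line size in bytes,
 * or 0 if the CPU does not report it. */
unsigned clflush_line_size(void)
{
    unsigned eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 when the requested leaf is unsupported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    /* EDX bit 19 (CLFSH) says whether the line-size field is valid. */
    if (!(edx & (1u << 19)))
        return 0;

    /* EBX[15:8] holds the line size in units of 8 bytes. */
    return ((ebx >> 8) & 0xFFu) * 8u;
}
```

A caller could fall back to 64 when the function returns 0.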

Prefetching the destination makes little sense, and performance stays the same after removing the prefetch on store operations (this was actually a build problem).

An lfence operation is useful to ensure serialization. The final code prefetches 2 cache lines, flushes the loads, and flushes after every 64 bytes written.
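The original routine is not shown in this thread, so the following is only a rough sketch of a loop with the shape described above. Several things are assumptions: "flush" is read as a fence (`_mm_lfence` after the loads, `_mm_sfence` after each 64 bytes stored), the stores are non-temporal (`_mm_stream_si128`), "prefetches 2 cache lines" is read as a prefetch distance of two lines, and the name `copy_sse2` is hypothetical. With 16-byte XMM registers, four stores cover one 64-byte line.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in _mm_prefetch and _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64   /* assumed line size; may differ on other processors */

/* Hypothetical helper: copies n bytes; dst must be 16-byte aligned
 * because _mm_stream_si128 requires an aligned destination. */
void copy_sse2(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t i = 0;

    assert(((uintptr_t)d & 15) == 0);

    for (; i + CACHE_LINE <= n; i += CACHE_LINE) {
        /* Prefetch the source two lines ahead, only while still in bounds. */
        if (i + 2 * CACHE_LINE < n)
            _mm_prefetch((const char *)(s + i + 2 * CACHE_LINE), _MM_HINT_NTA);

        /* Four 16-byte loads fill one cache line's worth of data. */
        __m128i a = _mm_loadu_si128((const __m128i *)(s + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(s + i + 16));
        __m128i c = _mm_loadu_si128((const __m128i *)(s + i + 32));
        __m128i e = _mm_loadu_si128((const __m128i *)(s + i + 48));
        _mm_lfence();   /* order the loads before continuing */

        /* Four non-temporal stores write the same 64 bytes to dst. */
        _mm_stream_si128((__m128i *)(d + i), a);
        _mm_stream_si128((__m128i *)(d + i + 16), b);
        _mm_stream_si128((__m128i *)(d + i + 32), c);
        _mm_stream_si128((__m128i *)(d + i + 48), e);
        _mm_sfence();   /* one store fence per 64 bytes written */
    }

    /* Copy any tail shorter than one cache line the ordinary way. */
    memcpy(d + i, s + i, n - i);
}
```

Fencing on every line is deliberately conservative here; whether it helps or hurts throughput would need to be measured on the target machine.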

Another interesting thing is that the compiler uses the XMM0 register exclusively. Performance remains the same even with a hand-written assembly routine that uses registers XMM6 through XMM13.
caliniaru
This post was intended for the "SSE2 on EM64T and Win64" thread. The link is http://softwareforums.intel.com/ids/board/message?board.id=13&message.id=1260