Hello,
I have code that shows a high level of loads blocked by store overlap (due to 4K aliasing) in VTune. I implemented the same code using SSE, and this bottleneck seems to have disappeared.
But I couldn't find any information on whether the 4K aliasing bottleneck affects SSE code. All the examples use non-SSE code. Is there any documentation on whether SSE load/store instructions are somehow immune to this problem?
Thanks,
Evren
1 Reply
Evren,
Unless you are using non-temporal stores, the behaviour for scalar and SSE loads is the same. Might it be that your data access pattern changed when you rewrote your code for SSE (e.g. a load and a store that were a multiple of 4k apart now have a different distance)?
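To check whether the access distance changed, one can compare the low 12 bits of the store and load addresses. The following is only a rough sketch in C, under a simplified model: it treats a load as a 4K-aliasing candidate when its page-offset range intersects that of an earlier store while the full addresses do not overlap. The helper names `ranges_overlap` and `may_alias_4k` are hypothetical, wraparound at the page boundary is ignored, and the exact comparison done in hardware depends on the microarchitecture.

```c
#include <stddef.h>
#include <stdint.h>

/* Low 12 bits of an address: the offset within a 4 KiB page. */
#define PAGE_OFFSET_MASK ((uintptr_t)0xFFF)

/* Do the byte ranges [a, a + a_len) and [b, b + b_len) intersect? */
int ranges_overlap(uintptr_t a, size_t a_len, uintptr_t b, size_t b_len)
{
    return a < b + b_len && b < a + a_len;
}

/*
 * Hypothetical helper (simplified model, not a hardware specification):
 * returns 1 if a load of load_len bytes at load_addr could be mistaken
 * for depending on an earlier store of store_len bytes at store_addr
 * because their page offsets intersect. Offsets that wrap around the
 * end of a page are not handled.
 */
int may_alias_4k(const void *store_addr, size_t store_len,
                 const void *load_addr, size_t load_len)
{
    uintptr_t s = (uintptr_t)store_addr;
    uintptr_t l = (uintptr_t)load_addr;

    /* A real overlap is a true dependency, not a false 4K alias. */
    if (ranges_overlap(s, store_len, l, load_len))
        return 0;

    return ranges_overlap(s & PAGE_OFFSET_MASK, store_len,
                          l & PAGE_OFFSET_MASK, load_len);
}
```

Under this model, a 4-byte store and a 4-byte load whose addresses differ by 4096 + 8 bytes do not collide, while 16-byte SSE accesses at the same two addresses do, so the access width and the distance both have to be taken into account when comparing the scalar and SSE versions.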
Kind regards
Thomas