Hello,
I'm experimenting with VMOVNTDQ in a benchmark. The idea is to write to memory while bypassing the caches. This works fine with the MOVNTDQ instruction, but with VMOVNTDQ the bandwidth is very low: around 300 MB/s, versus 6000 MB/s for MOVNTDQ.
Does anyone know why VMOVNTDQ is so slow?
Thanks.
Yes, when I use AVX instructions on Sandy Bridge, they are very slow. Why? Thanks!
Xiancai L. wrote:
Yes, when I use AVX instructions on Sandy Bridge, they are very slow. Why? Thanks!
Can you post a test case of your program?
Hi iliyapolak, I used bandwidth to test the AVX instructions. I see that performance is terrible when they are mixed with SSE.
>>>Hi iliyapolak, I used bandwidth to test the AVX instructions. I see that performance is terrible when they are mixed with SSE.>>>
Thanks for the link. The poor performance of AVX instructions intermixed with SSE is a well-known issue. Because the hardware must save and restore the upper halves of the YMM registers, each transition incurs a penalty of a few dozen cycles. 128-bit AVX instructions automatically zero the upper half of the YMM registers. Legacy SSE instructions do not, because they have no "knowledge" of the wider 256-bit registers. You can use Intel SDE to detect AVX-to-SSE transition penalties.
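One common way to avoid the transition penalty described above is to execute VZEROUPPER after a block of 256-bit code, before any legacy SSE code runs. Here is a minimal sketch in C, assuming GCC or Clang on x86. The function names fill_stream_avx and fill_buffer are hypothetical and are not taken from the benchmark discussed in this thread. The sketch writes a buffer with 256-bit non-temporal stores (the compiler emits VMOVNTDQ for _mm256_stream_si256), then calls _mm256_zeroupper so that later SSE code does not see dirty upper YMM halves. It falls back to memset when AVX is unavailable or the buffer is not 32-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: fill `bytes` bytes (a multiple of 32) at a
 * 32-byte-aligned address using 256-bit non-temporal stores. */
__attribute__((target("avx")))
static void fill_stream_avx(void *dst, uint8_t value, size_t bytes)
{
    __m256i v = _mm256_set1_epi8((char)value);
    __m256i *p = (__m256i *)dst;

    for (size_t i = 0; i < bytes / 32; ++i)
        _mm256_stream_si256(p + i, v);   /* non-temporal 256-bit store */

    _mm_sfence();          /* order the streaming stores before later stores */
    _mm256_zeroupper();    /* clear upper YMM halves before any SSE code runs */
}
#endif

/* Hypothetical wrapper: uses the AVX path when the CPU supports it and the
 * destination is 32-byte aligned, otherwise plain memset. */
void fill_buffer(void *dst, uint8_t value, size_t bytes)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx") && ((uintptr_t)dst % 32) == 0) {
        size_t body = bytes - bytes % 32;   /* part covered by 32-byte stores */
        fill_stream_avx(dst, value, body);
        memset((uint8_t *)dst + body, value, bytes - body);  /* tail */
        return;
    }
#endif
    memset(dst, value, bytes);
}
```

Compilers usually insert VZEROUPPER on their own when a whole file is built with AVX enabled, so the explicit call matters most for hand-written assembly or for code that mixes separately compiled AVX and SSE routines, which may be the situation in the benchmark above.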