Zack_-_
Beginner
221 Views

Very slow performance of VMOVNTDQ instruction

Hello,

I'm using VMOVNTDQ as an experiment in a benchmark. The idea is to write to memory while bypassing the caches. This works fine with the MOVNTDQ instruction, but with VMOVNTDQ the performance is very slow: around 300 MB/sec of bandwidth, versus 6000 MB/sec for MOVNTDQ.

Anyone know why VMOVNTDQ is sluggish?

Thanks.

11 Replies
SergeyKostrov
Valued Contributor II

Could you post a test-case, please?
Zack_-_
Beginner

To really visualize this problem, check out the program bandwidth 0.32k:

http://zsmith.co/bandwidth.html#download

In main.c, at line 2551, change #if 0 to #if 1. This will enable the use of VMOVNTDQ. Then run the program. The graph that it produces shows the performance issue. At chunk sizes of more than 512 bytes, VMOVNTDQ writes to memory very slowly. Thanks.
SergeyKostrov
Valued Contributor II

I expected a small test case. Unfortunately, there are some dependencies, such as: "...Lastly, to compile for 32-bit desktop Windows you need the GCC toolchain in the form of Cygwin..." and I simply don't have time to install all of that. I think you need to isolate your problem in a small reproducer with two cases: one for the MOVNTDQ instruction and one for the VMOVNTDQ instruction.
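
A two-case reproducer of the kind requested here might look like the sketch below. It is not taken from the bandwidth program. The names `fill_nt_sse2`, `fill_nt_avx`, and `mb_per_sec` are hypothetical, and the sketch assumes GCC or Clang on x86 (for the `target` attribute) and a 32-byte-aligned buffer whose size is a multiple of 32. It uses the compiler intrinsics that map to MOVNTDQ and VMOVNTDQ rather than hand-written assembly.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Case 1: 128-bit non-temporal stores (MOVNTDQ). */
__attribute__((target("sse2")))
static void fill_nt_sse2(void *dst, size_t bytes, int value)
{
    assert(((uintptr_t)dst & 15) == 0 && bytes % 16 == 0);
    __m128i v = _mm_set1_epi32(value);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(p + i, v);
    _mm_sfence(); /* make the streaming stores globally visible */
}

/* Case 2: 256-bit non-temporal stores (VMOVNTDQ on ymm registers). */
__attribute__((target("avx")))
static void fill_nt_avx(void *dst, size_t bytes, int value)
{
    assert(((uintptr_t)dst & 31) == 0 && bytes % 32 == 0);
    __m256i v = _mm256_set1_epi32(value);
    __m256i *p = (__m256i *)dst;
    for (size_t i = 0; i < bytes / 32; i++)
        _mm256_stream_si256(p + i, v);
    _mm_sfence();
    _mm256_zeroupper(); /* VZEROUPPER before returning to non-AVX code */
}

typedef void (*fill_fn)(void *, size_t, int);

/* Rough throughput in MB/s. clock() measures CPU time, which is only an
 * approximation of wall time; this is an assumption of the sketch. */
static double mb_per_sec(fill_fn fn, void *buf, size_t bytes, int reps)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        fn(buf, bytes, r);
    clock_t t1 = clock();
    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return secs > 0.0 ? ((double)bytes * reps) / (secs * 1e6) : 0.0;
}
```

Calling `mb_per_sec` once with each fill function on the same buffer (allocated, for example, with `_mm_malloc(bytes, 32)`) gives two numbers to compare. The AVX variant should only be called on a CPU and OS that support AVX.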
Zack_-_
Beginner

Don't you have access to a Linux-based machine at Intel? Are you someone who is responsible for the AVX part of the microcode? If not, who is? Who is doing QA for the AVX microcode? Surely someone at Intel is being paid to identify and investigate bugs like these, whereas I am not being paid ...
SergeyKostrov
Valued Contributor II

Please take into account that submitting a post to the IDZ forum does not guarantee that your problem will be solved or explained. However, you could try requesting help from Intel Premium Support. Best regards, Sergey
Zack_-_
Beginner

It's not my problem, it's Intel's problem -- it's apparently a microcode bug. Shouldn't you be concerned that I've identified a defect?
Xiancai_L_
Beginner

Yes, when I use AVX instructions on Sandy Bridge, it's very slow. Why? Thanks!

Bernard
Black Belt

Xiancai L. wrote:

Yes, when I use AVX instructions on Sandy Bridge, it's very slow. Why? Thanks!

Can you post a test case of your program?

Xiancai_L_
Beginner

Hi iliyapolak, I used bandwidth to test AVX instructions. I see that performance is terrible when they are mixed with SSE.

http://zsmith.co/bandwidth.html#download

Bernard
Black Belt

>>>Hi iliyapolak, I used bandwidth to test AVX instructions. I see that performance is terrible when they are mixed with SSE.>>>

Thanks for the link. The poor performance of AVX instructions intermixed with SSE is a well-known issue. Because the hardware must save and restore the upper halves of the YMM registers, each transition incurs a penalty of a few dozen cycles. 128-bit AVX instructions automatically zero the upper half of the YMM registers; this is not the case for legacy SSE instructions, because they have no "knowledge" of the wider 256-bit registers. You can use Intel SDE to detect AVX-to-SSE transition penalties.
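
As a hedged illustration of the usual way to avoid this transition penalty, the sketch below issues VZEROUPPER (via `_mm256_zeroupper()`) at the end of a 256-bit AVX routine before legacy SSE code runs. The function names `sum_avx` and `sum_sse` are hypothetical, and GCC or Clang on x86 is assumed. Compilers often insert VZEROUPPER automatically, so the explicit call matters most for hand-written assembly or for code built with that behavior disabled.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* 256-bit AVX code: dirties the upper halves of the YMM registers. */
__attribute__((target("avx")))
static float sum_avx(const float *x, size_t n)
{
    assert(n % 8 == 0);
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));

    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float s = 0.0f;
    for (int k = 0; k < 8; k++)
        s += lanes[k];

    /* VZEROUPPER: zero the upper YMM halves so that SSE code executed
     * afterwards does not trigger the state save/restore. */
    _mm256_zeroupper();
    return s;
}

/* 128-bit code; when built without AVX enabled this uses legacy SSE
 * encodings, which is the case where the transition penalty applies. */
static float sum_sse(const float *x, size_t n)
{
    assert(n % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));

    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

A caller can then run `sum_avx` followed by `sum_sse` on the same data without leaving dirty upper halves between the two.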

SergeyKostrov
Valued Contributor II

>>...I used bandwidth to test AVX instructions. I see that performance is terrible when they are mixed with SSE.

This is a problem with the Bandwidth software, because it should evaluate the performance of SSE-only code and AVX-only code without mixing them. Intel has clearly stated that there is a performance penalty when mixing SSE and AVX instructions. Take a look at a very good thread on this subject:

Forum Topic: AVX transition penalties and OS support
Web-link: software.intel.com/en-us/forums/topic/364851