Very slow performance of VMOVNTDQ instruction

Zack_-_ · 11-04-2012

Hello,

I'm using VMOVNTDQ as an experiment in a benchmark. The idea was to use it to write to memory while bypassing the caches. This works fine with the MOVNTDQ instruction, but with VMOVNTDQ the performance is very slow: around 300 MB/sec of bandwidth, versus 6000 MB/sec for MOVNTDQ.

Anyone know why VMOVNTDQ is sluggish?

Thanks.

SergeyKostrov · 11-07-2012

Could you post a test-case, please?

Zack_-_ · 11-08-2012

To really visualize this problem, check out the program bandwidth 0.32k:

http://zsmith.co/bandwidth.html#download

In main.c, at line 2551, change #if 0 to #if 1. This will enable the use of VMOVNTDQ. Then run the program. The graph that it produces shows the performance issue. At chunk sizes of more than 512 bytes, VMOVNTDQ writes to memory very slowly. Thanks.

SergeyKostrov · 11-08-2012

I expected a small test case. Unfortunately, there are some dependencies, such as: "...Lastly, to compile for 32-bit desktop Windows you need the GCC toolchain in the form of Cygwin..." and I simply don't have time to install all of that. I think you need to isolate your problem in a small reproducer with two cases: one for the MOVNTDQ instruction and one for the VMOVNTDQ instruction.
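
A reproducer along those lines might look like the following sketch. It is not taken from the bandwidth program; the function names (fill_nt128, fill_nt256, mb_per_sec) are hypothetical, and it assumes GCC or Clang on x86-64, where the target("avx") attribute lets a single function use AVX without building the whole file with -mavx.

```c
#include <assert.h>
#include <immintrin.h>  /* SSE2/AVX intrinsics, _mm_malloc/_mm_free */
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Fill n bytes using 128-bit non-temporal stores (MOVNTDQ when the file
 * is built without AVX). dst must be 16-byte aligned, n a multiple of 16. */
static void fill_nt128(void *dst, size_t n, int value)
{
    assert((uintptr_t)dst % 16 == 0 && n % 16 == 0);
    __m128i v = _mm_set1_epi32(value);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(p + i, v);
    _mm_sfence();  /* order the streaming stores before returning */
}

/* Fill n bytes using 256-bit non-temporal stores (VMOVNTDQ on a YMM
 * register). dst must be 32-byte aligned, n a multiple of 32. */
__attribute__((target("avx")))
static void fill_nt256(void *dst, size_t n, int value)
{
    assert((uintptr_t)dst % 32 == 0 && n % 32 == 0);
    __m256i v = _mm256_set1_epi32(value);
    __m256i *p = (__m256i *)dst;
    for (size_t i = 0; i < n / 32; i++)
        _mm256_stream_si256(p + i, v);
    _mm_sfence();
}

typedef void (*fill_fn)(void *, size_t, int);

/* Hypothetical timing helper: returns MB/s over reps passes, measured
 * with clock() (CPU time); returns 0.0 if the run was too short to time. */
static double mb_per_sec(fill_fn fill, void *buf, size_t n, int reps)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        fill(buf, n, r);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    return secs > 0.0 ? ((double)n * reps / 1e6) / secs : 0.0;
}
```

To use it, allocate a buffer much larger than the last-level cache with _mm_malloc(n, 32), check __builtin_cpu_supports("avx") before calling fill_nt256, and compare mb_per_sec(fill_nt128, ...) against mb_per_sec(fill_nt256, ...). Keeping the two cases in separate functions means each timing loop runs only one kind of store.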

Zack_-_ · 11-08-2012

Don't you have access to a Linux-based machine at Intel? Are you someone who is responsible for the AVX part of the microcode? If not, who is? Who is doing QA for the AVX microcode? Surely someone at Intel is being paid to identify and investigate bugs like these, whereas I am not being paid ...

SergeyKostrov · 11-08-2012

Please take into account that submitting a post to the IDZ forum does not guarantee that your problem will be solved or explained. However, you could try requesting help from Intel Premium support.

Best regards,
Sergey

Zack_-_ · 11-08-2012

It's not my problem, it's Intel's problem -- it's apparently a microcode bug. Shouldn't you be concerned that I've identified a defect?

Xiancai_L_ · 02-22-2013

Yes, when I use AVX instructions on Sandy Bridge, it's very slow. Why? Thanks!

Bernard · 02-23-2013

Xiancai L. wrote:

Yes, when I use AVX instructions on Sandy Bridge, it's very slow. Why? Thanks!

Can you post a test case of your program?

Xiancai_L_ · 03-03-2013

Hi iliyapolak, I used bandwidth to test the AVX instructions. I see that performance is terrible when they are mixed with SSE.

http://zsmith.co/bandwidth.html#download

Bernard · 03-04-2013

>>>Hi iliyapolak, I used bandwidth to test the AVX instructions. I see that performance is terrible when they are mixed with SSE.>>>

Thanks for the link. The poor performance of AVX instructions intermixed with SSE is a well-known issue. Because the hardware must save and restore the upper halves of the YMMn registers, each transition incurs a penalty of a few dozen cycles. 128-bit AVX instructions automatically zero the upper half of the YMM registers; legacy SSE instructions do not, because they have no "knowledge" of the wider 256-bit registers. You can use Intel SDE to detect AVX-to-SSE transition penalties.
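
One common way to avoid the transition penalty is to execute VZEROUPPER after a block of 256-bit code and before any legacy SSE code runs, so that the upper halves of the YMM registers are known to be clean. The following is a minimal sketch of that pattern, assuming GCC or Clang on x86-64; the functions sum_avx and scale_sse are hypothetical examples and are not part of the bandwidth program.

```c
#include <immintrin.h>
#include <stddef.h>

/* Sum n floats using 256-bit AVX adds, then clear the upper halves of
 * the YMM registers so that later legacy SSE code pays no penalty. */
__attribute__((target("avx")))
static float sum_avx(const float *x, size_t n)
{
    size_t i = 0;
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));

    /* Fold the two 128-bit halves, then the four remaining lanes. */
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    float lanes[4];
    _mm_storeu_ps(lanes, s);
    float total = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    for (; i < n; i++)  /* scalar tail for leftover elements */
        total += x[i];

    _mm256_zeroupper();  /* emits VZEROUPPER */
    return total;
}

/* Multiply n floats by k in place. Built without AVX enabled, these
 * intrinsics compile to legacy (non-VEX) SSE instructions. */
static void scale_sse(float *x, size_t n, float k)
{
    size_t i = 0;
    __m128 vk = _mm_set1_ps(k);
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(x + i, _mm_mul_ps(_mm_loadu_ps(x + i), vk));
    for (; i < n; i++)
        x[i] *= k;
}
```

Compilers usually insert VZEROUPPER on their own at the end of functions that use 256-bit registers, so the explicit call matters most around hand-written assembly or calls into separately built SSE libraries.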

SergeyKostrov · 03-05-2013

>>...I used bandwidth to test the AVX instructions. I see that performance is terrible when they are mixed with SSE.

This is a problem with the Bandwidth software: it should evaluate the performance of SSE-only code and AVX-only code without mixing them. Intel has clearly stated that there is a performance penalty when mixing SSE and AVX instructions. Take a look at a very good thread on that subject:

Forum Topic: AVX transition penalties and OS support
Web-link: software.intel.com/en-us/forums/topic/364851