Intel® ISA Extensions

Very slow performance of VMOVNTDQ instruction

Zack_-_
Beginner
1,170 Views

Hello,

I'm using VMOVNTDQ as an experiment in a benchmark. The idea was to use it to write to memory while bypassing the caches. This seems to work fine with the MOVNTDQ instruction, but with VMOVNTDQ the performance is very slow: around 300 MB/sec of bandwidth, versus 6000 MB/sec for MOVNTDQ.

Anyone know why VMOVNTDQ is sluggish?

Thanks.

11 Replies
SergeyKostrov
Valued Contributor II
Could you post a test-case, please?
Zack_-_
Beginner
To really visualize this problem, check out the program bandwidth 0.32k: http://zsmith.co/bandwidth.html#download

In main.c, at line 2551, change #if 0 to #if 1. This enables the use of VMOVNTDQ. Then run the program. The graph that it produces shows the performance issue: at chunk sizes of more than 512 bytes, VMOVNTDQ writes to memory very slowly.

Thanks.
SergeyKostrov
Valued Contributor II
I expected a small test case. Unfortunately, there are some dependencies, like: "...Lastly, to compile for 32-bit desktop Windows you need the GCC toolchain in the form of Cygwin..." and I simply don't have time to install all of that. I think you need to isolate your problem in a small reproducer with two cases: one for the MOVNTDQ instruction and one for the VMOVNTDQ instruction.
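A reproducer of the kind requested here could look roughly like the sketch below. It is not taken from the bandwidth program. It assumes GCC or Clang on x86-64, and the names `fill_movntdq`, `fill_vmovntdq`, and `mb_per_sec` are made up for illustration. The 128-bit path uses the `_mm_stream_si128` intrinsic (MOVNTDQ) and the 256-bit path uses `_mm256_stream_si256` (VMOVNTDQ with a YMM register).

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: fill buf with 128-bit non-temporal stores (MOVNTDQ).
   buf must be 16-byte aligned and n a multiple of 16. */
void fill_movntdq(void *buf, size_t n, int value)
{
    assert((uintptr_t)buf % 16 == 0 && n % 16 == 0);
    __m128i v = _mm_set1_epi32(value);
    __m128i *p = (__m128i *)buf;
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(p + i, v);
    _mm_sfence(); /* make the streaming stores globally visible */
}

/* Hypothetical helper: same fill with 256-bit non-temporal stores (VMOVNTDQ).
   buf must be 32-byte aligned and n a multiple of 32.
   The target attribute (GCC/Clang) enables AVX for this function only. */
__attribute__((target("avx")))
void fill_vmovntdq(void *buf, size_t n, int value)
{
    assert((uintptr_t)buf % 32 == 0 && n % 32 == 0);
    __m256i v = _mm256_set1_epi32(value);
    __m256i *p = (__m256i *)buf;
    for (size_t i = 0; i < n / 32; i++)
        _mm256_stream_si256(p + i, v);
    _mm_sfence();
    _mm256_zeroupper(); /* leave the upper YMM halves clean for later SSE code */
}

typedef void (*fill_fn)(void *, size_t, int);

/* Rough bandwidth estimate in MB/sec based on CPU time from clock().
   Use enough repetitions that the elapsed time is well above the
   clock resolution, otherwise the result is meaningless. */
double mb_per_sec(fill_fn fn, void *buf, size_t n, int reps)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        fn(buf, n, r);
    clock_t t1 = clock();
    double seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return ((double)n * reps) / (seconds * 1e6);
}
```

To compare the two cases, allocate a 32-byte-aligned buffer that is larger than the last-level cache (for example with `aligned_alloc`), call `mb_per_sec` once with each fill function, and only run the AVX case when `__builtin_cpu_supports("avx")` reports that the CPU has AVX.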
Zack_-_
Beginner
Don't you have access to a Linux-based machine at Intel? Are you someone who is responsible for the AVX part of the microcode? If not, who is? Who is doing QA for the AVX microcode? Surely someone at Intel is being paid to identify and investigate bugs like these, whereas I am not being paid ...
SergeyKostrov
Valued Contributor II
Please take into account that submitting a post to the IDZ forum does not guarantee that your problem will be solved or explained. However, you could try requesting help from Intel Premium Support.

Best regards, Sergey
Zack_-_
Beginner
It's not my problem, it's Intel's problem -- it's apparently a microcode bug. Shouldn't you be concerned that I've identified a defect?
Xiancai_L_
Beginner

Yes, when I use AVX instructions on Sandy Bridge, it's very slow. Why? Thanks!

Bernard
Valued Contributor I

Xiancai L. wrote:

Yes, when I use AVX instructions on Sandy Bridge, it's very slow. Why? Thanks!

Can you post a test case of your program?

Xiancai_L_
Beginner

Hi iliyapolak, I used bandwidth to test AVX instructions. I see that the performance is terrible when they are mixed with SSE.

http://zsmith.co/bandwidth.html#download

Bernard
Valued Contributor I

>>>Hi iliyapolak, I used bandwidth to test AVX instructions. I see that the performance is terrible when they are mixed with SSE.>>>

Thanks for the link. The poor performance of AVX instructions intermixed with SSE is a well-known issue. Because the hardware must save and restore the upper halves of the YMMn registers on each transition, it incurs a penalty of a few dozen cycles. A 128-bit AVX (VEX-encoded) instruction will automatically zero the upper half of the YMM register it writes. That is not the case with legacy SSE instructions, because they have no "knowledge" of the wider 256-bit registers. You can use Intel SDE to detect AVX-to-SSE transition penalties.
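The usual way to avoid the transition described above is to execute VZEROUPPER after a block of 256-bit AVX code and before any legacy SSE code runs. Here is a minimal sketch, assuming GCC or Clang on x86-64. The function names `sum_avx` and `scale_sse` are hypothetical and are not part of the bandwidth program. Compilers often insert VZEROUPPER on their own when they generate AVX code, so the explicit `_mm256_zeroupper()` call mainly matters for hand-written assembly or when that behavior is disabled.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical 256-bit AVX routine: sums n floats (n a multiple of 8).
   It ends with VZEROUPPER so that the upper YMM halves are clean
   before the caller goes on to run legacy SSE code. */
__attribute__((target("avx")))
float sum_avx(const float *x, size_t n)
{
    assert(n % 8 == 0);
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));

    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float total = 0.0f;
    for (int k = 0; k < 8; k++)
        total += lanes[k];

    _mm256_zeroupper(); /* VZEROUPPER: marks the upper halves as clean */
    return total;
}

/* Hypothetical 128-bit SSE routine: scales n floats in place (n a multiple of 4).
   Built without AVX enabled, this compiles to non-VEX SSE instructions. */
void scale_sse(float *x, size_t n, float factor)
{
    assert(n % 4 == 0);
    __m128 f = _mm_set1_ps(factor);
    for (size_t i = 0; i < n; i += 4)
        _mm_storeu_ps(x + i, _mm_mul_ps(_mm_loadu_ps(x + i), f));
}
```

A caller can then alternate `sum_avx` and `scale_sse` freely, since every return from the AVX routine leaves the upper register halves zeroed. As in any code that selects AVX per function, check `__builtin_cpu_supports("avx")` before calling `sum_avx`.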

SergeyKostrov
Valued Contributor II
>>...I used bandwidth to test AVX instructions. I see that the performance is terrible when they are mixed with SSE.

This is a problem with the Bandwidth software, because it should evaluate the performance of SSE-only code and AVX-only code without mixing them. Intel has clearly stated that there is a performance penalty when mixing SSE and AVX instructions. Take a look at a very good thread on that subject:

Forum Topic: AVX transition penalties and OS support
Web-link: software.intel.com/en-us/forums/topic/364851