I'm using VMOVNTDQ as an experiment in a benchmark. The idea was to use it to write to memory bypassing the caches. This seems to work fine with the MOVNTDQ instruction but with VMOVNTDQ the performance is very slow, around 300 MB /sec bandwidth, versus 6000 MB/sec for MOVNTDQ.
Anyone know why VMOVNTDQ is sluggish?
>>>Hi iliyapolak,I used bandwidth to test AVX instruction.I see the performance will terrible when mix used with SSE>>>
Thanks for the link.Regarding the poor performance of the AVX instruction intermixed with the SSE it is well known issue.Because the hardware must save and restore upper context of YMMn register it will incur apenalty of few dozens of cycles.AVX 128-bit instruction with automatically zero the upper half of YMM registers it is not the case when you use legacy SSE instruction because they do not have a "knowledge" of wider 256-bit registers.You can use Intel SDE to detect an penalty of AVX-to-SSE transition.