Performance Issues on AVX Instructions

heiko77 · ‎11-25-2011

Hello,

the following code fragment shows up some strange behaviour that I couldn't find described in the docs.

#include
#include

int main() {
float tmp;
volatile __m256 b = _mm256_broadcast_ss(&tmp);

asm volatile ("nop"); // align jump target to 16 byte

for (int i=0;i<1000000000;i++) {
asm volatile("vpmuludq %xmm1, %xmm0, %xmm1");
}

}

This code runs in approx. 5.5 billion cycles.

If I comment out the broadcast, the codes runs in 4.5 billion cycles. It seems that the broadcast sets the saved flag in the avx status reg and the AVX instructions in the loop never clear this flag and all following AVX instructions are suffering a penalty.

If I add a "vzeroupper" instruction after the broadcast the code runs in expected time.

Do I have missed something out, or could this be a bug (in documentation or cpu)?

My two questions are:
1. Why is the "saved" flag set, without using any SSE instruction?
2. Why isn't that flag cleared after running the first AVX instruction?

See http://www.intel.com/content/www/my/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html page 512 for the penalty I'm reffering to.

This behavior is the same using the intel compiler and the gcc.

Best regards,
Heiko

heiko77 · ‎11-29-2011

Can anybody confirm this behaviour as a bug? Or is any further information needed? This issue is a real show stopper for us, so we really have to make some decisions...

Brijender_B_Intel · ‎11-29-2011

Can you please tell which compiler are you using? Also, did you try to look at the assembly generated for differnet configuration? it may tell you whether compiler is generating vzeroupper correctly in all cases or not?

heiko77 · ‎11-29-2011

I used intel compiler and gcc, both are generating the same assembly code without any vzeroupper. In my opinion thats correct, because I'm not using any SSE code. In my understanding vzeroupper is only for mixing SSE and AVX.

The biggest issue in my opinion is, that even if a vzeroupper would be needed, a single avx instruction should bring the same result (with a penalty) than vzeroupper. I'm running a billion AVX instructions and every one of these is delayed!

The test code is generated only to reproduce the error. We have a big simulation project here which has the same runtime if we replace SSE instructions with AVX instructions and let the mailoop only run half the time. In theorie we're expecting nearly a 50% speedup.

Maxym_D_Intel · ‎12-06-2011

AVX on SandyBridge is 256bits wide for float point computations
where for integer ones is still 128bits asper SSE

vzeroupper not much needed in your case but let me check the situation in details...

heiko77 · ‎12-07-2011

Thank you very much!

We ran the code on a Xeon E31245 with the reported results. On an core i5 there was no penalty. So maybe this is a bug in the xeon?

Maxym_D_Intel · ‎12-09-2011

Your code is not running into AVX<->SSE state problems as described on page 512 in the web article you reference.

Instead, you are running into a problem with bypass domains in the execute unit. If you look carefully at your code, you mix floating point (the broadcast) with integer (the vpmuludq). Since your main loop essentially measures the latency of the vpmuludq instruction, you expected to see 5 clocks. But because the XMM0 register was produced by a single precision floating point instruction (the broadcast), you see 5+1 clocks of latency each loop. This is the 20% penalty you observe. This is the reason that Intel recommends you match data types.

When you add the VZEROUPPER, you dont really fix the problem. Since your loop is very long, your code has an interaction with the OS state save/restore (e.g. after a timer interrupt). And since VZEROUPPER allows the machine to optimize state save instructions, this can alter the instruction that produces XMM0. (I.e. XMM0 might be written by an instruction in the OS state save, instead of by the broadcast). This is why the latency of your main loop changes.