- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
the following code fragment shows up some strange behaviour that I couldn't find described in the docs.
#includeThis code runs in approx. 5.5 billion cycles.
#include
int main() {
float tmp;
volatile __m256 b = _mm256_broadcast_ss(&tmp);
asm volatile ("nop"); // align jump target to 16 byte
for (int i=0;i<1000000000;i++) {
asm volatile("vpmuludq %xmm1, %xmm0, %xmm1");
}
}
If I comment out the broadcast, the codes runs in 4.5 billion cycles. It seems that the broadcast sets the saved flag in the avx status reg and the AVX instructions in the loop never clear this flag and all following AVX instructions are suffering a penalty.
If I add a "vzeroupper" instruction after the broadcast the code runs in expected time.
Do I have missed something out, or could this be a bug (in documentation or cpu)?
My two questions are:
1. Why is the "saved" flag set, without using any SSE instruction?
2. Why isn't that flag cleared after running the first AVX instruction?
See http://www.intel.com/content/www/my/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html page 512 for the penalty I'm reffering to.
This behavior is the same using the intel compiler and the gcc.
Best regards,
Heiko
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The biggest issue in my opinion is, that even if a vzeroupper would be needed, a single avx instruction should bring the same result (with a penalty) than vzeroupper. I'm running a billion AVX instructions and every one of these is delayed!
The test code is generated only to reproduce the error. We have a big simulation project here which has the same runtime if we replace SSE instructions with AVX instructions and let the mailoop only run half the time. In theorie we're expecting nearly a 50% speedup.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
AVX on SandyBridge is 256bits wide for float point computations
where for integer ones is still 128bits asper SSE
vzeroupper not much needed in your case but let me check the situation in details...
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
We ran the code on a Xeon E31245 with the reported results. On an core i5 there was no penalty. So maybe this is a bug in the xeon?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Your code is not running into AVX<->SSE state problems as described on page 512 in the web article you reference.
Instead, you are running into a problem with bypass domains in the execute unit. If you look carefully at your code, you mix floating point (the broadcast) with integer (the vpmuludq). Since your main loop essentially measures the latency of the vpmuludq instruction, you expected to see 5 clocks. But because the XMM0 register was produced by a single precision floating point instruction (the broadcast), you see 5+1 clocks of latency each loop. This is the 20% penalty you observe. This is the reason that Intel recommends you match data types.
When you add the VZEROUPPER, you dont really fix the problem. Since your loop is very long, your code has an interaction with the OS state save/restore (e.g. after a timer interrupt). And since VZEROUPPER allows the machine to optimize state save instructions, this can alter the instruction that produces XMM0. (I.e. XMM0 might be written by an instruction in the OS state save, instead of by the broadcast). This is why the latency of your main loop changes.
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page