It is known that switching from AVX instructions to SSE incurs a runtime penalty unless I use vzeroupper/vzeroall to clear the upper halves of the ymm registers before the switch. Am I correct in assuming that the cleanup is not needed if I only use the lower halves of the ymm registers in my AVX code (i.e. VEX-encoded SSE code)?
The application has both AVX and legacy SSE code, some of it in third-party libraries. Thanks for the link to the paper. From it, it looks like my assumption is correct.
I, for example, always use the Intel Software Development Emulator, as it is free.
You can get a report if any transition penalties occur. This is very helpful. VTune also gives you the code location that is responsible for them.
But you can get good results with the Intel Software Development Emulator and its Visual Studio 2012 integration. Then you can debug an application with Intel Software Development Emulator. See here: http://software.intel.com/en-us/articles/intel-software-development-emulator#DEBUG-WIN
Specify "-oast <filename.txt>" as a parameter for Intel SDE. After debugging you get a file containing the transition penalty information. I realized that if you use Intel SDE this way, you also get the name of the function that is responsible for the penalties.
I've verified on my SB and IB (Sandy Bridge and Ivy Bridge) machines that there is no penalty when transitioning between SSE and AVX, so long as you refrain from using 256-bit instructions. If you use a 256-bit instruction the penalty is ~150 cycles, if you don't "vzeroupper" beforehand.
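To make the three cases concrete, here is a minimal sketch in C intrinsics. The function names (add4_vex128, add8_avx256, add4_sse, add8) are made up for illustration and do not come from this thread. It assumes GCC or Clang on x86, since it relies on the target("avx") function attribute and __builtin_cpu_supports, and it assumes the file itself is built without -mavx so that add4_sse really is emitted as legacy SSE. Compilers may also insert vzeroupper on their own, so the explicit call only shows where the cleanup belongs.

```c
#include <immintrin.h>

/* 128-bit operations only, but compiled for AVX: the compiler emits
   VEX-encoded instructions (e.g. vaddps on xmm registers). This is the
   "VEX-encoded SSE code" case, where no 256-bit instruction is used. */
__attribute__((target("avx")))
void add4_vex128(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* 256-bit AVX: the upper halves of the ymm registers are in use here,
   so they are cleared before control can return to legacy SSE code. */
__attribute__((target("avx")))
void add8_avx256(const float *a, const float *b, float *out)
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb));
    _mm256_zeroupper(); /* emits vzeroupper */
}

/* Legacy SSE path, standing in for e.g. third-party library code.
   It is only legacy-encoded if this file is NOT compiled with -mavx. */
__attribute__((target("sse")))
void add4_sse(const float *a, const float *b, float *out)
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Hypothetical dispatcher: use the 256-bit path when the CPU has AVX,
   otherwise fall back to two legacy SSE additions. */
void add8(const float *a, const float *b, float *out)
{
    if (__builtin_cpu_supports("avx")) {
        add8_avx256(a, b, out);
    } else {
        add4_sse(a, b, out);
        add4_sse(a + 4, b + 4, out + 4);
    }
}
```

Calling add8 followed by add4_sse is then the mixed AVX and legacy SSE situation discussed here, with the vzeroupper already issued at the end of the 256-bit function.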
Sergey Kostrov wrote:
>>...If you use a 256-bit instruction the penalty is ~150 cycles, if you don't "vzeroupper"...
Thanks for that number; it looks like a real performance "killer". I wonder why these transitions take so many cycles. Isn't that a design issue with CPUs that support AVX?
Really, is it 150 cycles? I thought it would be 75 cycles. The only thing I noticed is that you can get a very bad combination of AVX and SSE where there is a transition penalty immediately before and after a certain instruction. For example, if you are in AVX code, have one SSE instruction in between, and repeat this in a loop. Then I get the 150 cycles as the combination of both transition penalties.
Nonetheless, the penalty is quite heavy for storing and restoring all the YMM registers and some CPU state connected to this issue.
// EDIT: I cannot test Intel SDE on Windows XP, as I switched to Windows 7 some time ago. The only thing I could test is Intel SDE on XP in VirtualBox, which I still use for some applications. I don't know whether running a CPU emulation tool under a virtual machine has a big impact. Please tell me if you want me to do this.
I did a basic check: VirtualBox with Windows XP Professional SP3, 32-bit.
Then I unpacked Intel SDE to a directory and ran an exe with AVX code from the command line. The results of the normal double code and the AVX code match for different calculations, so everything seems to be fine. The code was built with VS2010, since for the Intel Compiler or VS2012 I would have needed to install other runtime packages, which would have taken some more time.
But in my mind, it is a good thing that Intel SDE runs on XP in a virtualized environment.