The solutions to this problem seen hitherto are the following:
The announced extension from 128-bit XMM to 256-bit YMM will use a combination of the above methods, according to the preliminary info published by Intel (http://softwareprojects.intel.com/avx/). To recap the documentation, all instructions that write to an XMM register will have two versions: A legacy version that modifies the lower half of the 256-bit register and leaves the upper part unchanged, and a new version of the same instruction with a VEX prefix that zeroes the upper half of the register. So the VEX version of a 128-bit instruction uses method (5) above. It is not clear whether the legacy version of 128-bit instructions will use method (2), (3) or (4). A new instruction VZEROUPPER clears the upper half of all the YMM registers, according to method (6).
Now, I wonder if we really need the complexity of having two versions of all 128-bit instructions. The possibility of writing to the lower half of a YMM register and leaving the upper half unchanged is needed only in the following scenario: A function using a full YMM register calls a legacy function which is unaware of the YMM extension but saves the corresponding XMM register before using it, and restores the value before returning. The calling function can then rely on the full YMM register being unchanged.
However, this scenario is only relevant if the legacy function saves and restores the XMM register, and this happens only in 64-bit Windows. The ABI for 64-bit Windows specifies that registers XMM6 - XMM15 have callee-save status, i.e. these registers must be saved and restored if they are used. All other x86 operating systems (32-bit Windows; 32- and 64-bit Linux, BSD and Mac) have no XMM registers with callee-save status. So this discussion is relevant only to 64-bit Windows. There can be no problem in any other operating system because there are no legacy functions that save these registers anyway.
The design of the AVX instruction set allows a possible amendment to the ABI for 64-bit Windows, specifying that YMM6 - YMM15 should have callee-save status. The advantage of callee-save registers is that local variables can be saved in registers rather than in memory across a call to a library function.
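As a sketch of what that advantage would look like (hypothetical code, assuming the amended ABI where YMM6 is callee-save; `libfunc`, `x` and `y` are made-up names):

```asm
; Hypothetical: YMM6 has callee-save status under the amended Win64 ABI.
; A live value can then stay in YMM6 across the library call instead of
; being spilled to memory and reloaded.
vmovaps ymm6, [x]          ; load working value
call    libfunc            ; callee must preserve ymm6 if it uses it
vaddps  ymm6, ymm6, [y]    ; ymm6 still valid: no spill/reload needed
```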
The disadvantage of this hypothetical specification of callee-save status for YMM6 - YMM15 in a future Windows 64 ABI is that we get a penalty for reading a full YMM register after saving and restoring the partial register. The cost of this is unknown as long as it has not been revealed whether method (2), (3) or (4) will be used. I assume, however, that the penalty will not be insignificant, because Intel's designers wouldn't have defined the VZEROUPPER instruction and recommended its use unless there is some situation where the penalty of partial register access is higher than the cost of zeroing the upper half of all sixteen YMM registers. But if VZEROUPPER is used for reducing the penalty of partial register access, then we have destroyed the advantage of callee-save status, because all the YMM registers are destroyed anyway. This is a catch-22 situation! If there is a significant penalty for partial register access then there is no point in giving the YMM registers callee-save status. If VZEROUPPER uses 16 micro-ops then I can't imagine any situation where it saves time. Either VZEROUPPER is very fast, or the penalty for partial register access is very high, or the use of VZEROUPPER is never advantageous. Can somebody please clarify?
So if my assumptions are correct, then the advantage of having two different versions of all 128-bit instructions is minimal at best. Now, let's look at the disadvantages:
If we have two versions of every library function then we don't have to care about YMM registers being saved across a call to a legacy library function, because the compiler will insert a call to the VEX version of the function, which can save and restore the full YMM registers if required by the ABI.
It would be nice to have some indication of whether the penalty for mixing VEX and non-VEX XMM instructions is so high that we need separate VEX and non-VEX versions of all library functions. It would also be nice to know if there are any situations where the partial register penalty is higher than the time it takes to execute the VZEROUPPER instruction.
The solution of having two versions of all XMM instructions looks to me like a shortsighted patch in an otherwise well designed and future-oriented ISA extension. The problem will appear again in all future extensions of the size of the vector registers. How do you plan to solve the problem next time the register size is increased? Will we have two versions of every YMM instruction when ZMM is introduced, and two versions of every ZMM instruction when.... This would be a waste of the few unused bits that are left in the VEX prefix.
It is probably too late to change the AVX spec now, although it looks to me like a draft, published prematurely as a response to AMD's SSE5 stunt?
Hi Professor Fog,
Thank you for your detailed and insightful comments.
Among the important boundary conditions to add to your list is that some of today's drivers use legacy SSE. These are being reached via interrupt, and the existing drivers can't save the upper part of the live YMM registers. Theoretically a new OS could take care of this for the ISR, but we didn't want to penalize users of the legacy architecture and could not mandate a major OS re-write.
Over a decade ago, when we defined the legacy SSE instructions, we had no vision of the way we were going to extend the vector length. The new Intel AVX 128 instructions are defined as zeroing the upper part of the register (bit 128 to infinity), and we have new state management (XSAVE/XRSTOR) that can manage state in a forward compatible manner. So if and when we extend the vector size further, the magnitude of this issue should be much smaller (driver writers, you have been warned :-)).
So in brief, our solution was:
1) Legacy and new code can intermix; the first transition from 256b usage to legacy 128b will have a transition penalty (in the Sandy Bridge implementation it will cost several tens of cycles depending on what's in the pipe; in extreme cases it can be much longer). The HW will take care to save the upper parts and later restore them (with a similar performance penalty) when returning to 256b code.
2) The compiler will map C or 128b intrinsics to either the legacy 128b or the new zeroing-upper 128b forms based on a switch. A programmer could also write in assembly, and they need to keep the above in mind. Inline assembly can also be encoded to the new zeroing-upper 128b forms based on a switch. The expected result of this is that mixing legacy SSE and new 256b instructions would mostly happen at or above the function level.
3) The vZeroUpper instruction is cheap, and the recommended ABI is to save the live upper part of the YMM registers and issue vZeroUpper prior to calling an external function. BTW, it will be a good programming practice to "deallocate" the upper parts (using vZeroUpper) or the whole registers (using vZeroAll) when finishing a part of the program that uses these registers. (Optimization recommendation: this will help free up resources in the OOO machine and reduce time spent in task switches.) The result of this will be that the only cost of this scheme is the small overhead of vZeroUpper.
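The recommended pattern at the end of a 256-bit block might look like this (a minimal sketch in assembly; the store target `result` is a made-up name):

```asm
vaddps  ymm0, ymm1, ymm2     ; ...256-bit computation...
vmovaps [result], ymm0       ; store the final result
vzeroupper                   ; "deallocate" the upper halves before returning
ret                          ; subsequent legacy SSE code pays no penalty
```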
4) We will add performance events that count the two transitions from #1 above to help with debugging any SSE->256b transitions.
To help understand the decision process a bit better, allow me to provide some historical context. Over the past several years we evaluated all the options you have above, but particularly options 1 and 5 (and of course the current option...). Options involving partial dependencies or partial state management were difficult to make fast and/or power efficient. It also did not seem right that the new state should penalize all users of the legacy architecture.
Your "option 1" is in a couple of ways a desirable architecture. It avoids any concern of legacy code perturbing your new state. On the other hand, there are advantages to being able to interoperate cleanly with the current XMM register state: it is entrenched in the ABIs (as you note, XMM is used for argument passing), and we wanted to bring the encoding advantages of the NDS and new operations to scalar and short vector code (scalar dominates most FP code today). The big problem is that doubling the name space for what is, in effect, the same functionality is inefficient both inside the processor and in the state save image.
Your "option 5" (the legacy instructions zero the new upper state) is really attractive. The real issue here is that popular operating systems do not preserve all state on switching to an interrupt service routine (ISR). A surprising number of Ring-0 components active during ISRs use XMMs - if only to move or initialize data. There are only two solutions to this problem: the OS must save all the new state, or the drivers must be rewritten. There is no way we could compel the industry to rewrite/fix all of their existing drivers (for example to use XSAVE) and no way to guarantee they would have done so successfully. Consider, for example, the pain the industry is still going through on the transition from 32 to 64-bit operating systems! The feedback we have from OS vendors also precluded adding overhead to ISR servicing to do the state management on every interrupt. We didn't want to inflict either of these costs on portions of the industry that don't even typically use wide vectors. Architecturally, therefore, we had to prevent legacy drivers from having side effects on the new (upper) state; this means legacy writes had to merge with the upper state. New drivers should be aware of how to use XSAVE to manage AVX state in a forward compatible manner so as not to break apps. So for the AVX-prefixed (short vector or scalar) instructions we allow a zeroing behavior.
What we decided to do was to optimize for the common scenario of uniform blocks of 128-bit (or all scalar) code separated from uniform blocks of 256-bit code. We maintain an internal record of when we transition between states where the upper bits contain something nonzero and a point where the state is guaranteed to be zero. We give you a fast (1* cycle throughput) way to the second state, VZEROUPPER (though VZEROALL, XRSTOR and reboot also work). Once you're in that state of zeroed-upperness, you can execute 128-bit code (or scalar) - VEX prefixed or not - and you pay no transition penalty. You can also transition back to executing 256-bit instructions with no penalty. You can transition freely between any VEXed instruction of any width and pay no penalty. The downside is that if you try to move from 256-bit instructions to legacy 128-bit instructions without that VZEROUPPER, you're going to pay. The way we chose to make you pay is optimized for the common use of long blocks of SSE code: you pay once during the transition to legacy 128-bit code instead of on every instruction. We do it by copying the upper 128 bits of all 16 registers to a special scratchpad, and this copying takes time (something like 50 cycles - still TBD). Then the legacy SSE code can operate as long as it wants with no penalty. When you transition back to a VEX-prefixed instruction, you have to pay the penalty again to restore state from that scratchpad.
The solution for this problem is for software to use VZEROUPPER prior to leaving your basic block of 256-bit code. Use it prior to calling any (ABI compliant) function, and prior to any blind return jump.
We've also expanded the ABI in a backwards compatible way - functions can declare and pass 256-bit arguments if so declared, but >>new state is caller save<<. Making ymm6-15 callee save on Windows-64 would require changes in several runtime libraries related to exception handling and stack unwinding, making the changes incompatible with the current libraries. Having all new state as caller-save is slightly inefficient, but the gain due to full backward compatibility outweighs the loss of performance. For full details on the ABI extensions for each OS, see the Spring Intel Developer Forum AVX presentation (https://intel.wingateweb.com/SHchina/published/NGMS002/SP_NGMS002_100r_eng.pdf). For code that doesn't vectorize, a mixture of scalar (or 128-bit vector) AVX and legacy 128-bit instructions is completely painless so long as the upper state is zero beforehand - I would rely on compliant callers to issue VZEROUPPER, or if paranoid you could zero yourself prior to executing the function body.
So back to your scenario: A function using 256-bit AVX instructions cannot assume the callee will not modify the high bits in the YMMs. Bits 128-255 of the YMMs are caller save. The caller must also issue a VZEROUPPER in case the callee hasn't been ported to use AVX. There's always a tradeoff in caller/callee save, and the benefit here is that a callee that doesn't want to use 256-bit instructions doesn't have to worry about preserving this new state.
> Now, I wonder if we really need the complexity of having two versions of all 128-bit instructions
The non-destructive source and the compact encoding are >major< performance features that apply to scalar and short vector forms. Indeed, I expect the performance upside of these on general purpose, compiled code (that doesn't always vectorize so well) to match the upside of the wider vector width. New operations like broadcast and true masked loads and stores and the 4th operand (for the new 3-source instructions) are only available under the VEX prefix. You can freely intermix 128-bit VEX instructions with legacy 128-bit instructions. And you can intermix blocks of 256-bit instructions with legacy 128-bit instructions so long as you follow two rules: use VZEROUPPER prior to leaving a basic block containing 256-bit and adhere to caller-save ABI semantics on new state.
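To illustrate that encoding advantage for scalar and short-vector code: the legacy destructive form often needs an extra register copy that the VEX non-destructive form avoids (a sketch; register choices are arbitrary):

```asm
; Legacy SSE: the destination is also a source, so preserving xmm0
; costs an extra MOVAPS.
movaps xmm2, xmm0
addps  xmm2, xmm1        ; xmm2 = xmm0 + xmm1

; AVX 128-bit (VEX, non-destructive source): one instruction, xmm0 intact.
vaddps xmm2, xmm0, xmm1
```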
> There will be a penalty for mixing the legacy XMM instructions using partial register writes with any of the full YMM instructions. Is there a penalty only when reading a full register after writing to the partial register, or are there other situations where mixing instructions with and without VEX causes delays?
Without VZEROUPPER, there will be a penalty when moving from 256-bit instructions to legacy 128-bit instructions, and another penalty when moving from 128-bit instructions back to 256-bit instructions. Consider this code:
VADDPS ymm0, ymm1, ymm2
ADDSS xmm3, xmm4
VSUBPS ymm0, ymm0, ymm2
Here we would have a penalty on the ADDSS and a second penalty on the VSUBPS. (In this case it would be best just to use the VEX form of ADDSS).
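For completeness, the penalty-free version of that sequence simply uses the VEX form throughout (sketch):

```asm
vaddps ymm0, ymm1, ymm2
vaddss xmm3, xmm3, xmm4    ; VEX form: no 256b -> legacy transition
vsubps ymm0, ymm0, ymm2    ; and no second penalty on the way back
```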
> Compilers will need a switch for compiling 128-bit XMM instructions with or without VEX prefix. Software developers will have problems avoiding the penalty of mixing code with and without VEX prefixes.
Absolutely correct on the compiler switch (the compiler has a switch to select between AVX and legacy SSE generation; on the Intel compiler it's QxG). But since the compilers (at least for high level languages) will generate VZEROUPPER and caller-save semantics on new state whenever they succeed in autovectorizing, there won't be penalties. There are still transition penalties that happen due to asynchronous events (such as ISRs that use XMMs), but these are not common enough to be a significant performance concern.
The tools will not generate VZEROUPPER on intrinsics or assembly - here, we give the developer the full right/control to shoot their performance in the foot. It is a particular concern of mine that people writing intrinsics be aware they still have to use VZEROUPPER!
> It will be very hard for programmers using vector intrinsics to avoid mixing the different kinds of instructions.
I hope it will not be too hard, but it is a concern. I've ported a number of applications with both an internal version of the Intel compiler - with and without intrinsic and assembly versions of some hotspots - and the transitions haven't shown up as a noticeable contribution. Autovectorizing compilers like the Intel compiler don't have any problems. In the Intel compiler, when you compile with QxG, all your intrinsics (128-bit or 256-bit) get a VEX prefix. The only diligence required on the part of the programmer is to ensure that you have a VZEROUPPER prior to leaving that block of intrinsics. We have a tool used internally (the Intel software development emulator) that identifies any transitions. We also plan to have a performance monitoring counter in the hardware to allow tools like Intel VTune to show transition penalties.
> Do you expect all function libraries using XMM registers to have two versions of every function: a legacy version for backwards compatibility, and a version with VEX prefixes on all XMM instructions for calling from procedures that use YMM registers?
What we wouldn't do is have lots of library versions just to work around the lack of a callee-save ABI. We would have (and do today) support multiple versions of our performance libraries optimized for different processors. But the 80/20 rule applies: most library functions are not critical performance bottlenecks, and the legacy (non-VEXed) implementations work there just fine - with no transition penalties to apps that use 256-bit AVX. In the long run, of course, there are no advantages to the non-VEX forms and we would like to move the industry in the direction of AVX.
> The solution of having two versions of all XMM instructions looks to me like a shortsighted patch in an otherwise well designed and future-oriented ISA extension
> The problem will appear again in all future extensions of the size of the vector registers. How do you plan to solve the problem next time the register size is increased? Will we have two versions of every YMM instruction when ZMM is introduced, and two versions of every ZMM instruction when.... This would be a waste of the few unused bits that are left in the VEX prefix.
I have a much stronger feeling that we have reached an excellent compromise. There are significant benefits to the non-destructive source and new instruction forms that apply to scalar and short vector operations - and if you don't want/need to port, we interoperate at very high performance with the legacy instructions. Software and tools developers have very few rules to avoid performance overhead: 1) when you use 256-bit forms of instructions, make sure you use VZEROUPPER when leaving the block, and 2) realize that the ABI remains caller save for new state. And of course - don't intentionally write code like the example above :-)
I'm glad you raised the future of ZMM. If/when we adopt the next natural extension (to 512 bits or whatever), we would face the same problems with 'legacy 256-bit' so long as the asynchronous parts of the system are not saving state in a forward-looking way, or the OS does not decide to step in and manage state on ISRs. So this is a call to the industry planning to adopt AVX in their interrupt service routines: use XSAVE. It's there to guarantee you can save both today's state and any future state.
> It is probably too late to change the AVX spec now, although it looks to me like a draft, published prematurely as a response to AMD's SSE5 stunt?
Let's keep having the dialogue, but I believe software apps will benefit greatly from the present architecture (and microarchitecture). As we did with the Nehalem instructions, we believe it's best for the software ecosystem to have lead time to prepare the OS, compilers, tools, and apps early. The AVX spec is not a draft :)
Outstanding questions again; my attempt to respond to the technical ones:
1) Comments regarding risk of using YMM at ring-0 are correct. The example given is similar to the current scenario where a ring-0 driver (ISR) vendor attempts to use floating-point state, or accidentally links it in some library, in OSs that do not automatically manage that context at Ring-0. This is a well known source of bugs and I can suggest only the following:
a. On those OSs, driver developers are discouraged from using floating-point or AVX
b. Driver developers should be encouraged to disable hardware features during driver validation (i.e. AVX state can be disabled by drivers in Ring-0 through XSETBV).
2) Correct on the state table. That's a nice way of representing these transitions. Also correct on the transition cost except that c->a is not fast. The transitions between b->c and c->b are indeed serializing with respect to other instructions that modify this state and transition penalty varies (on the first implementation) according to how much is in flight. Instructions on nontaken (speculative) paths do not cause transition penalties. The need to protect the scratchpad from speculation means c->a is slow.
(Bob Valentine, our front end architect, provided me the following using your notation):
from \ to     a      b      c
    a        fast   fast    ---
    b        fast   fast   slow
    c        slow   slow   fast
3) > Compilers should put a VEX prefix on all intrinsic functions if compiling for AVX.
Correct, although #pragmas exist that allow the user to specify how a block of intrinsics should be compiled.
> Compilers should put VZEROUPPER before and after any call to legacy functions if compiling for AVX
Compilers should put VZEROUPPER before any call to an ABI-compliant function and prior to leaving the basic block. It's not necessary to put VZEROUPPER after the ABI function; analysis may optimize this use. I would prefer not to phrase it in terms of "legacy functions", since it's not generally possible for a compiler to know that, and even new functions or libraries might use legacy SSE instructions.
4) >How can the compiler know if a library function uses VEX or not unless we have two versions of every library function?
We intend no distinction between the legacy and current ABI: whatever extensions are there are caller save or (for the special case of 256-bit arguments) represented in the signature. Not only should the compiler not have to know if a function has legacy SSE, we discourage it, as the callee is free to make whatever mixed use of legacy SSE and AVX they wish (for example, to optimize in case only scalar code is generated). As a result, the caller must save all upper state prior to entering any ABI compliant function.
For example: call function void libfoo(void) where YMM3 and YMM8 are live.
// save live state
VMOVAPS [rsp+off], YMM3 // xmm3 is volatile, hence the entire YMM3 is volatile
VEXTRACTF128 [rsp+off+32], YMM8, 1 // xmm8 is nonvolatile, but the upper bits are volatile. This is a fast instruction!
VZEROUPPER // mandatory since we have nonzero YMM upper state
CALL libfoo
// restore live state
VMOVAPS YMM3, [rsp+off]
VINSERTF128 YMM8, YMM8, [rsp+off+32], 1
Note that I explicitly discount non-ABI compliant calling conventions (e.g. fastcall) or interprocedural optimizations. Any compiler is free to innovate in whatever way they want in that respect, if they can convince the developer to use those features.
5) > Early publication and public discussion of proposed ISA changes is the best way to make the changes less painful. Please continue this policy.
Thank you and I appreciate your feedback!
Agner's questions are indeed something to think about. Too bad he isn't in the decision process.
Mark, all this looks awfully complicated for me, and I am experienced with asm/intrinsic code optimization. What do you think other less experienced developers who will be forced to deal with this mess will say, or even worse do?
There is also another valid question — Intel compiler will take care of those things but we all know that Windows n+1 will be compiled using Microsoft compiler and Linux n+1 will be compiled using gcc. How well do you think they will handle all this?
I haven't given AVX much thought so far given it is still a distant future for me, but after reading all this my opinion is that the software development will take a step backwards with it.
You have penalized developers by sheer design complexity, instead of penalizing those users who will be sticking to old hardware with broken drivers in 2010. That is just silly and I am disappointed.
It would be cheaper and more efficient in the long term if Intel just bought out those few companies producing bad drivers and fixed them, not to mention that the only affected OS is hot-patchable.
(bows politely in front of the sensei) You are welcome Mr. Fog!
But let me get this straight — I never said AVX doesn't have any good parts. I like three operand syntax and wider registers and some new instructions (although gather and scatter are still missing!). I just think that VZEROUPPER is EMMS all over again. We didn't really need that.
In my opinion this is still a hack — instead of doing it this way (supposedly to keep compatibility) it would have been better if OS vendors were forced to save/restore full state on context switch at all priority levels, and if the silicon wasted on implementing two versions of each instruction, fast VZEROUPPER and register scratchpad was used to dramatically reduce the cost of a full state save/restore on those context switches.
If Intel has already come out with all this information about AVX way ahead of the actual hardware, they could have as well insisted on doing it right this time — they should have asked Microsoft and hardware ISVs to fix their existing OS and drivers, or at least make the future OS and drivers right and accept that some users will have to upgrade their OS or a piece of hardware (which by the way is not uncommon when upgrading a CPU) if they want to enjoy the full benefits of AVX.
This looks to me like Intel engineers have designed an aeroplane which you can use to drive on a road, but the road must be compatible or else you have to get out, dismount the wings, drive through the legacy road, get out, mount the wings again, etc. I agree this is an improvement compared to, say, SSE3, but it is still a long way from a clean cut of the Gordian Knot the x86 ISA has become.
3) > Compilers should put a VEX prefix on all intrinsic functions if compiling for AVX.
Correct, although #pragmas exist that allow the user to specify how a block of intrinsics should be compiled.
What pragmas are those?
Can I specify that a block (or a function) should be compiled entirely without legacy code?
>> What pragmas are those? Can I specify that a block (or a function) should be compiled entirely without legacy code?
This question has come up again now that YMM is extended to ZMM. The recommendations for the use of VZEROUPPER have been reversed for the Knights Landing processor. I have started a new thread for this discussion:
In Skylake, it was changed to use Agner's Method #4, except that there are no false dependencies or blend penalties with 128-bit instructions when the upper halves of all of the YMM registers are known to be clean, not just the YMM register being modified.
There is no longer a one-time big penalty for saving and restoring the upper halves of YMM registers when transitioning, but if the upper half of any YMM is dirty, then every non-VEX SSE instruction which modifies an XMM register serializes with any other instructions which access that XMM or the larger containing YMM, and it incurs a blend penalty, which involves issuing another uop to merge the original upper half into the result.
Any AVX software which calls SSE software must use vzeroupper / vzeroall before the call, or the SSE software will run 2-8x slower. In a loop, that would add up to much worse penalties than the one-time switchover penalties of pre-Skylake.
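So on Skylake the caller-side rule is unchanged; a minimal sketch of the call site (`legacy_sse_function` is a made-up name):

```asm
vmulps  ymm0, ymm0, ymm1     ; 256-bit work: upper halves are now dirty
vzeroupper                   ; cheap; marks all upper halves clean
call    legacy_sse_function  ; the SSE code inside runs at full speed
```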