The solutions to this problem seen hitherto are the following:
The announced extension from 128-bit XMM to 256-bit YMM will use a combination of the above methods, according to the preliminary info published by Intel (http://softwareprojects.intel.com/avx/). To recap the documentation, all instructions that write to an XMM register will have two versions: A legacy version that modifies the lower half of the 256-bit register and leaves the upper part unchanged, and a new version of the same instruction with a VEX prefix that zeroes the upper half of the register. So the VEX version of a 128-bit instruction uses method (5) above. It is not clear whether the legacy version of 128-bit instructions will use method (2), (3) or (4). A new instruction VZEROUPPER clears the upper half of all the YMM registers, according to method (6).
Now, I wonder if we really need the complexity of having two versions of all 128-bit instructions. The possibility of writing to the lower half of a YMM register and leaving the upper half unchanged is needed only in the following scenario: A function using a full YMM register calls a legacy function which is unaware of the YMM extension but saves the corresponding XMM register before using it, and restores the value before returning. The calling function can then rely on the full YMM register being unchanged.
However, this scenario is only relevant if the legacy function saves and restores the XMM register, and this happens only in 64-bit Windows. The ABI for 64-bit Windows specifies that registers XMM6 - XMM15 have callee-save status, i.e. these registers must be saved and restored if they are used. All other x86 operating systems (32-bit Windows; 32- and 64-bit Linux, BSD and Mac) have no XMM registers with callee-save status. So this discussion is relevant only to 64-bit Windows. There can be no problem in any other operating system because there are no legacy functions that save these registers anyway.
The design of the AVX instruction set allows a possible amendment to the ABI for 64-bit Windows, specifying that YMM6 - YMM15 should have callee-save status. The advantage of callee-save registers is that local variables can be saved in registers rather than in memory across a call to a library function.
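As a sketch of what that advantage would look like (hypothetical code, assuming the amended ABI where YMM6 is callee-save; `libfunc`, `x` and `y` are made-up names):

```asm
; Hypothetical: YMM6 has callee-save status under the amended Win64 ABI.
; A live value can then stay in YMM6 across the library call instead of
; being spilled to memory and reloaded.
vmovaps ymm6, [x]          ; load working value
call    libfunc            ; callee must preserve ymm6 if it uses it
vaddps  ymm6, ymm6, [y]    ; ymm6 still valid: no spill/reload needed
```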
The disadvantage of this hypothetical specification of callee-save status for YMM6 - YMM15 in a future Windows 64 ABI is that we get a penalty for reading a full YMM register after saving and restoring the partial register. The cost of this is unknown as long as it has not been revealed whether method (2), (3) or (4) will be used. I assume, however, that the penalty will not be insignificant, because Intel's designers wouldn't have defined the VZEROUPPER instruction and recommended its use unless there is some situation where the penalty of partial register access is higher than the cost of zeroing the upper half of all sixteen YMM registers. But if VZEROUPPER is used for reducing the penalty of partial register access, then we have destroyed the advantage of callee-save status, because all the YMM registers are destroyed anyway. This is a catch-22 situation! If there is a significant penalty for partial register access then there is no point in giving the YMM registers callee-save status. If VZEROUPPER uses 16 micro-ops then I can't imagine any situation where it saves time. Either VZEROUPPER is very fast, or the penalty for partial register access is very high, or the use of VZEROUPPER is never advantageous. Can somebody please clarify?
So if my assumptions are correct, then the advantage of having two different versions of all 128-bit instructions is minimal at best. Now, let's look at the disadvantages:
If we have two versions of every library function then we don't have to care about YMM registers being saved across a call to a legacy library function, because the compiler will insert a call to the VEX version of the function, which can save and restore the full YMM registers if required by the ABI.
It would be nice to have some indication of whether the penalty for mixing VEX and non-VEX XMM instructions is so high that we need separate VEX and non-VEX versions of all library functions. It would also be nice to know if there are any situations where the partial register penalty is higher than the time it takes to execute the VZEROUPPER instruction.
The solution of having two versions of all XMM instructions looks to me like a shortsighted patch in an otherwise well designed and future-oriented ISA extension. The problem will appear again in all future extensions of the size of the vector registers. How do you plan to solve the problem next time the register size is increased? Will we have two versions of every YMM instruction when ZMM is introduced, and two versions of every ZMM instruction when.... This would be a waste of the few unused bits that are left in the VEX prefix.
It is probably too late to change the AVX spec now, although it looks to me like a draft, published prematurely as a response to AMD's SSE5 stunt?
Hi Professor Fog,
Thank you for your detailed and insightful comments.
Among the important boundary conditions to add to your list is that some of today's drivers use legacy SSE. These are being reached via interrupt, and the existing drivers can't save the upper part of the live YMM registers. Theoretically a new OS could take care of this for the ISR, but we didn't want to penalize users of the legacy architecture and could not mandate a major OS re-write.
Over a decade ago, when we defined the legacy SSE instructions, we had no vision of the way we were going to extend the vector length. The new Intel AVX 128 instructions are defined as zeroing the upper part of the register (bit 128 to infinity), and we have new state management (XSAVE/XRSTOR) that can manage state in a forward compatible manner. So if and when we extend the vector size further, the magnitude of this issue should be much smaller (driver writers, you have been warned :-)).
So in brief, our solution was:
1) Legacy and new code can intermix; the first transition from 256b usage to legacy 128b will have a transition penalty (in the Sandy Bridge implementation it will cost several tens of cycles depending on what's in the pipe; in extreme cases it can be much longer). The HW will take care to save the upper parts and later restore them (with a similar performance penalty) when returning to 256b code.
2) The compiler will map C or 128b intrinsics to either the legacy 128b or the new zeroing-upper 128b forms based on a switch. A programmer could also write in assembly, and they need to keep the above in mind. Inline assembly can also be encoded to the new zeroing-upper 128b forms based on a switch. The expected result of this is that mixing legacy SSE and new 256b instructions would mostly happen at or above the function level.
3) The vZeroUpper instruction is cheap, and the recommended ABI is to save the live upper part of the YMM registers and issue vZeroUpper prior to calling an external function. BTW, it will be a good programming practice to "deallocate" the upper parts (using vZeroUpper) or the whole registers (using vZeroAll) when finishing a part of the program that uses these registers. (Optimization recommendation: this will help free up resources in the OOO machine and reduce time spent in task switches.) The result of this will be that the only cost of this scheme is the small overhead of vZeroUpper.
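The recommended pattern at the end of a 256-bit block might look like this (a minimal sketch in assembly; the store target `result` is a made-up name):

```asm
vaddps  ymm0, ymm1, ymm2     ; ...256-bit computation...
vmovaps [result], ymm0       ; store the final result
vzeroupper                   ; "deallocate" the upper halves before returning
ret                          ; subsequent legacy SSE code pays no penalty
```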
4) We will add performance events that count the two transitions from #1 above to help with debugging any SSE->256b transitions.
To help understand the decision process a bit better, allow me to provide some historical context. Over the past several years we evaluated all the options you have above, but particularly options 1 and 5 (and of course the current option...). Options involving partial dependencies or partial state management were difficult to make fast and/or power efficient. It also did not seem right that the new state should penalize all users of the legacy architecture.
Your "option 1" is in a couple of ways a desirable architecture. It avoids any concern of legacy code perturbing your new state. On the other hand, there are advantages to being able to interoperate cleanly with the current XMM register state: it is entrenched in the ABIs (as you note, XMM is used for argument passing), and we wanted to bring the encoding advantages of the NDS and new operations to scalar and short vector code (scalar dominates most FP code today). The big problem is that doubling the name space for what is, in effect, the same functionality is inefficient both inside the processor and in the state save image.
Your "option 5" (the legacy instructions zero the new upper state) is really attractive. The real issue here is that popular operating systems do not preserve all state on switching to an interrupt service routine (ISR). A surprising number of Ring-0 components active during ISRs use XMMs - if only to move or initialize data. There are only two solutions to this problem: the OS must save all the new state, or the drivers must be rewritten. There is no way we could compel the industry to rewrite/fix all of their existing drivers (for example to use XSAVE) and no way to guarantee they would have done so successfully. Consider, for example, the pain the industry is still going through on the transition from 32 to 64-bit operating systems! The feedback we have from OS vendors also precluded adding overhead to ISR servicing to do the state management on every interrupt. We didn't want to inflict either of these costs on portions of the industry that don't even typically use wide vectors. Architecturally, therefore, we had to prevent legacy drivers from having side effects on the new (upper) state; this means legacy writes had to merge with the upper state. New drivers should be aware of how to use XSAVE to manage AVX state in a forward compatible manner so as not to break apps. So for the AVX-prefixed (short vector or scalar) instructions we allow a zeroing behavior.
What we decided to do was to optimize for the common scenario of uniform blocks of 128-bit (or all scalar) code separated from uniform blocks of 256-bit code. We maintain an internal record of when we transition between states where the upper bits contain something nonzero and a point where the state is guaranteed to be zero. We give you a fast (1* cycle throughput) way to the second state, VZEROUPPER (though VZEROALL, XRSTOR and reboot also work). Once you're in that state of zeroed-upperness, you can execute 128-bit code (or scalar) - VEX prefixed or not - and you pay no transition penalty. You can also transition back to executing 256-bit instructions with no penalty. You can transition freely between any VEXed instruction of any width and pay no penalty. The downside is that if you try to move from 256-bit instructions to legacy 128-bit instructions without that VZEROUPPER, you're going to pay. The way we chose to make you pay is optimized for the common use of long blocks of SSE code: you pay once during the transition to legacy 128-bit code instead of on every instruction. We do it by copying the upper 128 bits of all 16 registers to a special scratchpad, and this copying takes time (something like 50 cycles - still TBD). Then the legacy SSE code can operate as long as it wants with no penalty. When you transition back to a VEX-prefixed instruction, you have to pay the penalty again to restore state from that scratchpad.
The solution for this problem is for software to use VZEROUPPER prior to leaving your basic block of 256-bit code. Use it prior to calling any (ABI compliant) function, and prior to any blind return jump.
We've also expanded the ABI in a backwards compatible way - functions can declare and pass 256-bit arguments if so declared, but >>new state is caller save<<. Making ymm6-15 callee save on Windows-64 would require changes in several runtime libraries related to exception handling and stack unwinding, making the changes incompatible with the current libraries. Having all new state as caller-save is slightly inefficient, but the gain due to full backward compatibility outweighs the loss of performance. For full details on the ABI extensions for each OS, see the Spring Intel Developer Forum AVX presentation (https://intel.wingateweb.com/SHchina/published/NGMS002/SP_NGMS002_100r_eng.pdf). For code that doesn't vectorize, a mixture of scalar (or 128-bit vector) AVX and legacy 128-bit instructions is completely painless so long as the upper state is zero beforehand - I would rely on compliant callers to issue VZEROUPPER, or if paranoid you could zero yourself prior to executing the function body.
So back to your scenario: A function using 256-bit AVX instructions cannot assume the callee will not modify the high bits in the YMMs. Bits 128-255 of the YMMs are caller save. The caller must also issue a VZEROUPPER in case the callee hasn't been ported to use AVX. There's always a tradeoff in caller/callee save, and the benefit here is that a callee that doesn't want to use 256-bit instructions doesn't have to worry about preserving this new state.
> Now, I wonder if we really need the complexity of having two versions of all 128-bit instructions
The non-destructive source and the compact encoding are >major< performance features that apply to scalar and short vector forms. Indeed, I expect the performance upside of these on general purpose, compiled code (that doesn't always vectorize so well) to match the upside of the wider vector width. New operations like broadcast and true masked loads and stores and the 4th operand (for the new 3-source instructions) are only available under the VEX prefix. You can freely intermix 128-bit VEX instructions with legacy 128-bit instructions. And you can intermix blocks of 256-bit instructions with legacy 128-bit instructions so long as you follow two rules: use VZEROUPPER prior to leaving a basic block containing 256-bit and adhere to caller-save ABI semantics on new state.
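To illustrate that encoding advantage for scalar and short-vector code: the legacy destructive form often needs an extra register copy that the VEX non-destructive form avoids (a sketch; register choices are arbitrary):

```asm
; Legacy SSE: the destination is also a source, so preserving xmm0
; costs an extra MOVAPS.
movaps xmm2, xmm0
addps  xmm2, xmm1        ; xmm2 = xmm0 + xmm1

; AVX 128-bit (VEX, non-destructive source): one instruction, xmm0 intact.
vaddps xmm2, xmm0, xmm1
```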
> There will be a penalty for mixing the legacy XMM instructions using partial register writes with any of the full YMM instructions. Is there a penalty only when reading a full register after writing to the partial register, or are there other situations where mixing instructions with and without VEX causes delays?
Without VZEROUPPER, there will be a penalty when moving from 256-bit instructions to legacy 128-bit instructions, and another penalty when moving from 128-bit instructions back to 256-bit instructions. Consider this code:
VADDPS ymm0, ymm1, ymm2
ADDSS xmm3, xmm4
VSUBPS ymm0, ymm0, ymm2
Here we would have a penalty on the ADDSS and a second penalty on the VSUBPS. (In this case it would be best just to use the VEX form of ADDSS).
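For completeness, the penalty-free version of that sequence simply uses the VEX form throughout (sketch):

```asm
vaddps ymm0, ymm1, ymm2
vaddss xmm3, xmm3, xmm4    ; VEX form: no 256b -> legacy transition
vsubps ymm0, ymm0, ymm2    ; and no second penalty on the way back
```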
> Compilers will need a switch for compiling 128-bit XMM instructions with or without VEX prefix. Software developers will have problems avoiding the penalty of mixing code with and without VEX prefixes.
Absolutely correct on the compiler switch (the compiler has a switch to select between AVX and legacy SSE generation; on the Intel compiler it's QxG). But since the compilers (at least for high level languages) will generate VZEROUPPER and caller-save semantics on new state whenever they succeed in autovectorizing, there won't be penalties. There are still transition penalties that happen due to asynchronous events (such as ISRs that use XMMs), but these are not common enough to be a significant performance concern.
The tools will not generate VZEROUPPER on intrinsics or assembly - here, we give the developer the full right/control to shoot their performance in the foot. It is a particular concern of mine that people writing intrinsics be aware they still have to use VZEROUPPER!
> It will be very hard for programmers using vector intrinsics to avoid mixing the different kinds of instructions.
I hope it will not be too hard, but it is a concern. I've ported a number of applications with both an internal version of the Intel compiler - with and without intrinsic and assembly versions of some hotspots - and the transitions haven't shown up as a noticeable contribution. Autovectorizing compilers like the Intel compiler don't have any problems. In the Intel compiler, when you compile with QxG, all your intrinsics (128-bit or 256-bit) get a VEX prefix. The only diligence required on the part of the programmer is to ensure that you have a VZEROUPPER prior to leaving that block of intrinsics. We have a tool used internally (the Intel software development emulator) that identifies any transitions. We also plan to have a performance monitoring counter in the hardware to allow tools like Intel VTune to show transition penalties.
> Do you expect all function libraries using XMM registers to have two versions of every function: a legacy version for backwards compatibility, and a version with VEX prefixes on all XMM instructions for calling from procedures that use YMM registers?
What we wouldn't do is have lots of library versions just to work around the lack of a callee-save ABI. We would have (and do today) support multiple versions of our performance libraries optimized for different processors. But the 80/20 rule applies: most library functions are not critical performance bottlenecks, and the legacy (non-VEXed) implementations work there just fine - with no transition penalties to apps that use 256-bit AVX. In the long run, of course, there are no advantages to the non-VEX forms and we would like to move the industry in the direction of AVX.
> The solution of having two versions of all XMM instructions looks to me like a shortsighted patch in an otherwise well designed and future-oriented ISA extension
> The problem will appear again in all future extensions of the size of the vector registers. How do you plan to solve the problem next time the register size is increased? Will we have two versions of every YMM instruction when ZMM is introduced, and two versions of every ZMM instruction when.... This would be a waste of the few unused bits that are left in the VEX prefix.
I have a much stronger feeling that we have reached an excellent compromise. There are significant benefits to the non-destructive source and new instruction forms that apply to scalar and short vector operations - and if you don't want/need to port, we interoperate at very high performance with the legacy instructions. Software and tools developers have very few rules to avoid performance overhead: 1) when you use 256-bit forms of instructions, make sure you use VZEROUPPER when leaving the block, and 2) realize that the ABI remains caller save for new state. And of course - don't intentionally write code like the example above :-)
I'm glad you raised the future of ZMM. If/when we adopt the next natural extension (to 512 bits or whatever), we would face the same problems with 'legacy 256-bit' so long as the asynchronous parts of the system are not saving state in a forward-looking way, or the OS does not decide to step in and manage state on ISRs. So this is a call to the industry planning to adopt AVX in their interrupt service routines: use XSAVE. It's there to guarantee you can save both today's state and any future state.
> It is probably too late to change the AVX spec now, although it looks to me like a draft, published prematurely as a response to AMD's SSE5 stunt?
Let's keep having the dialogue, but I believe software apps will benefit greatly from the present architecture (and microarchitecture). As we did with the Nehalem instructions, we believe it's best for the software ecosystem to have lead time to prepare the OS, compilers, tools, and apps early. The AVX spec is not a draft :)
Outstanding questions again; my attempt to respond to the technical ones:
1) Comments regarding risk of using YMM at ring-0 are correct. The example given is similar to the current scenario where a ring-0 driver (ISR) vendor attempts to use floating-point state, or accidentally links it in some library, in OSs that do not automatically manage that context at Ring-0. This is a well known source of bugs and I can suggest only the following:
a. On those OSs, driver developers are discouraged from using floating-point or AVX
b. Driver developers should be encouraged to disable hardware features during driver validation (i.e. AVX state can be disabled by drivers in Ring-0 through XSETBV).
2) Correct on the state table. That's a nice way of representing these transitions. Also correct on the transition cost except that c->a is not fast. The transitions between b->c and c->b are indeed serializing with respect to other instructions that modify this state and transition penalty varies (on the first implementation) according to how much is in flight. Instructions on nontaken (speculative) paths do not cause transition penalties. The need to protect the scratchpad from speculation means c->a is slow.
(Bob Valentine, our front end architect, provided me the following using your notation):
from \ to     a      b      c
    a        fast   fast    ---
    b        fast   fast   slow
    c        slow   slow   fast
3) > Compilers should put a VEX prefix on all intrinsic functions if compiling for AVX.
Correct, although #pragmas exist that allow the user to specify how a block of intrinsics should be compiled.
> Compilers should put VZEROUPPER before and after any call to legacy functions if compiling for AVX
Compilers should put VZEROUPPER before any call to an ABI-compliant function and prior to leaving the basic block. It's not necessary to put VZEROUPPER after the ABI function; analysis may optimize this use. I would prefer not to phrase it in terms of "legacy functions", since it's not generally possible for a compiler to know that, and even new functions or libraries might use legacy SSE instructions.
4) >How can the compiler know if a library function uses VEX or not unless we have two versions of every library function?
We intend no distinction between the legacy and current ABI: whatever extensions are there are caller save or (for the special case of 256-bit arguments) represented in the signature. Not only should the compiler not have to know if a function has legacy SSE, we discourage it, as the callee is free to make whatever mixed use of legacy SSE and AVX they wish (for example, to optimize in case only scalar code is generated). As a result, the caller must save all upper state prior to entering any ABI compliant function.
For example: call function void libfoo(void) where YMM3 and YMM8 are live.
// save live state
VMOVAPS [rsp+off], YMM3 // xmm3 is volatile, hence the entire YMM3 is volatile
VEXTRACTF128 [rsp+off+32], YMM8, 1 // xmm8 is nonvolatile, but the upper bits are volatile. This is a fast instruction!
VZEROUPPER // mandatory since we have nonzero YMM upper state
CALL libfoo
// restore live state
VMOVAPS YMM3, [rsp+off]
VINSERTF128 YMM8, YMM8, [rsp+off+32], 1
Note that I explicitly discount non-ABI compliant calling conventions (e.g. fastcall) or interprocedural optimizations. Any compiler is free to innovate in whatever way they want in that respect, if they can convince the developer to use those features.
5) > Early publication and public discussion of proposed ISA changes is the best way to make the changes less painful. Please continue this policy.
Thank you and I appreciate your feedback!
Agner's questions are indeed something to think about. Too bad he isn't in the decision process.
Mark, all this looks awfully complicated for me, and I am experienced with asm/intrinsic code optimization. What do you think other less experienced developers who will be forced to deal with this mess will say, or even worse do?
There is also another valid question — Intel compiler will take care of those things but we all know that Windows n+1 will be compiled using Microsoft compiler and Linux n+1 will be compiled using gcc. How well do you think they will handle all this?
I haven't given AVX much thought so far given it is still a distant future for me, but after reading all this my opinion is that the software development will take a step backwards with it.
You have penalized developers by sheer design complexity, instead of penalizing those users who will be sticking to old hardware with broken drivers in 2010. That is just silly and I am disappointed.
It would be cheaper and more efficient in the long term if Intel just bought out those few companies producing bad drivers and fixed them, not to mention that the only affected OS is hot-patchable.
(bows politely in front of the sensei) You are welcome Mr. Fog!
But let me get this straight — I never said AVX doesn't have any good parts. I like three operand syntax and wider registers and some new instructions (although gather and scatter are still missing!). I just think that VZEROUPPER is EMMS all over again. We didn't really need that.
In my opinion this is still a hack — instead of doing it this way (supposedly to keep compatibility) it would have been better if OS vendors were forced to save/restore full state on context switch at all priority levels, and if the silicon wasted on implementing two versions of each instruction, fast VZEROUPPER and register scratchpad was used to dramatically reduce the cost of a full state save/restore on those context switches.
If Intel has already come out with all this information about AVX way ahead of the actual hardware, they could have as well insisted on doing it right this time — they should have asked Microsoft and hardware ISVs to fix their existing OS and drivers, or at least make the future OS and drivers right and accept that some users will have to upgrade their OS or a piece of hardware (which by the way is not uncommon when upgrading a CPU) if they want to enjoy the full benefits of AVX.
This looks to me like Intel engineers have designed an aeroplane which you can use to drive on a road, but the road must be compatible or else you have to get out, dismount the wings, drive through the legacy road, get out, mount the wings again, etc. I agree this is an improvement compared to, say, SSE3, but it is still a long way from a clean cut of the Gordian Knot the x86 ISA has become.
3) > Compilers should put a VEX prefix on all intrinsic functions if compiling for AVX.
Correct, although #pragmas exist that allow the user to specify how a block of intrinsics should be compiled.
What pragmas are those?
Can I specify that a block (or a function) should be compiled entirely without legacy code?
>> What pragmas are those? Can I specify that a block (or a function) should be compiled entirely without legacy code?
This question has come up again now that YMM is extended to ZMM. The recommendations for the use of VZEROUPPER have been reversed for the Knights Landing processor. I have started a new thread for this discussion:
In Skylake, it was changed to use Agner's Method #4, except that there are no false dependencies or blend penalties with 128-bit instructions when the upper halves of all of the YMM registers are known to be clean, not just the YMM register being modified.
There is no longer a one-time big penalty for saving and restoring the upper halves of YMM registers when transitioning, but if the upper half of any YMM is dirty, then every non-VEX SSE instruction which modifies an XMM register serializes with any other instructions which access that XMM or the larger containing YMM, and it incurs a blend penalty, which involves issuing another uop to merge the original upper half into the result.
Any AVX software which calls SSE software must use vzeroupper / vzeroall before the call, or the SSE software will run 2-8x slower. In a loop, that would add up to much worse penalties than the one-time switchover penalties of pre-Skylake.
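So on Skylake the caller-side rule is unchanged; a minimal sketch of the call site (`legacy_sse_function` is a made-up name):

```asm
vmulps  ymm0, ymm0, ymm1     ; 256-bit work: upper halves are now dirty
vzeroupper                   ; cheap; marks all upper halves clean
call    legacy_sse_function  ; the SSE code inside runs at full speed
```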