Solved: What is the status of VZEROUPPER use?

AFog0 · ‎11-25-2016

The problem with VZEROUPPER comes up again now that the recommendation for the Knights Landing processor is the opposite of previous processors.

The history is this: The extension of vector registers from 128 to 256 bits caused a problem when legacy Windows device drivers saved only the lower 128 bits of the new 256-bit registers. This problem was solved in a rather complex way. The Sandy Bridge processor could switch between a VEX state with full 256-bit registers and a non-VEX state where all the 256-bit registers were split into two 128-bit parts. The switch between these states had a cost of 70 clock cycles. The instruction VZEROUPPER was used for avoiding the cost of this state transition by clearing the upper half of all the registers. Alternatively, one could use VZEROALL to clear the whole registers. The code has to use VZEROUPPER after any code that uses 256-bit registers if there is any chance that the subsequent code contains non-VEX vector instructions. The recommendation from Intel was to use VZEROUPPER in AVX code before any call or return to an ABI-compliant function with unknown VEX status. The problems were discussed at length in this thread: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/301853

This recommendation is still included in the Optimization Reference Manual. However, the same manual says that VZEROUPPER is not recommended on the new Knights Landing processor. (The book Intel Xeon Phi Coprocessor High-Performance Programming, Knights Landing Edition, 2nd ed., Elsevier, 2016, written by three Intel developers, says the same.)

There is obviously a need to clarify these conflicting messages, now that the vector registers are extended further from 256 to 512 bits.

My own observations are these:

The following processors have expensive state transitions and cheap VZEROUPPER and VZEROALL: Sandy Bridge, Ivy Bridge, Haswell, and Broadwell. VZEROUPPER is needed for performance reasons on these processors.
There is no expensive state transition on the later Intel processors: Skylake and Knights Landing.
There is no expensive state transition on AMD processors.
VZEROUPPER and VZEROALL are expensive on Knights Landing. I have measured 36 clock cycles for both instructions in 64-bit mode (30 clock cycles in 32-bit mode).

It appears that VZEROUPPER is no longer needed on processors later than Broadwell and it is harmful on the first processor to support AVX512 (Knights Landing).

Since VZEROUPPER and VZEROALL affect only registers zmm0-zmm15, and not zmm16-zmm31, maybe we can avoid the need for these instructions by using only zmm16-zmm31.

In order to reach a new set of recommendations, I would like the Intel people to please answer these questions:

Is VZEROUPPER needed after AVX512 code that uses only registers zmm16-zmm31?
Will VZEROUPPER be needed for performance reasons on any processor that supports AVX512?
Will VZEROUPPER be needed for performance reasons on any future Intel processor?

If the answers to these questions are no, then I may propose the following guidelines:

AVX code should use VZEROUPPER before calling a library function or other function of unknown VEX status only on processors that support AVX but not AVX512.
A function library may have CPU dispatching with the following branches: (a) for processors that support SSE but not AVX, use non-VEX instructions. (b) for processors that support AVX but not AVX512, use VEX code and end with VZEROUPPER if any 256-bit registers have been used. (c) for processors that support AVX512, use VEX or EVEX code, don't use VZEROUPPER.
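
The three-way dispatch proposed above can be sketched in C as a small selection function. This is only an illustration of the proposal as stated at this point in the thread (replies further down argue for keeping VZEROUPPER on the Xeon line); the enum and function names are hypothetical, and the feature flags are assumed to come from a CPUID query done elsewhere.

```c
#include <stdbool.h>

/* Hypothetical names for the three branches (a), (b), (c) proposed above. */
enum simd_branch {
    BRANCH_SSE_NON_VEX,          /* (a) SSE but not AVX: non-VEX instructions */
    BRANCH_AVX_WITH_VZEROUPPER,  /* (b) AVX but not AVX512: VEX code, end with VZEROUPPER */
    BRANCH_AVX512_NO_VZEROUPPER  /* (c) AVX512: VEX or EVEX code, no VZEROUPPER */
};

/* Pick a branch from feature flags that the caller is assumed to have
   read from CPUID (including the OS-support checks). */
enum simd_branch select_simd_branch(bool has_avx, bool has_avx512f)
{
    if (has_avx512f) {
        return BRANCH_AVX512_NO_VZEROUPPER;
    }
    if (has_avx) {
        return BRANCH_AVX_WITH_VZEROUPPER;
    }
    return BRANCH_SSE_NON_VEX;
}
```

The point of the sketch is that the VZEROUPPER decision is tied to the instruction set the branch was built for, not to a list of processor models.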

Do you think these guidelines will work? It is important that we reach a useful set of recommendations now that people are beginning to make AVX512 code.

Travis_D_ · ‎05-09-2017

For what it's worth, the vzero family is still very much needed on Skylake. The big one-time state transition penalty has disappeared, but was replaced with the ongoing penalty of blending the high half of the register for all the non-VEX encoded instructions.

That penalty is often huge, perhaps a 2x to 6x slowdown, and it never ends while you run non-VEX code. So you very much need to do the zeroing on Skylake - not for "consistency" but for performance. This has been biting people repeatedly in weird and wonderful ways - the old one-time penalty probably occurred here too, but as a small one-time cost it would never have been noticed.

andysem · ‎11-26-2016

The problem with those guidelines is that code is written for a given generation of CPUs and then keeps running on later ones. In other words, code written for AVX/AVX2 will still issue VZEROUPPER on a CPU supporting AVX-512. If the instruction causes a significant penalty, this may result in performance loss of the code compared to previous CPU generations. Needless to say, that would be most unfortunate. As naive as it may sound, I would urge Intel to keep VZEROUPPER low-cost to simplify transition to newer CPUs, at least in the CPU domain as opposed to the accelerators domain (Xeon Phi, etc.)

MarkC_Intel · ‎12-01-2016

Xeon Phi and Xeon have different implementations and as a result have different answers to your 3 questions. The optimization guide covers both and the recommendations for handling VZEROUPPER for the two are different as you observed. We will look at the wording in the optimization guide to see if it can be made more clear.

For your 3 questions, I think the answers are, at least for the foreseeable future:

1. Is VZEROUPPER needed after AVX512 code that uses only registers zmm16-zmm31?

No

2. Will VZEROUPPER be needed for performance reasons on any processor that supports AVX512?

Yes for Xeon, No for Xeon Phi

3. Will VZEROUPPER be needed for performance reasons on any future Intel processor?

Yes for Xeon, No for Xeon Phi.

AFog0 · ‎12-01-2016

Thank you for your answer, Mark.

However, I still think that the ABI recommendations need more discussion and revision. I am not aware of any software producer, other than Intel, that can afford to make a separate version of their code for each microprocessor version on the market. The cost of developing, testing, verifying, and maintaining so many different software versions is simply too high, and the software will inevitably lag several years behind the hardware. In other words, we cannot expect software producers to make different versions of their software with and without VZEROUPPER and to maintain a list of which microprocessor models need one version or the other. We need a standard ABI recommendation that is likely to be optimal on all future processors. Since VZEROUPPER is not needed on Skylake and Knights Landing, I expect it to be unnecessary on future Intel processors as well.

You are recommending to use VZEROUPPER on Xeon, but the name Xeon is ambiguous. This brand name has been used for several generations of processors, including Broadwell that needs VZEROUPPER, and Skylake that doesn't need it.

MarkC_Intel · ‎12-01-2016

The Knights family is Xeon Phi. The design family ... Haswell, Broadwell, Skylake, etc. forms the basis for the client and Xeon chips.

To be clear, we very much still recommend using VZEROUPPER on Skylake. Even though it does not have the same penalties as earlier designs in that family for mixing AVX and SSE code, we definitely recommend using VZEROUPPER on Skylake.

Yes it would obviously be better if there were one solution. For code that has to run on both families, the "common code" solution is to use the Xeon guidelines.

AFog0 · ‎12-02-2016

Thank you Mark for clarifying the recommendations. Are you able to reveal the technical reasons for recommending VZEROUPPER on Skylake? I can imagine the following reasons:

1. For compatibility with previous processors. This doesn't apply to AVX512 code.
2. Some future Intel processors will revert to the Sandy Bridge AVX-state design.
3. To avoid a minor partial register stall on the Skylake or future processors. In this case the programmer can make a cost/benefit analysis depending on the number of registers affected and the cost of the VZEROUPPER.
4. To save power by using the smaller registers. Again, this is subject to a cost/benefit analysis.
5. Some future processor will need VZEROUPPER for a different technical reason, unknown to any current processor.

As you probably know, my optimization manuals are much used by compiler-makers and producers of performance-critical software, and people are relying on me for the best optimization advice. Your answers to these questions are therefore important.

MarkC_Intel · ‎12-02-2016

It is related to #3 on Skylake itself. I will poke around and see if I can say something more concrete.

AFog0 · ‎12-02-2016

I see. You are right. Executing a non-VEX SSE instruction on the Skylake has a false dependence on the previous value of the 256-bit register if the upper half of the register is dirty. The Skylake doesn't have AVX512, so the question concerns the successors of Skylake. Since registers zmm16-zmm31 are not cleared by vzeroupper, it will be interesting to know how the distinction between dirty and clean upper state is made:

at the level of the individual register (e.g. register zmm1 can be dirty, while register zmm2 is clean), there are ways to clean an individual register.
for register zmm0-zmm15 collectively (i.e. if zmm1 is dirty then zmm2 is also treated as dirty)
for all registers collectively (i.e. if zmm16 is dirty then zmm1 is also treated as dirty)

AFog0 · ‎12-28-2016

I just ran a few more experiments on a Haswell. It treats all vector registers as having a dirty upper half if just one ymm register has been touched. In other words, if you modify ymm1 then a non-VEX instruction writing to xmm2 will have a false dependence on the previous value of xmm2. Knights Landing has no such false dependence. Perhaps it is remembering the state of each register separately?

Hopefully, future Intel processors will either remember the state of each register separately, or at least treat zmm16-zmm31 separately so that they don't pollute xmm0-xmm15. Can you reveal something about this?

Maxim_M_1 · ‎06-10-2017

We have the following situation now about the VZEROUPPER/VZEROALL:

These instructions are not needed and are very costly on Xeon Phi (Knights Landing): 36 clock cycles for both instructions in 64-bit mode (30 clock cycles in 32-bit mode).
These instructions are very cheap and are needed on Xeon and Core processors (Skylake/Kaby Lake), and will be needed on Xeon for the foreseeable future, to avoid a costly transition to the non-VEX state.

The advertising materials claim that Xeon Phi (Knights Landing) is fully compatible with other Xeon processors.

Is there a reliable way to detect Xeon Phi, for the purpose of avoiding VZEROUPPER/VZEROALL?

Our code will be the following (if we have just used ymm0 and ymm1):

if [we are running on a Xeon Phi]
    vpxor ymm0,ymm0,ymm0
    vpxor ymm1,ymm1,ymm1
else
    vzeroall
endif

So how can we detect Xeon Phi (Knights Landing and later Xeon Phi processors) to implement the above code?

Does Intel plan to implement a CPUID bit that shows whether transitions between the VEX and non-VEX states are costly? For example:

Bit is set to 0 - VEX state transitions are costly, but VZEROUPPER/VZEROALL are cheap and should be used to clear the state;
Bit is set to 1 – there is no transition penalty, VZEROUPPER/VZEROALL is not needed.

The article https://software.intel.com/en-us/articles/how-to-detect-knl-instruction-support suggests checking the bits AVX-512F+CD+ER+PF, as introduced in Knights Landing.

So the code suggests to check all these bits at once, and if all are set, then we are on the Knights Landing.

uint32_t avx2_bmi12_mask = (1 << 16) | // AVX-512F
                           (1 << 26) | // AVX-512PF
                           (1 << 27) | // AVX-512ER
                           (1 << 28);  // AVX-512CD

Does Intel plan to add all of these bits to ordinary Xeon (non-Phi) or Core processors? In that case, we won't be able to distinguish Xeon from Xeon Phi.
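
For illustration, the all-bits check and the cleanup choice described above could look roughly like this in C. It is a sketch under assumptions: the function and enum names are hypothetical, and the argument is assumed to be the EBX value returned by CPUID leaf 7, subleaf 0, obtained elsewhere. The bit positions are the ones from the mask quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature bits in CPUID.(EAX=7,ECX=0):EBX, as listed in the mask above. */
#define AVX512F_BIT  (UINT32_C(1) << 16)
#define AVX512PF_BIT (UINT32_C(1) << 26)
#define AVX512ER_BIT (UINT32_C(1) << 27)
#define AVX512CD_BIT (UINT32_C(1) << 28)
#define KNL_FEATURE_MASK \
    (AVX512F_BIT | AVX512PF_BIT | AVX512ER_BIT | AVX512CD_BIT)

/* True only if all four Knights Landing feature bits are set.
   leaf7_ebx is assumed to come from a CPUID query done by the caller. */
bool looks_like_knights_landing(uint32_t leaf7_ebx)
{
    return (leaf7_ebx & KNL_FEATURE_MASK) == KNL_FEATURE_MASK;
}

/* Hypothetical names for the two cleanup strategies in the pseudocode above. */
enum upper_cleanup {
    CLEANUP_VPXOR_PER_REGISTER, /* zero only the registers that were used */
    CLEANUP_VZEROALL            /* clear everything with one instruction */
};

enum upper_cleanup choose_upper_cleanup(uint32_t leaf7_ebx)
{
    return looks_like_knights_landing(leaf7_ebx)
               ? CLEANUP_VPXOR_PER_REGISTER
               : CLEANUP_VZEROALL;
}
```

A processor that reports AVX-512F and AVX-512CD but not ER and PF fails this check and takes the VZEROALL path. The check identifies Xeon Phi only for as long as no other processor sets all four bits, which is exactly the concern raised here.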

Please advise.

Lee_K_Intel · ‎06-29-2017

See this comment about VZEROUPPER on Skylake, which is very different from its predecessors.

Knights Landing (KNL) is more aimed at High Performance Computing, and hence it does not have the same goals as more general-purpose CPUs. It is more tuned for throughput than latency, and for computational workloads rather than for desktop or server workloads. Mixing SSE and AVX code is considered much less likely on KNL, because the AVX-512 code has way more performance than SSE (you'd be wasting resources using SSE), and because KNL is not a general-purpose processor which you would expect its customers to run legacy SSE code on. So while SSE and VZEROUPPER exist on KNL, they should be avoided. Code which runs on KNL is almost always compiled and tuned specifically for KNL, and using the latest libraries (like MKL), so having to deal with legacy SSE code is not an issue.

Transition penalties on Knights Landing (from the Optimization manual, but with minor corrections):

A penalty is incurred if an Intel AVX instruction encoded with a vector length of more than 128 bits is allocated before the retirement of previous in-flight SSE instructions.
VZEROUPPER instruction throughput is slow, and is not recommended to preface a transition to SSE code after AVX code execution. The throughput of VZEROALL is also slow. Using either the VZEROUPPER or the VZEROALL instruction is likely to result in performance loss.

AFog0 · ‎04-23-2018

Now that Skylake processors with AVX512 are available, we are able to verify how it works. My tests show the following:

The processor will switch to the dirty state when YMM0-YMM15 or ZMM0-ZMM15 are touched, but not when ZMM16-ZMM31 are touched. It switches back to the clean state when VZEROUPPER or VZEROALL is executed. You can avoid state transitions and penalties by using ZMM16-ZMM31 only. Of course this doesn't work if vector registers are used for function parameters or function returns when the calling convention dictates that the lower vector registers must be used.

SSE instructions that write to an XMM register have a false dependence on any previous write to the same register when the processor is in the dirty state. This affects mostly move instructions, while most other SSE instructions have the destination register as input anyway. The performance cost is small in most cases. I cannot confirm the 2x to 6x slowdown that Travis claims.

The conclusion is: Use VZEROUPPER when leaving VEX code, or use ZMM16-ZMM31 only. You may need to do it differently on Knights Landing, where VZEROUPPER is expensive.
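
As a minimal sketch of the first half of that conclusion, the C function below uses 256-bit registers and executes VZEROUPPER (through the `_mm256_zeroupper()` intrinsic) before returning to code of unknown VEX status. It assumes GCC or Clang on x86; the function names are hypothetical, and a scalar fallback is used elsewhere. Compilers may already insert VZEROUPPER on their own at such points, so the explicit call is mainly there to make the idea visible.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_AVX_PATH 1

/* AVX path: compiled with AVX enabled for this function only. */
__attribute__((target("avx")))
static void add_floats_avx(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++) {
        dst[i] = a[i] + b[i]; /* remaining elements */
    }
    /* Leaving VEX code: clear the upper halves of ymm0-ymm15 so that
       later non-VEX SSE code does not run in the dirty state. */
    _mm256_zeroupper();
}
#endif

/* Public entry point: picks the AVX path at run time when available. */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
#ifdef HAVE_AVX_PATH
    if (__builtin_cpu_supports("avx")) {
        add_floats_avx(dst, a, b, n);
        return;
    }
#endif
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

On Knights Landing the trade-off is different, as noted above, so a library targeting it might skip the `_mm256_zeroupper()` call in a separate dispatch branch.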