Beginner

AVX-512 is a big step forward - but repeating past mistakes!

AVX512 is arguably the biggest step yet in the evolution of the x86 instruction set in terms of new instructions, new registers and new features. The first try was the Knights Corner instruction set. It had some problems, and AVX512 is better, so I am quite happy that AVX512 seems to be replacing the Knights Corner instruction set. But there are still some shortsighted issues that are likely to cause problems for later extensions.

We have to learn from history. When the 64-bit mmx registers were replaced by the 128-bit xmm registers, nobody thought about preparing for the predictable next extension. The consequence of this lack of foresight is that we now have the complication of two versions of all xmm instructions and three states of the ymm register file. We have to issue a vzeroupper instruction before every call and return to ABI-compliant functions, or alternatively make two versions of all library functions, with and without VEX.
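To illustrate the state problem (a sketch in x86-64 assembly; `legacy_sse_func` stands for any hypothetical non-VEX library function):

```asm
vaddps  ymm0, ymm1, ymm2   ; 256-bit AVX: upper ymm halves now 'dirty'
vzeroupper                 ; required: zero the upper halves -> clean state
call    legacy_sse_func    ; hypothetical non-VEX (legacy SSE) function
; Without the vzeroupper, the mix of dirty ymm state and non-VEX SSE
; code triggers a costly state-transition penalty.
```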

Such lack of foresight can be disastrous. Unfortunately, it looks like the AVX512 design is similarly lacking foresight. I want to point out two issues here that are particularly problematic:

  1. AVX512 does not provide for clean extensions of the mask registers
     
  2. The overloading of the register extension bits will mess up possible future expansions of the general purpose register space

First the new mask registers, k0 - k7. The manual says that these registers are 64 bits, yet there is no instruction to read or write more than 16 bits of a mask register. Thus, there is no way of saving and restoring a mask register that is compatible with the expected future extension to 64 bits. If it is decided to give some of the mask registers callee-save status, then there is no way of saving and restoring all 64 bits. We will be saving/restoring only 16 bits and zeroing the rest. Likewise, if an interrupt handler or device driver needs to use a mask register, it has no way of saving and restoring the full mask register short of saving the entire register file, which costs hundreds of clock cycles.
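A sketch of the save/restore problem in x86-64 assembly (kmovw, the only mask-register move in the base AVX512 specification, handles 16 bits):

```asm
; Saving and restoring a mask register with what AVX512 provides:
kmovw   eax, k1          ; reads only the low 16 bits of k1
mov     [rsp-8], eax     ; spill to the stack
; ... code that clobbers k1 ...
mov     eax, [rsp-8]
kmovw   k1, eax          ; writes 16 bits and zeroes the rest
; If k1 is later widened to 64 bits, bits 16..63 cannot be preserved.
```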

It is planned that the mask registers can grow to 64 bits, but not more, because they have to match the general purpose registers. Yet we can already predict that 64 bits will be insufficient within a few years. There seem to be plans to extend the vector registers to 1024 bits. Whether they should be extended further has perhaps not been decided yet (these extensions are certainly subject to diminishing returns). People are already asking for an addition to AVX512 to support vector operations on 8-bit and 16-bit integers. A 1024-bit vector of 8-bit integers will require mask registers of 128 bits. There are apparently no plans for how the mask registers can be extended beyond 64 bits, so we will need another clumsy patch at that time.

Let me suggest a simple solution to this problem: Drop the mask registers and allow 8 of the vector registers to be used as mask registers. Then we can be certain that the registers used for masks will never become too small because a mask will always need fewer bits than the vector it is masking. We have 32 vector registers now, so we can certainly afford to use a few of them as mask registers. I think, generally, that it is bad to have many different register types. It delays task switching, it makes the ABI more complicated, it makes compilers more complicated, and it fills up the already crowded opcode space with similar instructions for different register types. The new instructions for manipulating mask registers will not be needed when we use xmm registers for masks, because the xmm instructions provide most of this functionality already, and much more.
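Most of what the mask registers do is already expressible with vector registers; for example, a merge-masked add can be sketched with existing AVX instructions (ymm3 playing the role of the mask):

```asm
vcmpltps  ymm3, ymm1, ymm2        ; build the mask: all-ones where ymm1 < ymm2
vaddps    ymm4, ymm0, ymm5        ; compute the unmasked result
vblendvps ymm0, ymm0, ymm4, ymm3  ; merge: take the new value where the mask is set
```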

So let me propose: Drop the new mask registers and the instructions for manipulating them. Allow seven of the vector registers (e.g. xmm1 - xmm7 or xmm25 - xmm31) to be used as mask registers. All mask functionality will be the same as currently specified by AVX512. This will make future extensions problem-free and allow the synergy of using the same instructions for manipulating vectors and manipulating masks.

The second issue I want to point out relates to doubling the number of registers. AVX512 doubles the number of vector registers from 16 to 32 in 64-bit mode. It is natural to ask whether the number of general purpose registers can also be doubled. In fact it can, though this will be a little complicated. I have posted a comment on Intel's blog with a possible technical solution. I am not convinced that more general purpose registers will give a significant improvement in performance, but it is quite possible that we will need more registers in the future, perhaps for purposes that don't exist today. We should keep this in mind and keep the possibility open for having 32 general purpose registers in a future extension.

Unfortunately, AVX512 is messing up this possibility by overloading the register extension bits. The X bit is reused for extending the B bit, and the V' bit is reused for extending the X bit. This is a patch that fits only a very narrow purpose. It will be a mess if these bits are needed in future extensions for their original purpose. We need two more bits (B' and X') to make a clean extension of the register space. We can easily get one more bit by extending the 0x62 prefix byte into 0x60 and using bit 1 of the 60/62 prefix as e.g. register extension bit B'. The byte 0x60 is only vacant in 64-bit mode, but we don't need the register extension bit in 32-bit mode anyway. The bit that distinguishes AVX512 instructions from Knights Corner instructions can be used as the X' register extension bit. No CPU will ever be able to run both instruction sets, so we don't need this bit anyway.
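As a sketch of where the contested bits sit (my reading of the published EVEX encoding; simplified, bits shown uninverted):

```asm
; EVEX prefix layout (4 bytes), simplified:
;   P0: 0x62
;   P1: R X B R' 0 0 m m    ; REX-style register-extension bits
;   P2: W v v v v 1 p p     ; vvvv = non-destructive source register
;   P3: z L' L b V' a a a   ; aaa = mask register, V' extends vvvv
; When the modrm.rm field names a vector register, X is reused as its
; fifth (high) bit, and in VSIB addressing V' extends the index
; register -- the overloading objected to above, which leaves no
; spare bits (B', X') for 32 general purpose registers.
```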

There are other less attractive solutions in case the Knights Corner bit cannot be used, but anyway I think it is important to keep the possibility open for future extensions of the register space instead of messing up everything with short-sighted patches.

I will repeat what I have argued before, that instruction set extensions should be discussed in an open forum before they are implemented. This is the best way to prevent lapses and short-sighted decisions like these.

61 Replies
New Contributor III

bronxzv wrote:

andysem wrote: The masks returned by movmsk* instructions are actually very similar to those described in AVX-512,

yes, so we have basically 2 types of mask, thus my comment "otherwise we will have 3 types of masks"

Well, since they have the same representation and semantics, I don't separate them. But ok.

bronxzv wrote:

andysem wrote: The programmer's interface is also simplified, since there would be no need for __mmask8, __mmask16, etc. types but just __mmask64.

well, I suppose it's a matter of personal taste; I strongly (pun intended) prefer strong typing as offered by __m512, __m512i, __m512d, etc.

The parallel is not correct, the mask is always a mask. It's the number of bits that differ, not their semantics (i.e. integer vs FP). You do use __m512i to store different sized integers, after all.

bronxzv wrote:

andysem wrote: drawback is that movmsk* instructions won't be useful for mask construction, but given their poor performance and availability of the new cmp* instructions I don't think this would be an issue.

movmsk* instructions don't have poor performance in my experience, at least not on Intel's CPUs

movmsk* are slower than plain cmp* instructions, especially on older CPUs. Of course, we don't have timings for AVX-512 yet, but I hope cmp* in AVX-512 will have performance comparable to the previous extensions.
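The two ways of building a mask, sketched in assembly (AVX-512 syntax as published; actual timings are of course unknown):

```asm
; Old style: compare into a vector register, then extract the mask bits
pcmpeqd   xmm0, xmm1       ; per-element compare, result in xmm0
pmovmskb  eax, xmm0        ; compress sign bits into a GPR bit mask

; AVX-512 style: the compare writes a mask register directly
vpcmpeqd  k1, zmm0, zmm1   ; 16-bit mask in k1, no extraction step needed
```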

New Contributor II

andysem wrote:
movmsk* are slower than just cmp* instructions, especially on older CPUs.

they have been 1 clock throughput / 2 clocks latency since Conroe or even before, IIRC

another advantage of the AVX-512 tight masks (vs. your sparse masks proposal) that comes to mind is that they are directly suited to indexing lookup tables: it's doable to have a LUT with 16-bit indices, but obviously not with 64-bit indices
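For example, the classic left-packing idiom indexes a shuffle-control LUT directly with a tight mask (a sketch; `shuffle_lut` is a hypothetical 16-entry, 16-byte-aligned table):

```asm
cmpltps   xmm0, xmm1                ; per-lane compare
movmskps  eax, xmm0                 ; 4-bit tight mask in eax
shl       eax, 4                    ; scale: 16 bytes per table entry
movdqa    xmm2, [shuffle_lut + rax] ; fetch the shuffle control
pshufb    xmm3, xmm2                ; compact the selected lanes of xmm3
; With a sparse 128-bit mask there would be no practical way to form
; such a table index.
```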

Black Belt

>>>What I meant is that all 64 bits could be used to simplify signal routing and add support for 8/16-bit vector elements. I.e. in case of 8-bit elements every bit of the mask is in effect, in case of 16-bit elements - bits 1, 3, 5 and so on, 32-bit - 3, 7, 11 and so on, 64-bit - 7, 15, 23 and so on.>>>

Signal routing will probably be implemented at the control micro-instruction (uop) level. To represent 8/16/32/64-bit mask granularity, a 2-bit field would be needed (here I suppose that operands are not encoded in the micro-instruction and are simply decoupled). At the hardware level, masking could probably be performed by the ALU (which would probably need a specific control signal input), and it is interesting whether the same execution-port ALU will be responsible for performing the masking operation on AVX-512 vectors, thus possibly staying busy while vector integer code is being dispatched for execution.
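For reference, the spacing scheme quoted above maps lane i of an element size of e bytes to mask bit i*e + (e-1):

```asm
; Lane -> active mask bit, for a 512-bit vector:
;   8-bit elements  (e=1): lane i -> bit i        (0, 1, 2, ...)
;   16-bit elements (e=2): lane i -> bit 2i+1     (1, 3, 5, ...)
;   32-bit elements (e=4): lane i -> bit 4i+3     (3, 7, 11, ...)
;   64-bit elements (e=8): lane i -> bit 8i+7     (7, 15, 23, ...)
```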

New Contributor I

Agner wrote:
Whether it is too expensive to mask with 8-bit or 16-bit granularity is difficult to argue when the people who know the hardware details seem to be gagged.

GPU designers aren't doing it, and AVX-512 doesn't support it either. I think that tells us something about the significant cost of supporting fine-grained predication. And I have yet to come across a valuable use case for which 32-bit lanes or legacy blend instructions and vector-within-vector operations would not be acceptable.

Even if it is expensive to do low-granularity masking today, it may be cheap tomorrow, or they may have to do it anyway because of demand from software people. My point is that the design should be extensible because we can't predict the future. The history of x86 instruction sets is full of examples of very shortsighted solutions with no concern for extensibility. They are making the same mistake again with AVX512 by defining mask registers with no plan for extension beyond 64 bits.

Can't you see the irony in that? You say we can't predict the future, but you also say we should keep the mask registers extensible for low-granularity masking, which potentially hampers widening the SIMD width itself. That seems to me like the shortsightedness you intended to avoid in the first place. You have no cause for this other than an expectation that it might be demanded in the distant future. So in reality, trying not to make a prediction results in making a prediction as well. Engineers face this dilemma all the time, so in the end you just have to recognize that all the past 'mistakes' stem from the same limited foresight that we are facing now. So you can't call them mistakes unless you're willing to admit that your proposal might very well be the next one.

So, in my opinion, we have to try to predict the future to the best of our abilities and design for that, instead of designing for something we may or may not need. Worst case, you're creating something that's really powerful but for a slightly limited range of applications. Looking at GPUs and how many people are trying to use them for general-purpose computing, it's not going to be bad at all to use them as the guideline. Bringing that technology into CPU cores in a homogeneous manner in and of itself widens the range of applications. And GPUs have proven that a minimal lane width of 32 bits is not a significant limitation. For the few cases where you need maximum 8-bit performance, vector-within-vector instructions offer a practical solution. So why take any unnecessary risks?

New Contributor III

bronxzv wrote:

andysem wrote: movmsk* are slower than just cmp* instructions, especially on older CPUs.

they are 1 clock throughput / 2 clock latency since Conroe or even before IIRC

As per the Intrinsics Guide, on Sandy/Ivy Bridge, pmovmskb has a latency of 2 clocks and a throughput of 1 clock; pcmpeqb and pcmpgtb - 1/0.5. On Conroe/Wolfdale the timings are [unknown] (probably 1?)/1 and 1/0.33 respectively. On NetBurst CPUs the difference is more pronounced: 7/2 and 2/2. I'm not sure about AMD CPUs, but I think pmovmskb is very slow on some models.

Beginner

Well, now they have announced 8-bit and 16-bit granularity after all, in a subset of AVX512 named AVX512BW; see https://software.intel.com/en-us/blogs/additional-avx-512-instructions

Apparently, AVX512BW will be supported as early as Skylake in 2015, but not in the Knights Landing processor.

Now, this already exhausts the capacity of the 64-bit mask registers. So now the big question is: will future extensions to 1024 or 2048 bits have only 32- and 64-bit granularity, or will the mask registers be redesigned to make them bigger? Or will there be no future extensions?
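The arithmetic behind "exhausts the capacity":

```asm
;  512-bit vector / 8-bit elements =  64 lanes -> all 64 mask bits used
; 1024-bit vector / 8-bit elements = 128 lanes -> would need 128 bits
; 2048-bit vector / 8-bit elements = 256 lanes -> would need 256 bits
```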

New Contributor II

Agner wrote:

Well, now they have announced 8-bit and 16-bit granularity after all, in a subset of AVX512 named AVX512BW; see https://software.intel.com/en-us/blogs/additional-avx-512-instructions

Apparently, AVX512BW will be supported as early as Skylake in 2015, but not in the Knights Landing processor.

Now, this already exhausts the capacity of the 64-bit mask registers. So now the big question is: will future extensions to 1024 or 2048 bits have only 32- and 64-bit granularity, or will the mask registers be redesigned to make them bigger? Or will there be no future extensions?

The notation dst[MAX:512] := 0 used throughout the Intrinsics Guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide/) hints at a future extension to 1024 bits. IMHO the granularities will be kept but the masks will simply grow larger, i.e. a new __mmask128 type will be introduced.

New Contributor III

I must say I'm really happy that Intel decided to introduce AVX512BW and AVX512VL, although the announcement only mentions Xeon CPUs and doesn't say anything explicitly about desktop CPUs. I hope that the support is implied.

 

Black Belt

It could be interesting to compare the performance of a CPU 512-bit SIMD vector unit to GPU CUDA shader processors on a large matrix computation.


I think Intel and people using or developing with SIMD extensions should pay attention to the thermal/power restrictions and performance penalties that new SIMD extensions introduce in the actual implementation.

For example, the new Xeon E5 v3 (Haswell core) drops its speed (CPU clock) to new lower limits: an "AVX base clock", which is lower than the nominal CPU clock, and an "AVX Turbo clock", which is lower than the nominal CPU Turbo clock.

So, the new E5-2699 v3 Xeon, using all cores, runs at 2.3GHz (base) to 2.8GHz (Turbo) with all workloads except AVX.

When it encounters AVX code the CPU frequency drops to 1.9GHz (Base) with an upper limit of 2.3GHz (Turbo) for heavy AVX usage.

This kind of performance loss in the actual implementation of a SIMD architecture could be more significant than even the design decisions of the architecture.

Black Belt

>>>When it encounters AVX code the CPU frequency drops to 1.9GHz (Base) with an upper limit of 2.3GHz (Turbo) for heavy AVX usage.>>>

Interesting information. May I ask what the source of that information is?

On the other hand, it is understandable that some power consumption restrictions are in place. A heavy AVX workload will utilize many of the 168 available floating-point physical registers while transferring data over 256-bit data paths between the registers and the execution units.

Black Belt

Thank you Nikos.

Black Belt

I have not been able to find the presentation(s) on the Intel site, but these features are described at:

http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/5

I have only done a little bit of testing on the Haswell EP systems here, but it is clear that the Turbo behavior is quite different from what I see on the Sandy Bridge EP (Xeon E5-2680).

One thing I did note is that the reduction in frequency appears to be triggered by 256-bit operations, not by AVX encodings -- so if your code uses AVX encodings for scalar or 128-bit packed operations you will not see the clock reduction.   This makes for increasingly subtle performance tradeoffs...

Black Belt

Aha -- using the first reference in Nikos note, I found that the frequency limits for the Xeon E5 v3 parts are all documented in the "Intel Xeon Processor E5 v3 Product Families: Specification Update" (document 330785-002, October 2014), which is currently available at http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v3-spec-up...

The "standard" Turbo boost frequencies for each processor model are in Table 2 and the "AVX" turbo boost frequencies are in Table 3.

Beginner

Does this AVX-specific Turbo restriction affect desktop and mobile parts, or only Xeons?

Black Belt

angus-hewlett wrote:

Does this AVX-specific Turbo restriction affect desktop and mobile parts, or only Xeons?

Did you try it?  

On the i5-4200U, I continue to observe performance cutbacks which persist well into single-threaded regions when running 2 hyperthreads per CPU, but I don't observe a connection with AVX or even 128-bit parallel SIMD. The effect is much more pronounced when running with the Intel OpenMP library, as compared with GNU OpenMP. Performance with Intel OpenMP depends on limiting threads to 1 per CPU.

I haven't seen documentation as to why there is no BIOS option to disable hyperthreading on some HSW CPUs, or why, on others, tinkering with that option prevents the OS from starting up.


Using my Haswell Core i7-4790, I see no restrictions in turbo mode when running AVX applications.

BUT, the huge amount of power needed to run the task (depending on the AVX optimizations) can lead to elimination of Turbo mode by triggering CPU thermal/power throttling, or can lead to an even lower CPU clock, especially if you don't have proper cooling for the CPU.

For example, running Prime95 v28.5, which is a heavily AVX2 (FMA) optimized application, without proper cooling I never saw the max Turbo clock of 3.8GHz; instead, from the beginning the clock was 3.6GHz (the max non-Turbo clock), and after a few minutes the CPU throttled to 3.5GHz, 3.4GHz, etc.

So, there are no actual, strict and direct restrictions, BUT in real life you could see CPU throttling with AVX-optimized apps.

Black Belt

Probably a lot of energy is consumed by the AVX register file, hence the CPU throttling when AVX code is scheduled to run.

Novice

Has Intel changed anything in response to the above feedback?
