Intel® ISA Extensions

AVX-512 is a big step forward - but repeating past mistakes!

AFog0
Beginner

AVX512 is arguably the biggest step yet in the evolution of the x86 instruction set in terms of new instructions, new registers and new features. The first try was the Knights Corner instruction set. It had some problems, and AVX512 is better, so I am quite happy that AVX512 seems to be replacing the Knights Corner instruction set. But there are still some shortsighted issues that are likely to cause problems for later extensions.

We have to learn from history. When the 64-bit mmx registers were replaced by the 128-bit xmm registers, nobody thought about preparing for the predictable next extension. The consequence of this lack of foresight is that we now have the complication of two versions of all xmm instructions and three states of the ymm register file. We have to issue a vzeroupper instruction before every call and return to ABI-compliant functions, or alternatively make two versions of all library functions, with and without VEX.
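To make the cost concrete, here is a minimal sketch (C with AVX intrinsics; `legacy_sse_routine` is a hypothetical stand-in for separately compiled non-VEX code) of where the vzeroupper has to go:

```c
#include <stddef.h>
#include <immintrin.h>

/* Stand-in for a separately compiled legacy (non-VEX) SSE routine; in the
   real scenario this lives in a library built without AVX. */
void legacy_sse_routine(float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += 1.0f;
}

void scale(float *data, size_t n)
{
    const __m256 factor = _mm256_set1_ps(2.0f);
    for (size_t i = 0; i + 8 <= n; i += 8)
        _mm256_storeu_ps(data + i,
                         _mm256_mul_ps(_mm256_loadu_ps(data + i), factor));

    /* Clear the upper ymm halves before entering non-VEX code; otherwise
       the CPU falls into the slow ymm-state transitions described above. */
    _mm256_zeroupper();
    legacy_sse_routine(data, n);
}
```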

Such lack of foresight can be disastrous. Unfortunately, it looks like the AVX512 design is similarly lacking foresight. I want to point out two issues here that are particularly problematic:

  1. AVX512 does not provide for clean extensions of the mask registers
     
  2. The overloading of the register extension bits will mess up possible future expansions of the general purpose register space

First the new mask registers, k0 - k7. The manual says that these registers are 64 bits, yet there is no instruction to read or write more than 16 bits of a mask register. Thus, there is no way of saving and restoring a mask register that is compatible with the expected future extension to 64 bits. If it is decided to give some of the mask registers callee-save status, then there is no way of saving and restoring all 64 bits. We will be saving/restoring only 16 bits and zeroing the rest. Likewise, if an interrupt handler or device driver needs to use a mask register, it has no way of saving and restoring the full mask register short of saving the entire register file, which costs hundreds of clock cycles.
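For illustration, here is the best a compiler, driver, or interrupt handler can do today (a sketch in GNU C inline asm, assuming -mavx512f, where KMOVW is the only instruction that moves a mask register):

```c
/* Spilling k1 can never capture more than its low 16 bits. */
static inline unsigned save_k1(void)
{
    unsigned lo16;
    __asm__ volatile ("kmovw %%k1, %0" : "=r" (lo16));
    return lo16;                 /* bits 16..63 of k1 are silently lost */
}

static inline void restore_k1(unsigned lo16)
{
    __asm__ volatile ("kmovw %0, %%k1" : : "r" (lo16));
                                 /* bits 16..63 of k1 come back as zero */
}
```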

It is planned that the mask registers can grow to 64 bits, but not more, because they have to match the general purpose registers. Yet, we can predict already now that 64 bits will be insufficient within a few years. There seem to be plans to extend the vector registers to 1024 bits. Whether they should be extended further has perhaps not been decided yet (these extensions are certainly subject to diminishing returns). People are already asking for an addition to AVX512 to support vector operations on 8-bit and 16-bit integers. A 1024-bit vector of 8-bit integers will require mask registers of 128 bits. There are apparently no plans for how the mask registers can be extended beyond 64 bits, so we will need another clumsy patch at that time.

Let me suggest a simple solution to this problem: Drop the mask registers and allow 8 of the vector registers to be used as mask registers. Then we can be certain that the registers used for masks will never become too small because a mask will always need fewer bits than the vector it is masking. We have 32 vector registers now, so we can certainly afford to use a few of them as mask registers. I think, generally, that it is bad to have many different register types. It delays task switching, it makes the ABI more complicated, it makes compilers more complicated, and it fills up the already crowded opcode space with similar instructions for different register types. The new instructions for manipulating mask registers will not be needed when we use xmm registers for masks, because the xmm instructions provide most of this functionality already, and much more.

So let me propose: Drop the new mask registers and the instructions for manipulating them. Allow seven of the vector registers (e.g. xmm1 - xmm7 or xmm25 - xmm31) to be used as mask registers. All mask functionality will be the same as currently specified by AVX512. This will make future extensions problem-free and allow the synergy of using the same instructions for manipulating vectors and manipulating masks.
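The idiom this proposal generalizes exists in SSE4.1/AVX already, where a compare result sitting in an ordinary vector register drives a blend directly; a minimal sketch:

```c
#include <immintrin.h>

/* A ymm register used directly as a mask (AVX): the compare produces
   all-ones/all-zeros fields and the blend selects per field. */
__m256 clamp(__m256 x, __m256 limit)
{
    __m256 mask = _mm256_cmp_ps(x, limit, _CMP_GT_OQ); /* 1s where x > limit */
    return _mm256_blendv_ps(x, limit, mask);           /* limit where masked */
}
```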

The second issue I want to point out relates to doubling the number of registers. AVX512 doubles the number of vector registers from 16 to 32 in 64-bit mode. It is natural to ask whether the number of general purpose registers can also be doubled. In fact, it can, though this will be a little complicated. I have posted a comment on Intel's blog with a possible technical solution. I am not convinced that more general purpose registers will give a significant improvement in performance, but it is quite possible that we will need more registers in the future, perhaps for purposes that don't exist today. We should keep this in mind and keep the possibility open for having 32 general purpose registers in a future extension.

Unfortunately, AVX512 is messing up this possibility by overloading the register extension bits. The X bit is reused for extending the B bit, and the V' bit is reused for extending the X bit. This is a patch that fits only a very narrow purpose. It will be a mess if these bits are needed in future extensions for their original purpose. We need two more bits (B' and X') to make a clean extension of the register space. We can easily get one more bit by extending the 0x62 prefix byte into 0x60 and using bit 1 of the 60/62 prefix as e.g. register extension bit B'. The byte 0x60 is only vacant in 64-bit mode, but we don't need the register extension bit in 32-bit mode anyway. The bit that distinguishes AVX512 instructions from Knights Corner instructions can be used as the X' register extension bit. No CPU will ever be able to run both instruction sets, so we don't need this bit anyway.

There are other less attractive solutions in case the Knights Corner bit cannot be used, but anyway I think it is important to keep the possibility open for future extensions of the register space instead of messing up everything with short-sighted patches.

I will repeat what I have argued before: instruction set extensions should be discussed in an open forum before they are implemented. This is the best way to prevent lapses and short-sighted decisions like these.

 

QIAOMIN_Q_
New Contributor I

Good catch, I agree with your last comment.

---QIAOMIN.Q

capens__nicolas
New Contributor I

Hi Agner,

Great to see you on this forum! Here are my interpretations of why certain AVX-512 design decisions were made:

Agner wrote:

  • AVX512 does not provide for clean extensions of the mask registers

First the new mask registers, k0 - k7. The manual says that these registers are 64 bits, yet there is no instruction to read or write more than 16 bits of a mask register. Thus, there is no way of saving and restoring a mask register that is compatible with the expected future extension to 64 bits. If it is decided to give some of the mask registers callee-save status, then there is no way of saving and restoring all 64 bits. We will be saving/restoring only 16 bits and zeroing the rest. Likewise, if an interrupt handler or device driver needs to use a mask register, it has no way of saving and restoring the full mask register short of saving the entire register file, which costs hundreds of clock cycles.

The currently defined instructions which operate on the mask registers all end in "W" to indicate they're 16-bit. I imagine that AVX-1024 would have "D" variants to operate on 32 bits. I don't see much of a problem there, and 64-bit should be possible as well.

I don't see any value in giving some of the mask registers callee-save status. AVX-512 is intended to accelerate the execution of data-parallel loops (i.e. executing multiple iterations with each instruction in a SIMD fashion). There should be no function calls in the loop, unless they can be inlined (meaning there is no actual call). If things can't be inlined then the loop is not likely to be suitable for AVX-512 in the first place. But if you look at GPUs and how they can still use their wide vectors to execute quite complex compute shaders, there really aren't many data-parallel workloads which can't be vectorized.

I didn't double-check, but I don't think interrupt handlers or drivers can use the mask registers, or any part of AVX-512 for that matter. In my opinion any actual computing should happen in user space anyway. It is why device drivers have a user space component now as well, which improves both robustness and performance.

It is planned that the mask registers can grow to 64 bits, but not more, because they have to match the general purpose registers. Yet, we can predict already now that 64 bits will be insufficient within a few years. There seem to be plans to extend the vector registers to 1024 bits. Whether they should be extended further has perhaps not been decided yet (these extensions are certainly subject to diminishing returns). People are already asking for an addition to AVX512 to support vector operations on 8-bit and 16-bit integers. A 1024-bit vector of 8-bit integers will require mask registers of 128 bits. There are apparently no plans for how the mask registers can be extended beyond 64 bits, so we will need another clumsy patch at that time.

GPUs do not have instructions which operate on 8-bit or 16-bit (except possibly as the inner fields of a few vector-within-vector instructions). The way they handle 8-bit and 16-bit variables in the source code is to always expand it to 32-bit. Each mask bit therefore controls a 32-bit field. Yet they are clearly not significantly limited by this approach as they can process trillions of operations per second. So I see no compelling reason why AVX-512 would need instructions and mask registers for 8-bit or 16-bit fields. If more computing power is desired it should be achieved through widening the vectors, adding more vector units per core, and/or adding more cores. This would benefit all workloads, not just 8/16-bit ones. Vector-within-vector instructions can be very useful but only a limited set is required and the mask bits can still apply to whole 32-bit fields.

So it's important to realize that AVX-512 is not the 512-bit extension of AVX2. It is intended for an SPMD approach the way a GPU does it. Some multimedia applications which operate mainly on 8-bit and 16-bit data might be better off sticking with AVX2. Just like on a GPU, AVX-512 implementations will probably have a relatively low bandwidth per FLOP ratio. That's fine if you're doing lots of computations on relatively little data with lots of reuse of intermediate results. If on the other hand you're streaming data through and only perform a few operations on it like in some multimedia applications, then AVX2 might already be bandwidth limited so there would be no use for extending it to 512-bit for such applications.

Let me suggest a simple solution to this problem: Drop the mask registers and allow 8 of the vector registers to be used as mask registers. Then we can be certain that the registers used for masks will never become too small because a mask will always need fewer bits than the vector it is masking. We have 32 vector registers now, so we can certainly afford to use a few of them as mask registers. I think, generally, that it is bad to have many different register types. It delays task switching, it makes the ABI more complicated, it makes compilers more complicated, and it fills up the already crowded opcode space with similar instructions for different register types. The new instructions for manipulating mask registers will not be needed when we use xmm registers for masks, because the xmm instructions provide most of this functionality already, and much more.

I don't think that's an option. Two FMA instructions per cycle per core already results in 6 vector register file accesses per cycle (although many operations could come from the bypass network, but that's affected too). Using the vector registers as masks as well would require even more accesses per cycle, and more dependency checking, and I doubt that's reasonably feasible (without exceeding power consumption targets). So from that perspective it's good to have different register types as they are fully independent and don't require additional ports for the same register file or more dependencies to check for.

The only other option would be to not do the mask operation as part of the same instruction. The vblendv instruction has been quite useful for implementing SIMD branching in SSE and AVX. But as you know it takes two uops; one to extract the MSBs from a vector register to form a compact mask and one vblend operation (where the compact mask is passed in where the immediate operand would otherwise be used). So basically it's three times slower than doing the mask operation as part of an arithmetic instruction. Perhaps that can be improved on though since FMA takes three full width input operands. Also, branching is relatively rare in data-parallel code and only results which are used after the branch merge point need vblendv operations (using mask registers would typically also affect temporary results which aren't used outside of the basic block).
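For comparison, the two styles side by side (a sketch using Intel's published AVX and AVX-512F intrinsics; `_mm256_blendv_ps` compiles to the vblendv discussed above):

```c
#include <immintrin.h>

/* AVX: masking costs a separate compare and blend around the operation. */
__m256 add_where_positive_avx(__m256 acc, __m256 x)
{
    __m256 m   = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_GT_OQ);
    __m256 sum = _mm256_add_ps(acc, x);
    return _mm256_blendv_ps(acc, sum, m);  /* extra uops per masked result */
}

/* AVX-512: the k-register predicate folds the blend into the add itself. */
__m512 add_where_positive_avx512(__m512 acc, __m512 x)
{
    __mmask16 m = _mm512_cmp_ps_mask(x, _mm512_setzero_ps(), _CMP_GT_OQ);
    return _mm512_mask_add_ps(acc, m, acc, x);
}
```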

Anyway, having the mask registers has the additional benefit of being able to completely clock gate the unused lanes to save power. Note also that this wouldn't be possible if the masks were full width vector registers which control how the result gets blended into the destination register on a per-bit basis instead of per-field. So I'm sure that Intel has evaluated these options and found that dedicated compact mask registers were well worth it.

I don't think it adds much compiler complication. After all you can start with straightforward code which uses blend instructions, and then iteratively optimize it by making use of the mask registers. Note again that you shouldn't have to deal with saving mask registers across calls at all.

  • The overloading of the register extension bits will mess up possible future expansions of the general purpose register space

The second issue I want to point out relates to doubling the number of registers. AVX512 doubles the number of vector registers from 16 to 32 in 64-bit mode. It is natural to ask whether the number of general purpose registers can also be doubled. In fact, it can, though this will be a little complicated. I have posted a comment on Intel's blog with a possible technical solution. I am not convinced that more general purpose registers will give a significant improvement in performance, but it is quite possible that we will need more registers in the future, perhaps for purposes that don't exist today. We should keep this in mind and keep the possibility open for having 32 general purpose registers in a future extension. Unfortunately, AVX512 is messing up this possibility by overloading the register extension bits. The X bit is reused for extending the B bit, and the V' bit is reused for extending the X bit. This is a patch that fits only a very narrow purpose. It will be a mess if these bits are needed in future extensions for their original purpose. We need two more bits (B' and X') to make a clean extension of the register space. We can easily get one more bit by extending the 0x62 prefix byte into 0x60 and using bit 1 of the 60/62 prefix as e.g. register extension bit B'. The byte 0x60 is only vacant in 64-bit mode, but we don't need the register extension bit in 32-bit mode anyway. The bit that distinguishes AVX512 instructions from Knights Corner instructions can be used as the X' register extension bit. No CPU will ever be able to run both instruction sets, so we don't need this bit anyway.

I don't think there's much of a point in having 32 general purpose registers for x86, even when looking far into the future. Having 16 registers instead of 8 (or 6 if you don't count the stack registers) made a measurable difference, but 32 registers only make sense if you have lots of temporary results or you wish to use unrolling. Typical scalar code is actually quite branchy so you barely ever have long basic blocks which have a use for more than 16 registers. And the exceptions are more often than not data-parallel code, for which you can/should use AVX-512 (which is why it does have 32 registers).

Yes, ARMv8 does have 32 scalar registers but the effort for achieving that was much lower than what it would cost x86, and I suspect that they didn't recognise that any code which has some benefit from 32 registers would actually be better off with wider vector instructions. So I find that AVX-512 does a good job at creating a balanced homogeneous high-throughput ISA with plenty of opportunity for future extensions.

That said, you bring up some interesting ideas for instruction encodings, but I think they're better reserved for other extensions than for having 32 general purpose registers.

I will repeat what I have argued before: instruction set extensions should be discussed in an open forum before they are implemented. This is the best way to prevent lapses and short-sighted decisions like these.

It's great that Intel now shares information on future ISA extensions ahead of the hardware being available, but I'm not sure if it's realistic or useful to go a step further and engage in discussions on an open forum. Many companies' employees are restricted or strongly discouraged from posting on public forums due to the risk of revealing company IP or revealing plans to the competition. So Intel's software partners would just discuss things privately instead. Also, the independent people who understand all the aspects that are impacted are few and far between. With all due respect for your much appreciated evaluations of x86 architectures, it looks like even you didn't fully understand how AVX-512 is intended to be used and why certain design decisions were made. Don't get me wrong, I'm not claiming to understand every aspect either, and we'd probably need information that only Intel has access to in order to make a full assessment. Also note that from a PR point of view Intel probably doesn't want to sound desperate by explicitly asking the general public for help.

So the openness they have right now and our ability to discuss things here is probably the best we should expect and certainly offers a lot of room for suggestions. So please continue to visit these forums to share your valuable opinions.

AFog0
Beginner

Thanks for trying to clarify this.
You talk about how GPUs do things. AVX512 is not only for GPUs; it is planned to be implemented in mainstream CPUs in 2015, as I understand it. It is going to be used for all kinds of purposes. This would certainly include data types that fit into 8 or 16 bits, such as text processing, encryption, file compression, DNA analysis, finite field algebra, and probably a lot of other applications that we haven't thought of.

The Knights Corner/Xeon Phi instruction set, which has no official name and no CPUID bit, was introduced as an instruction set for a MIC coprocessor, but it was designed so that it can be combined with the existing x86 instruction set. Later it was modified and renamed to AVX-512. Knights Corner was based on the old Pentium microarchitecture, and strangely supports the obsolete x87 but not SSE. It can convert integers of all sizes to floats when loading from memory and convert back to integer when saving. This means that a vector of 16 8-bit integers is handled as 16 32-bit floats. This is certainly an inefficient and power-wasting way of handling 8-bit data.
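In plain C, the load-with-upconvert behaviour amounts to roughly the following (a conceptual sketch, not actual Knights Corner intrinsics):

```c
#include <stdint.h>

/* Conceptual model of KNC's converting loads and stores: 16 packed 8-bit
   integers occupy a full 512-bit register as 16 floats. */
void knc_style_scale(uint8_t *dst, const uint8_t *src)
{
    float zmm[16];                /* stands in for one 512-bit register */
    for (int i = 0; i < 16; i++)
        zmm[i] = (float)src[i];   /* up-convert on load */
    for (int i = 0; i < 16; i++)
        zmm[i] *= 2.0f;           /* all arithmetic is done on floats */
    for (int i = 0; i < 16; i++)
        dst[i] = (uint8_t)zmm[i]; /* down-convert on store */
}
```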

With AVX-512 we have to handle 8-bit and 16-bit integers either with AVX2 using full size masks, or by converting to 32-bit integers and using AVX-512 with bit masks. It is inconvenient to have two different kinds of masks. I am sure that the software community will ask for support for 8-bit and 16-bit integers in zmm registers. In fact, some have already asked for this.
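The widening route looks roughly like this with AVX-512F intrinsics (a sketch; note that VPMOVDB truncates, so saturating behaviour would need extra instructions):

```c
#include <stdint.h>
#include <immintrin.h>

/* 16 bytes -> 16 dword lanes, masked add, then narrow back (AVX-512F). */
void add_bias_masked(uint8_t *dst, const uint8_t *src, __mmask16 m, int bias)
{
    __m512i v = _mm512_cvtepu8_epi32(_mm_loadu_si128((const __m128i *)src));
    v = _mm512_mask_add_epi32(v, m, v, _mm512_set1_epi32(bias));
    _mm_storeu_si128((__m128i *)dst, _mm512_cvtepi32_epi8(v)); /* truncates */
}
```

Only 16 of the 64 byte lanes a zmm register could hold carry payload here, which is exactly the inefficiency at issue.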

c0d1f1ed wrote:

Using the vector registers as masks as well would require even more accesses per cycle, and more dependency checking, and I doubt that's reasonably feasible (without exceeding power consumption targets). So from that perspective it's good to have different register types as they are fully independent and don't require additional ports for the same register file or more dependencies to check for.

Intel processors have a limitation of two input dependencies per µop, except for FMA instructions. AMD has never had any such limitation; I don't know why. You seem to know about internal technical details that I have no access to. As I understand you, it is easier to check for an extra input dependency if it is from a different register file?

c0d1f1ed wrote:

Anyway, having the mask registers has the additional benefit of being able to completely clock gate the unused lanes to save power. Note also that this wouldn't be possible if the masks were full width vector registers which control how the result gets blended into the destination register on a per-bit basis instead of per-field. So I'm sure that Intel has evaluated these options and found that dedicated compact mask registers were well worth it.

My idea is to use only one bit per field. When masking a vector of 32 floats, I would use only the lower 32 bits of an xmm register. Why do you think it is easier to turn off unused parts of the execution unit when the mask bits are in a mask register than in an xmm register or general purpose register?
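For reference, the ISA already compacts a full-width mask into one bit per field in a general purpose register (a sketch; `__builtin_popcount` is GCC/Clang-specific):

```c
#include <immintrin.h>

/* VMOVMSKPS compacts a full-width compare result into one bit per
   32-bit field, landing in an ordinary general purpose register. */
unsigned count_above(const float *p, float limit)
{
    __m256 v = _mm256_loadu_ps(p);
    __m256 m = _mm256_cmp_ps(v, _mm256_set1_ps(limit), _CMP_GT_OQ);
    unsigned bits = (unsigned)_mm256_movemask_ps(m);  /* 8 mask bits */
    return (unsigned)__builtin_popcount(bits);
}
```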

c0d1f1ed wrote:

Many companies' employees are restricted or strongly discouraged from posting on public forums due to the risk of revealing company IP or revealing plans to the competition.

That's an old fashioned way of thinking. The trend goes towards more openness. People want open standards today, not de facto industry standards decided through short-sighted market considerations.

The x86 instruction set encoding is an ugly patchwork generated by a long history of short-sighted solutions and dirty patches. AVX512 only makes this worse. The instruction length decoder is a power-hungry bottleneck due to all the complicated workarounds. Things would be much simpler today if Intel had made more future-oriented decisions along the way.

For example, there are three bytes in the opcode map that are used for old undocumented opcodes dating back to the first 8086 processor. I can't believe that all processors today still support these undocumented opcodes. If Intel had made an official announcement 30 years ago that these undocumented opcodes could not be used, then they could have disabled them 25 years ago and they would be available for extensions today. Instead, they are doing all kinds of desperate tricks to squeeze more and more instruction codes into an already overloaded opcode map. And yet, there is still no official statement about these undocumented opcodes. Why not disable them today so that they will be available for other purposes in some unknown future? (They are disabled in 64-bit mode anyway.)

In the same way, I think we need a decision about the future of the x87 and mmx registers. Many people regard them as obsolete, and I think they are quite costly to implement in the CPU pipeline. Microsoft tried to disable them in 64-bit Windows, but changed their mind. If the x87 registers should ever be disabled then it has to be announced maybe 10 or 20 years in advance. That's why it is not too early to discuss the future of x87 and perhaps an alternative solution for high precision math.

We need long term planning, but AVX512 is repeating past mistakes by making short-term patches that cannot easily be extended, should the need arise in the future.

Bernard
Valued Contributor I

>>>So from that perspective it's good to have different register types as they are fully independent and don't require additional ports for the same register file or more dependencies to check for.>>>

Do you mean architectural registers or registers of physical register file?

Btw very interesting post.

Bernard
Valued Contributor I

>>>I didn't double-check, but I don't think interrupt handlers or drivers can use the mask registers, or any part of AVX-512 for that matter. In my opinion any actual computing should happen in user space anyway.>>>

I think that a kernel-mode driver performing floating point calculations which involve AVX instructions will need to preserve its floating point context by calling KeSaveExtendedProcessorState. Now, regarding future AVX-512 usage in kernel mode, it depends on support from the OS. IIRC it is unsafe to use AVX registers during an ISR routine.

Moving some of the functionality to a user-mode driver is related to reducing the costly context switches between user-mode and kernel-mode components.

capens__nicolas
New Contributor I

Agner wrote:
Thanks for trying to clarify this.

You talk about how GPUs do things. AVX512 is not only for GPUs; it is planned to be implemented in mainstream CPUs in 2015, as I understand it. It is going to be used for all kinds of purposes. This would certainly include data types that fit into 8 or 16 bits, such as text processing, encryption, file compression, DNA analysis, finite field algebra, and probably a lot of other applications that we haven't thought of.

Yes, AVX-512 definitely seems intended to be implemented in CPUs. And that's exactly what I'm talking about. But you shouldn't forget where the design ideas originated from, as they determine the intended manner in which they should be used (but not limit what they should be used for)!

GPUs achieve their phenomenal computing power through multiple very wide SIMD units in which each lane processes one instance of a scalar 'program'. This can be a pixel shader, a vertex shader, a compute shader, etc. These programs are really just the inner code of a (nested) loop; they are executed on thousands of elements. So multiple (scalar) iterations of this loop are bundled together to form vectorized code. But each lane of the vector ALUs on which it is executed is 32 bits wide, even when the program processes 8-bit or 16-bit data.

Again note that this is all the GPU needs to achieve its massive throughput. It can do 8-bit and 16-bit processing perfectly well, including for all the applications you mention. So AVX-512 brings that same technology straight into the CPU cores. There's no reason to suddenly go and change the fundamental computing concepts for these SIMD units. Each lane processes one iteration of a loop. If your inner loop code happens to be processing 8-bit or 16-bit values, they just get expanded to 32-bit, just like on a GPU, and just like on a GPU you'll still get a ton of computing power.

Trying to instead process 16 x 32-bit and 32 x 16-bit and 64 x 8-bit breaks down this concept of one loop iteration per lane. Switching the number of things you process in parallel requires lots of data swizzling instructions which rapidly defeats the benefits of processing more elements in the first place. The only valuable compromise is having small 32-bit or 64-bit vector operations within each lane, thus forming vector-within-vector instructions. Only a limited number of those would be of value (possibly inspired by MMX).

The only difference between AVX-512 and the GPU is that now this computing power will become part of the CPU cores. And that's the really revolutionary part. Using GPUs for generic applications has been notoriously difficult. Whenever your software has a big data processing loop the developer would have to isolate it and rewrite it in an SPMD fashion (the loop iteration is removed but you tell an API you want N number of instances of it, which could be executed across many SIMD lanes and many cores and many threads). There's significant overhead in migrating tasks to a different processor type, and the tasks have to be large enough to potentially compensate for that. So often the developer has to find ways to combine multiple algorithms into even bigger loops. This is very hard and never a perfect process because scalar and vectorizable code becomes entangled. The inner code can't be too big either because GPUs have very limited register and cache storage per scalar strand. It gets even worse when trying to target multiple architectures with wildly different characteristics. So it's an impossible balancing act for anything that isn't as embarrassingly parallel as graphics. Heterogeneous GPGPU is destined to fail despite AMD's gallant efforts to lower part of the overhead.

Intel appears to still want to tap into the good bits of a GPU architecture, but makes it developer-friendly by unifying the CPU and GPU into a homogeneous architecture. It even becomes compiler-friendly. But to get back to my point, don't try to make AVX-512 into something it's not that would defeat this unification with GPU technology.

The Knights Corner/Xeon Phi instruction set, which has no official name and no CPUID bit, was introduced as an instruction set for a MIC coprocessor, but it was designed so that it can be combined with the existing x86 instruction set. Later it was modified and renamed to AVX-512. Knights Corner was based on the old Pentium microarchitecture, and strangely supports the obsolete x87 but not SSE.

Intel's first attempt at stealing the thunder of the GPU companies was to try and build a discrete GPU with CPU qualities: Larrabee. This was a mistake. It would have taken billions of dollars to get enough of a market share for it that developers would take a peek at it for anything other than graphics. But it's still heterogeneous, and still suffers from all its inherent overhead. Meanwhile Intel already has a huge market share for its CPUs and homogeneous computing is much more welcomed by developers. So they can just incrementally introduce GPU technology into the CPU cores at relatively limited cost and risk.

AVX-512 is a bit more revolutionary than evolutionary because it breaks away from the legacy way of thinking about CPU vector instructions, which have tightly packed 8-bit and 16-bit data. GPUs show that this isn't necessary. In fact it causes a great deal of complication with ever wider vectors. So it's a necessary break and developers who program at the assembly level have to get educated about the change in paradigm.

It can convert integers of all sizes to floats when loading from memory and convert back to integer when saving. This means that a vector of 16 8-bit integers is handled as 16 32-bit floats. This is certainly an inefficient and power-wasting way of handling 8-bit data.

Is it really? AMD's top dog GPU has 2816 scalar 32-bit ALUs on a single die. And it's only going to continue to increase! So ALUs are cheap. Even if we look at the CPU the scalar ALUs are all 64-bit and yet we use them for 16-bit and 8-bit values all the time. It's not worth it trying to use 16-bit and 8-bit ALUs instead. At a system level it's insignificant and not power-wasting at all to have 64-bit ALUs. Of course you can stuff 8 x 8-bit in 64-bit registers and process them with 64-bit vector ALUs, but that's not hugely more power-efficient than expanding them to 8 x 32-bit, especially when you take into account that those 256-bit vector ALUs are mostly used for processing actual 32-bit elements and the whole thing can scale up to 512-bit and beyond if it's not encumbered by having to cater for tightly packed 8-bit data.

And again, vector-within-vector instructions offer the best of both vector paradigms for certain use cases.

With AVX-512 we have to handle 8-bit and 16-bit integers either with AVX2 using full size masks, or by converting to 32-bit integers and using AVX-512 with bit masks. It is inconvenient to have two different kinds of masks. I am sure that the software community will ask for support for 8-bit and 16-bit integers in zmm registers. In fact, some have already asked for this.

I don't think it's that inconvenient to have different kinds of masks. Try to focus your reasoning on programming for a single lane in a high-level language. Either you use an if() statement, which at the vector assembly level can use the dedicated mask registers, or you just use AND/OR operators with integers as masks, which become full vectors at the assembly level after vectorization. So they really are very different but well understood concepts in a high-level language. It only gets confusing or inconvenient when you try to reason at the assembly level.
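In single-lane source terms, the distinction looks like this (a minimal sketch):

```c
/* The two mask kinds as they appear in scalar, per-lane source code:
   after vectorization the if() becomes a k-register predicate, while
   the bitwise AND stays an ordinary full-width vector operation. */
int lane(int x, int y, int keep_mask)
{
    int r = x;
    if (x < y)             /* control flow -> compact mask register */
        r = y;
    return r & keep_mask;  /* integer bitmask -> full vector AND */
}
```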

I hate to say this but the time when programming in assembly made you smart or cool is coming to an end. It's not a wise thing to do any more and mostly only compiler writers should concern themselves with it. But don't despair. There was a time when people who programmed in assembly were sneered at and looked down on by the people who were programming in binary. It's a good thing that we got past that and many great things came out of higher-level programming. Only the writers of compiler back-ends should worry about binary. Don't get me wrong, I find it incredibly valuable for high-level programmers to know assembly, but only in as far as it makes them better high-level programmers. I wrote some compilers myself, but basically to get away from writing assembly.

AVX-512 will create a new era in computing, but the vast majority of developers should be able to use it effortlessly or even unknowingly in a high-level language. So only issues at the assembly level that would affect things at a high-level language level, should be brought to Intel's attention. Having two types of masking at the assembly level is a trivial concept at the high-level language, and only a minor inconvenience for compiler writers.

Quote:

Using the vector registers as masks as well would require even more accesses per cycle, and more dependency checking, and I doubt that's reasonably feasible (without exceeding power consumption targets). So from that perspective it's good to have different register types as they are fully independent and don't require additional ports for the same register file or more dependencies to check for.

Intel processors have a limitation of two input dependencies per µop, except for FMA instructions. AMD has never had any such limitation; I don't know why. You seem to know about internal technical details that I have no access to. As I understand you, it is easier to check for an extra input dependency if it is from a different register file?

Actually I think your documents on micro-architecture are what first made me aware of register file port limits... Anyway, since then I've also gained some common knowledge of digital circuit design and what makes circuits more complex or consume more power. I can highly recommend "Digital Integrated Circuits" by Rabaey et al.

I don't have any knowledge of Intel's design but I believe multi-banking is now the norm for creating multi-ported register files. But you still have to balance the number of banks against the number of bank conflicts, and each bank can be multi-ported itself. Either way using vector registers as masks and expecting the masking to be done in the same pipelined instruction requires extra reads from the same register file, which definitely makes them more complex and power hungry. Note that GPUs even avoid the complexity of multi-banking, by reading the operands over multiple cycles. They have multiple register files which are independent because they are used for different threads/fibers. CPUs can't do that because they are expected to have high single-threaded performance, but clearly for AVX-512 to compete with GPUs in performance/Watt they need to minimize the required ports per register file. Hence having a separate register file for the masks is a good thing.

The scheduler has to check for the extra dependency on the mask register, regardless of what register file it's from. I forgot about that before. If they keep the unified scheduler then I imagine the result bus is shared between scalar and vector units so that doesn't cause any extra routing. There might be some clever reuse of the result bus for flags.

Quote:

Anyway, having the mask registers has the additional benefit of being able to completely clock gate the unused lanes to save power. Note also that this wouldn't be possible if the masks were full width vector registers which control how the result gets blended into the destination register on a per-bit basis instead of per-field. So I'm sure that Intel has evaluated these options and found that dedicated compact mask registers were well worth it.

My idea is to use only one bit per field. When masking a vector of 32 floats, I would use only the lower 32 bits of an xmm register. Why do you think it is easier to turn off unused parts of the execution unit when the mask bits are in a mask register than in an xmm register or general purpose register?

Sorry, I didn't get that you'd compact the mask into the lower part of a vector register. But that would mean that depending on the instruction the mask bits would apply to 8-bit, 16-bit, 32-bit or 64-bit fields. They'd have to be routed to very different locations based on the instruction type, and it gets worse as the SIMD width gets larger. Routing things that far with different path lengths depending on the instruction type is very difficult at ~4 GHz. Note that GPUs don't have to route masks across lanes at all. Each lane just operates as if it's a scalar core, and they each can have their own register file for the masks, which are single bits per scalar core. They practically don't have to know they're part of a SIMD array, since there's no data being exchanged between lanes (some exceptions exist).

So for AVX-512 to be power efficient and scalable you need to keep mask bits close to the lanes they affect.

Quote:

Many companies' employees are restricted or strongly discouraged from posting on public forums due to the risk of revealing company IP or revealing plans to the competition.

That's an old fashioned way of thinking. The trend goes towards more openness. People want open standards today, not de facto industry standards decided through short-sighted market considerations.

I applaud the trend towards openness. I'm just trying to be realistic and I'm afraid there's still a lot of old fashioned thinking. A large portion of knowledgeable people just won't contribute to an open forum, for various reasons. There are also lots of armchair chip designers who, while intelligent, just lack the inside knowledge for why their proposal won't offer a good cost/gain balance. You can't expect Intel engineers to engage with them for the small chance of picking up a valuable idea. Don't get me wrong, hardware engineers should talk a lot to software engineers, but I'm not sure an open forum is the single most efficient way to exchange information on needs and limitations. It is one end of the communication spectrum, with a lot of noise, and I think the open discussions we have right here are the best interaction with Intel that individual developers can hope for.

The x86 instruction set encoding is an ugly patchwork generated by a long history of short-sighted solutions and dirty patches.

I agree it's a patchwork and not a pretty one, but technology doesn't have to be pretty (on the inside) to be commercially successful. x86 has been a huge success for multiple decades because it's an extendable patchwork. Could you have designed it significantly better while still keeping it successful for the needs during all these years? I mean, sure, in hindsight someone could easily design a better 64-bit ISA than x86-64, for today's needs. But that architecture would have been very inefficient to implement a decade or two ago. So x86 has managed to stay relevant by extending the patchwork to the best of the possibilities available at each point in time. It has fantastic backward compatibility, and the value of this should not be underestimated. The advantages of a new ISA would not outweigh the disadvantages of breaking backward compatibility. There's still room for more patchwork before we get to that point.

Note that ARM is relatively young but it's starting to show signs of becoming a patchwork as well and it's going to get worse if it wants to survive and retain backward compatibility. It's just an inevitability even for the smartest engineers in the world.

AVX512 only makes this worse. The instruction length decoder is a power-hungry bottleneck due to all the complicated workarounds. Things would be much simpler today if Intel had made more future-oriented decisions along the way.

First of all let's not forget that decoding instructions with an average length of maybe 6 bytes shouldn't have to cost more power than loading the data for it and executing it if it's 512-bit wide. Also, Intel's proven ability to compete with ARM designs in the mobile market with both Haswell and Atom architectures shows that the ISA isn't a huge limiting factor.

And again it's questionable whether they could have made more future-oriented decisions while still being as successful every step of the way. The instructions you might find useless today must have had a significant purpose at some point. Or maybe they were even thinking they were being future-oriented, but things went into a different direction than expected. The things you're proposing today could be equally well intentioned but turn out to be a limiting factor for the future.

For example, there are three bytes in the opcode map that are used for old undocumented opcodes dating back to the first 8086 processor. I can't believe that all processors today still support these undocumented opcodes. If Intel had made an official announcement 30 years ago that these undocumented opcodes could not be used, then they could have disabled them 25 years ago and they would be available for extensions today. Instead, they are doing all kinds of desperate tricks to squeeze more and more instruction codes into an already overloaded opcode map. And yet, there is still no official statement about these undocumented opcodes. Why not disable them today so that they will be available for other purposes in some unknown future? (They are disabled in 64-bit mode anyway.)

I'm not sure which bytes you're talking about exactly, but some have been repurposed for 64-bit mode so that's exactly the kind of useful deprecation you're talking about. Also repurposing them for 32-bit mode would be useless since everything is evolving toward 64-bit.

In the same way, I think we need a decision about the future of the x87 and mmx registers. Many people regard them as obsolete, and I think they are quite costly to implement in the CPU pipeline. Microsoft tried to disable them in 64-bit Windows, but changed their mind. If the x87 registers should ever be disabled then it has to be announced maybe 10 or 20 years in advance. That's why it is not too early to discuss the future of x87 and perhaps an alternative solution for high precision math.

Agreed that it's the right time to think about a new purpose for that opcode space. AMD already dropped 3DNow! support. x87 and MMX are still used quite a bit though by legacy software that is still in use. I don't think they're costly to implement, after all they were introduced many years ago and now they only cost a tiny fraction of the transistor budget and it keeps getting cheaper. But their opcode space could be quite valuable.

Clothoid
Beginner

c0d1f1ed wrote:

Again note that this is all the GPU needs to achieve its massive throughput. It can do 8-bit and 16-bit processing perfectly well, including for all the applications you mention. So AVX-512 brings that same technology straight into the CPU cores. There's no reason to suddenly go and change the fundamental computing concepts for these SIMD units. Each lane processes one iteration of a loop. If your inner loop code happens to be processing 8-bit or 16-bit values, they just get expanded to 32-bit, just like on a GPU, and just like on a GPU you'll still get a ton of computing power.

An important piece of functionality that has to be supported by the ISA to enable this is gather operations that can upconvert 8- and 16-bit data types into 32-bit types, and scatter operations that can downconvert from 32-bit to 8- and 16-bit data types. The current Xeon Phi ISA and the original Larrabee ISA supported this functionality, but I am not sure if it was forgotten in AVX-512.

At least when working with the latest Intel compiler 14.0 SP1 Update 1, my impression is that upconversion and downconversion with gather/scatter are missing in AVX-512.

andysem
New Contributor III

c0d1f1ed wrote:

GPUs do not have instructions which operate on 8-bit or 16-bit (except possibly as the inner fields of a few vector-within-vector instructions). The way they handle 8-bit and 16-bit variables in the source code is to always expand it to 32-bit. Each mask bit therefore controls a 32-bit field. Yet they are clearly not significantly limited by this approach as they can process trillions of operations per second. So I see no compelling reason why AVX-512 would need instructions and mask registers for 8-bit or 16-bit fields. If more computing power is desired it should be achieved through widening the vectors, adding more vector units per core, and/or adding more cores. This would benefit all workloads, not just 8/16-bit ones. Vector-within-vector instructions can be very useful but only a limited set is required and the mask bits can still apply to whole 32-bit fields.

So it's important to realize that AVX-512 is not the 512-bit extension of AVX2. It is intended for an SPMD approach the way a GPU does it. Some multimedia applications which operate mainly on 8-bit and 16-bit data might be better off sticking with AVX2. Just like on a GPU, AVX-512 implementations will probably have a relatively low bandwidth per FLOP ratio. That's fine if you're doing lots of computations on relatively little data with lots of reuse of intermediate results. If on the other hand you're streaming data through and only perform a few operations on it like in some multimedia applications, then AVX2 might already be bandwidth limited so there would be no use for extending it to 512-bit for such applications.

I humbly disagree that 8/16 bit operations are not needed. It sounds like "640k will be enough for anyone."

Bandwidth has a potential for increase (DDR4, Crystal Well) while CPU clocks and CPI seem to be close to their limits. Even with current technologies, I find it hard to believe that multimedia applications won't benefit from wider vectors, given 8/16 bit operations are available. After all, if it's slower than memcpy then chances are high it's computation-bound, and this is usually the case in my (related to multimedia) practice. It is often possible to design your data processing workflow to conserve bandwidth and favor more intermediate calculations. It may not be easy, but this makes it possible to increase performance when wider vectors are available. And having wide vectors without being able to operate on the smaller units just feels like a waste to me.

c0d1f1ed wrote:

Anyway, having the mask registers has the additional benefit of being able to completely clock gate the unused lanes to save power.

I'm not sure clock gating actually does any benefit in this case. You either use masks+xmm registers, or only xmm or none of them (in case of the generic x86 code). Clock gating only saves power in the second case, relative to the first one. But if you use xmm registers as masks you actually make the first case equivalent to the second one and thus more power efficient.

c0d1f1ed wrote:

I don't think it adds much compiler complication. After all you can start with straightforward code which uses blend instructions, and then iteratively optimize it by making use of the mask registers. Note again that you shouldn't have to deal with saving mask registers across calls at all.

Not sure how much complication it brings but it requires more from the register allocator and dependency tracker in the compiler. And the compiler may have to spill/restore mask registers in contexts other than function calls. For example, the algorithm may require more masks than mask registers available in the CPU.

And having to call a function in a SIMD-optimized loop is not that far-fetched an example, actually. For instance, I once implemented some string formatting routines optimized for SSE/AVX, which had to call a function to store the partial results in the STL stream. Naturally, I had to vzeroall before the call in the case of AVX, and in the case of SSE the compiler generated xmm register spills/restores around it. You may argue that this is not a good algorithm for SIMD-optimization but I think otherwise. As long as it's faster than the scalar version, it is a good candidate.

andysem
New Contributor III

c0d1f1ed wrote:

Even if we look at the CPU the scalar ALUs are all 64-bit and yet we use them for 16-bit and 8-bit values all the time. It's not worth it trying to use 16-bit and 8-bit ALUs instead. At a system level it's insignificant and not power-wasting at all to have 64-bit ALUs. Of course you can stuff 8 x 8-bit in 64-bit registers and process them with 64-bit vector ALUs, but that's not hugely more power-efficient than expanding them to 8 x 32-bit, especially when you take into account that those 256-bit vector ALUs are mostly used for processing actual 32-bit elements and the whole thing can scale up to 512-bit and beyond if it's not encumbered by having to cater for tightly packed 8-bit data.

And again, vector-within-vector instructions offer the best of both vector paradigms for certain use cases.

Using wider unit sizes and vector-within-vector tricks often requires additional overhead to handle the overflow and saturation behaviour that comes naturally with native support for narrow unit sizes. Unit size conversion also takes clocks.
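For example (a sketch with AVX2 intrinsics): a saturating byte add is a single native instruction, while the widened representation must clamp by hand:

```c
#include <immintrin.h>

/* Native: 32 byte lanes, saturation built into the instruction. */
__m256i sat_add_native(__m256i a, __m256i b)
{
    return _mm256_adds_epu8(a, b);
}

/* Widened route: byte values kept in 32-bit lanes must clamp explicitly. */
__m256i sat_add_widened(__m256i a32, __m256i b32)
{
    __m256i sum = _mm256_add_epi32(a32, b32);
    return _mm256_min_epu32(sum, _mm256_set1_epi32(255)); /* emulated saturation */
}
```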

Bernard
Valued Contributor I

>>>I can highly recommend "Digital Integrated Circuits" by Rabaey et al.>>>

If you are interested in the design of digital logic I would like to recommend a fairly advanced book on digital system design and firmware algorithms written by Milos D. Ercegovac, "Digital Systems and Hardware/Firmware Algorithms".

Bernard
Valued Contributor I

>>> (the loop iteration is removed but you tell an API you want N number of instances of it, which could be executed accross many SIMD lanes and many cores and many threads). >>>

Actually it is a GPU kernel whose N instances are executed by N threads, each of them operating on one data element.

AFog0
Beginner

c0d1f1ed wrote:

But you shouldn't forget where the design ideas originated from, as they determine the intended manner in which they should be used (but not limit what they should be used for)!

This is the kind of thinking I am warning about. If you have one particular type of application in mind while designing something you risk creating a specialized system that is not suited for other purposes.

If your inner loop code happens to be processing 8-bit or 16-bit values, they just get expanded to 32-bit, just like on a GPU, and just like on a GPU you'll still get a ton of computing power.

Why go for a ton when you can have 4 tons? You can have 16 floats (or 32-bit ints) in a 512 bit register, but you can have 64 8-bit integers with the same register size if it supports 8-bit operations.

Trying to instead process 16 x 32-bit and 32 x 16-bit and 64 x 8-bit breaks down this concept of one loop iteration per lane. Switching the number of things you process in parallel requires lots of data swizzling instructions which rapidly defeats the benefits of processing more elements in the first place.

It is the Knights Corner/Xeon Phi that requires a lot of swizzling. It will convert a vector of 8-bit integers to floats when loading from memory, do all calculations on floats, and convert back to 8-bit integers when saving the results. This is a very inefficient way of using the vector register, it is slower, and it uses more power.

The only difference between AVX-512 and the GPU is that now this computing power will become part of the CPU cores. And that's the really revolutionary part. Using GPUs for generic applications has been notoriously difficult.

I agree that it is good to move the work from the GPU to the CPU. I have never used GPUs for generic applications because of the overhead and compatibility problems. But revolutionary? It is just an extension of register size from 256 to 512 bits. That was certainly expected. Masked/predicated instructions, more registers, and many new instructions. That is a big step forward, and thank you for that, but I wouldn't call it revolutionary :)  But if you are right that people will stop using the GPU then, yes, you might call that revolutionary.

I hate to say this but the time when programming in assembly made you smart or cool is coming to an end. It's not a wise thing to do any more and mostly only compiler writers should concern themselves with it.

I agree. The lowest level that a programmer would deal with today is intrinsic functions. But still, somebody has to make the compiler.

I applaud the trend towards openness. I'm just trying to be realistic and I'm afraid there's still a lot of old fashioned thinking.

That's why we have to put pressure on them and tell them that their customers want openness.

Don't get me wrong, hardware engineers should talk a lot to software engineers, but I'm not sure an open forum is the single most efficient way to exchange information on needs and limitations. It is one end of the communication spectrum, with a lot of noise, and I think the open discussions we have right here are the best interaction with Intel that individual developers can hope for.

This is not interaction, it is one-way communication. Intel engineers may listen but keep silent, for the reasons you mention. The problem is that no matter what excellent ideas we might come up with in this forum, it is inevitably too late to change anything by the time Intel have published a finished and complete plan. They probably have it all implemented in silicon by now. Noise can be dealt with by moderating the forum, but I don't think there is much noise here compared to other forums.

Don't forget that AMD have sued Intel for unfair competition. They might sue again if they can make a credible claim that Intel gets an advantage by dictating a de-facto standard that AMD have to follow with a time lag of several years.

The x86 instruction set encoding is an ugly patchwork generated by a long history of short-sighted solutions and dirty patches.

I agree it's a patchwork and not a pretty one, but technology doesn't have to be pretty (on the inside) to be commercially successful. x86 has been a huge success for multiple decades because it's an extendable patchwork.

I would say it's a success despite being very difficult to extend. If they had left just a few unused code bytes for future extensions, the coding would be much simpler today, and for example SSSE3 instructions would be one byte shorter. The ugly patches in the instruction coding scheme are not seen even by assembly programmers unless you study the details of how bits are coded. I had to do this because I have made a disassembler, but others are happily unaware how complicated it is.

x87 and MMX are still used quite a bit though by legacy software that is still in use. I don't think they're costly to implement, after all they were introduced many years ago and now they only cost a tiny fraction of the transistor budget and it keeps getting cheaper. But their opcode space could be quite valuable.

Some CPUs have an extra pipeline stage just for handling the rolling x87 register stack. An extra pipeline stage means longer branch misprediction delays, also for non-x87 code. The rolling register stack is a completely different paradigm from the rest of the design. I can imagine that this is messing up the pipeline design and calling for a lot of compromises. It might be useful to move all x87 processing to a separate on-chip "coprocessor", but the overlay of mmx registers on x87 registers makes this difficult. It is probably easier to get rid of MMX code than x87 code since most MMX code has been updated to SSE2 anyway.

andysem wrote:

And having to call a function in a SIMD-optimized loop is not that far fetched example

I agree. Intel have an excellent function library called SVML (Short Vector Math Library) with all common mathematical functions in vector form. This library is indeed intended to be called from vectorized code. If we could make one or a few of the mask registers callee-save, then we could use a mask register across a function call, and the leaf function itself would of course use other mask registers that are not callee-save. We are certainly missing an instruction to save a full mask register in a way that is compatible with future extensions.
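A minimal sketch of such a call from vectorized code (assuming the Intel compiler, which exposes SVML through intrinsics such as _mm512_sin_ps; the loop handles only full vectors for brevity):

    #include <immintrin.h>

    // Compute out[i] = sin(in[i]) sixteen floats at a time via SVML.
    void vector_sin(const float* in, float* out, int n) {
        for (int i = 0; i + 16 <= n; i += 16) {
            __m512 v = _mm512_loadu_ps(in + i);
            _mm512_storeu_ps(out + i, _mm512_sin_ps(v));
        }
        // a remainder loop or a masked final iteration would go here
    }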

0 Kudos
capens__nicolas
New Contributor I
3,360 Views

Clothoid wrote:
An important functionality that has to be supported by the ISA to enable this is gather operations that can upconvert 8- or 16-bit data types into 32-bit types, and scatter operations that can downconvert from 32-bit to 8- and 16-bit data types. The current Xeon Phi ISA and the original Larrabee ISA supported this functionality - but I am not sure if this functionality was forgotten in AVX-512.

AVX-512 has down-convert instructions and zero/sign-extend instructions for dealing with packed small data elements. And just like the current Xeon Phi ISA it only has 32-bit and 64-bit gather. Smaller elements could be gathered by masking the upper parts of 32-bit elements, if you allocate some padding. In the worst case you have to fall back to scalar code. But you should really consider rethinking your algorithms if you need lots of byte gathers.
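A minimal sketch of that workaround (the function name is illustrative; it assumes the table has at least three bytes of padding after the last element, since each lane loads a full dword):

    #include <immintrin.h>

    // Gather 16 bytes from arbitrary byte offsets with the 32-bit gather,
    // then mask away the upper 24 bits of each lane.
    __m512i gather_bytes(const unsigned char* base, __m512i byte_indices) {
        __m512i dwords = _mm512_i32gather_epi32(byte_indices, base, 1);
        return _mm512_and_epi32(dwords, _mm512_set1_epi32(0xFF));
    }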

At least when working with the latest Intel compiler 14.0 SP1 Update 1, my impression is that upconversion and downconversion with gather/scatter is missing in AVX512.

As far as I know that has never existed as part of gather/scatter. Anyway, note that there's always the possibility to add a few more instructions. There isn't an architectural or encoding limitation of AVX-512 that stands in the way of it as far as I can tell. It's just a matter of requiring really good use cases that justify it. Of course it would be nice to make the ISA as complete as possible from the get-go, but as we've seen in the past adding too many instructions too early can be a waste if they have no real value.

0 Kudos
capens__nicolas
New Contributor I
3,360 Views

andysem wrote:
I humbly disagree that 8/16 bit operations are not needed. It sounds like "640k will be enough for anyone."

I understand your concern, but please avoid that somewhat straw-man like argument. It is abused or misinterpreted too often.

I didn't say 8/16-bit operations aren't needed at all. I was merely saying that the mask registers don't have to apply to individual packed bytes. You can have vector-within-vector instructions in which the lanes are 32-bit or larger. And for certain packed 8/16-bit operations you can also still use AVX2.
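For example (a minimal sketch; the function name is illustrative), a packed byte operation that stays on AVX2:

    #include <immintrin.h>

    // 32 unsigned saturating byte additions in one 256-bit AVX2 instruction.
    __m256i add_bytes_saturated(__m256i a, __m256i b) {
        return _mm256_adds_epu8(a, b);
    }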

What I also tried to say is that 512-bit instructions that operate on packed 8/16-bit elements might not be possible without sacrificing latency and/or power, and even more importantly may stand in the way of future widening. As proven by GPU computing, most workloads are 32/64-bit so it's more valuable to be able to later extend to 1024-bit than to make sure the full width of your ALUs is used for any data. Unless you want to claim 512-bit is enough for everyone doing 32/64-bit processing?

Perhaps it really is, because GPUs have stopped widening their vectors because otherwise branch divergence would lead to too low efficiency...

Bandwidth has potential for increase (DDR4, Crystal Well) while CPU clocks and CPI seem to be close to their limits. Even with current technologies, I find it hard to believe that multimedia applications won't benefit from wider vectors, given 8/16 bit operations are available. After all, if it's slower than memcpy then chances are high it's computation-bound, and this is usually the case in my (multimedia-related) practice. It is often possible to design your data processing workflow to conserve bandwidth and favor more intermediate calculations. It may not be easy, but this makes it possible to increase performance when wider vectors are available. And having wide vectors without being able to operate on the smaller units just feels like a waste to me.

While DDR4 and L4 cache are nice ways to increase bandwidth, they're only relatively small effective improvements and they don't come for free. These technologies will have to deliver all the bandwidth we need for many years to come. Meanwhile computational power increases at a faster pace, as it always has. Core count is virtually unlimited, while DDR5 and an L5 cache would suffer even more from diminishing returns. GPUs already suffer from this quite badly. The historic rule of 1 byte of RAM bandwidth per FLOP is no longer achieved, and there will be plenty of room for more compute cores on new process nodes, which makes it worse.

AVX-512 and its successors will eliminate the GPU but inherit its issues. In a sense it's of course great that many workloads won't be arithmetic limited any more. So my point is that you shouldn't be too concerned about these 8/16-bit element workloads, especially with vector-within-vector instructions for the most common cases.

I'm not sure clock gating actually brings any benefit in this case. You either use masks+xmm registers, or only xmm, or none of them (in the case of generic x86 code). Clock gating only saves power in the second case, relative to the first one. But if you use xmm registers as masks you actually make the first case equivalent to the second one and thus more power efficient.

It's really the opposite. The k mask registers allow clock gating per scalar element in the vector. It's masking out calculations for branches of code that those elements are logically not executing, while SIMD forces them to stay in lock-step with the elements (aka. strands/lanes) that do take those code paths.

Not having the mask registers makes it impractical to implement this clock gating. Care to elaborate why you think not having them would save more power through clock gating instead?

Not sure how much complication it brings but it requires more from the register allocator and dependency tracker in the compiler. And the compiler may have to spill/restore mask registers in contexts other than function calls. For example, the algorithm may require more masks than mask registers available in the CPU.

It doesn't require "more" from the register allocator at all. It is completely orthogonal and can be used opportunistically. It would have made things harder if the mask registers had to be shared in some way, but you get them in addition to all the rest.

Yes, some code could use more mask registers than available, but how is this worse than when having zero mask registers?

And having to call a function in a SIMD-optimized loop is not that far-fetched an example, actually. For instance, I once implemented some string formatting routines optimized for SSE/AVX, which had to call a function to store the partial results in the STL stream. Naturally, I had to vzeroall before the call in the case of AVX, and in the case of SSE the compiler generated xmm register spills/restores around it. You may argue that this is not a good algorithm for SIMD optimization, but I think otherwise. As long as it's faster than the scalar version, it is a good candidate.

Something being just a little faster than the scalar version does not justify breaking AVX-512's programming model. It aims to parallelize 32-bit code 16-fold. It seems that you're still stuck thinking in the old CPU vector instruction paradigms. The only way to understand AVX-512, and realize what you shouldn't try to force it to be, is to consider it the unification of GPU technology into the CPU cores.
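For reference, the spill pattern andysem describes looks roughly like this with intrinsics (a sketch; flush_partial and the other names are hypothetical):

    #include <immintrin.h>
    #include <cstddef>

    void flush_partial(const char* buf, std::size_t len);  // hypothetical callee

    // AVX loop code that must call an ABI-compliant function: the upper ymm
    // state is cleared before the call to avoid SSE/AVX transition penalties,
    // and any live vector values have to be spilled and reloaded around it.
    void format_chunk(__m256i digits, const char* buf, std::size_t len) {
        // ... produce formatted bytes from `digits` ...
        _mm256_zeroupper();       // or _mm256_zeroall(), as in the AVX case above
        flush_partial(buf, len);  // the callee may use non-VEX SSE code
        // ... recompute or reload any needed ymm values here ...
    }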

0 Kudos
andysem
New Contributor III
3,362 Views

c0d1f1ed wrote:

I understand your concern, but please avoid that somewhat straw-man like argument. It is abused or misinterpreted too often.

I'm sorry but I don't think I'm making a straw-man argument. I really got the impression you're making the point that 8/16-bit operations are not needed (in AVX-512). Perhaps these words made me think so:

c0d1f1ed wrote:

So I see no compelling reason why AVX-512 would need instructions and mask registers for 8-bit or 16-bit fields.

In any case, no offense was meant on my side.

c0d1f1ed wrote:

What I also tried to say is that 512-bit instructions that operate on packed 8/16-bit elements might not be possible without sacrificing latency and/or power, and even more importantly may stand in the way of future widening. As proven by GPU computing, most workloads are 32/64-bit so it's more valuable to be able to later extend to 1024-bit than to make sure the full width of your ALUs is used for any data.

I don't have the data to judge which workloads (on smaller or wider units) are more widespread, but I don't think that GPU computing makes a strong case here. GPUs are known to have a narrower scope of applications than CPU instruction extensions. There are multiple reasons for that, and I suspect that architectural limitations are not the least of them.

I won't argue that support for 8/16-bit elements is free. I'm just saying that this support is needed for a certain (and quite significant, IMHO) amount of applications. You also said that ALUs are cheap, so it seems there shouldn't be much of a problem putting them into silicon to support smaller elements for the benefit of those applications.

However, I could understand if support for smaller elements was too big a step to make for the first implementation of AVX-512, so that the support is added later, in AVX-512-2 or whatever its name will be. But in order for this to be possible, mask registers should also have a clear path of extension.

Saying that AVX2 is enough for 8/16-bit operations and that AVX-512 and later instruction sets are for larger units only is what I disagree with, and what I was comparing to the "640k will be enough for anyone" saying.

c0d1f1ed wrote:

While DDR4 and L4 cache are nice ways to increase bandwidth, they're only relatively small effective improvements and they don't come for free. These technologies will have to deliver all the bandwidth we need for many years to come. Meanwhile computational power increases at a faster pace, as it always has. Core count is virtually unlimited, while DDR5 and an L5 cache would suffer even more from diminishing returns. GPUs already suffer from this quite badly. The historic rule of 1 byte of RAM bandwidth per FLOP is no longer achieved, and there will be plenty of room for more compute cores on new process nodes, which makes it worse.

Yes, increasing the number of cores is a way of increasing computational power, but it is also known that not all workloads can benefit from this kind of parallelism. I think this is the main reason we're mostly using 2-4 core CPUs now (sans supercomputers). I consider single thread performance just as important as the number of hardware threads.

c0d1f1ed wrote:

The k mask registers allow clock gating per scalar element in the vector. It's masking out calculations for branches of code that those elements are logically not executing, while SIMD forces them to stay in lock-step with the elements (aka. strands/lanes) that do take those code paths.

Not having the mask registers makes it impractical to implement this clock gating.

I don't have the background in designing CPUs, so my point of view is probably quite naive. I don't understand what you meant by "SIMD forces them to stay in lock-step with the elements (aka. strands/lanes) that do take those code paths." Why is it not practical to perform clock gating based on bits from xmm registers instead of k registers?

c0d1f1ed wrote:

Care to elaborate why you think not having them would save more power through clock gating instead?

By removing k registers you also remove all power consumption associated with them, don't you?

c0d1f1ed wrote:

Yes, some code could use more mask registers than available, but how is this worse than when having zero mask registers?

My point was that there is need for the operations to save and restore the registers. If xmm registers are used as masks then there already are such operations and the point is moot.
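A minimal sketch of what I mean, using existing SSE4.1 instructions (the function name is illustrative):

    #include <immintrin.h>

    // An xmm register serves as the mask: cmpps produces all-ones or
    // all-zeros per lane, blendv applies it, and an ordinary movaps
    // spill/restore saves it like any other xmm value.
    __m128 select_max(__m128 a, __m128 b) {
        __m128 mask = _mm_cmpgt_ps(a, b);   // lane-wise: is a > b?
        return _mm_blendv_ps(b, a, mask);   // picks a where the mask is set
    }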

c0d1f1ed wrote:

Something being just a little faster than the scalar version does not justify breaking AVX-512's programming model. It aims to parallelize 32-bit code 16-fold. It seems that you're still stuck thinking in the old CPU vector instruction paradigms. The only way to understand AVX-512, and realize what you shouldn't try to force it to be, is to consider it the unification of GPU technology into the CPU cores.

"A little faster" was somewhat an order of magnitude in terms of data throughput, IIRC, so it was quite significant. Maybe I'm thinking old ways, but I apply the suggested architectural improvements to the cases I have at hand and see that it doesn't apply that well.

0 Kudos
Bernard
Valued Contributor I
3,362 Views

c0d1f1ed wrote:

The k mask registers allow clock gating per scalar element in the vector. It's masking out calculations for branches of code that those elements are logically not executing, while SIMD forces them to stay in lock-step with the elements (aka. strands/lanes) that do take those code paths.

Not having the mask registers makes it impractical to implement this clock gating.

As far as I understand, adding mask registers at the hardware level (in the register file), instead of using xmm/ymm bits, could force a redesign of the control signals (whether vertically or horizontally encoded) to include the additional mask register bits, and could increase wire latency (delay) when accessing those bits, thus consuming more energy within the minuscule interval of one clock.

0 Kudos
bronxzv
New Contributor II
3,361 Views

andysem wrote:

By removing k registers you also remove all power consumption associated with them, don't you?

the k logical registers will most probably map to the same physical register file as the GPRs (*1), thus "removing" the k registers will not save power, since the register files will actually be kept unchanged. On the other hand, doing the logical operations on 512-bit registers (as defined in the OP's proposal, if I got it right) instead of on 8-bit/16-bit masks will arguably waste more power. As you probably know, a classical use case with masks is to compute a series of masks with compare instructions, then AND or OR them together, and only use the final mask for actual masking (sketched below). It is quite common to have more instructions computing the masks than instructions using them, so keeping the k logical registers at the smallest useful width is arguably a sensible choice: you move around and compute up to 64x fewer bits (with packed doubles) than with full-width zmm registers.

*1: in the initial AVX-512 products; in the future they may instead map to the physical vector register file (just as the x87/xmm/ymm/zmm logical registers map to a common physical register file), allowing for example 128-bit masks for AVX-1024 with 8-bit elements. It would then just be a matter of defining the new maximum width of the k registers as 128 bits and providing the necessary spill/fill instructions.
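A minimal sketch of that compute-then-combine pattern (the function name is illustrative):

    #include <immintrin.h>

    // Two compares produce 16-bit masks, a k-register AND combines them,
    // and only the final mask is applied; the intermediate mask traffic
    // is 16 bits per vector rather than 512.
    __m512 keep_in_range(__m512 v, __m512 lo, __m512 hi, __m512 fallback) {
        __mmask16 ge = _mm512_cmp_ps_mask(v, lo, _CMP_GE_OS);
        __mmask16 le = _mm512_cmp_ps_mask(v, hi, _CMP_LE_OS);
        __mmask16 in_range = _mm512_kand(ge, le);
        return _mm512_mask_mov_ps(fallback, in_range, v);
    }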


0 Kudos
capens__nicolas
New Contributor I
3,361 Views

Agner wrote:

c0d1f1ed wrote:

But you shouldn't forget where the design ideas originated from, as they determine the intended manner in which they should be used (but not limit what they should be used for)!

This is the kind of thinking I am warning about. If you have one particular type of application in mind while designing something you risk creating a specialized system that is not suited for other purposes.

You talked about x86 being a patchwork due to mistakes made in the past. I'm afraid that in many cases trying to design something with multiple computing paradigms in mind is exactly what led to such mistakes. Only a few paradigms survive the test of time and lead to an elegant, scalable architecture. I say paradigm instead of "application" because my interpretation of AVX-512 as the unification of GPU technology into the CPU is absolutely not limited to one type of application.

GPU-style parallel computing is a powerful 'new' computing paradigm being introduced into the CPU. But trying to run more than 16 maskable strands in a 512-bit SIMD unit is in my opinion a mistake which leads to bad scalability. So with all due respect, I find it quite ironic that you're being so hard on Intel for creating a patchwork with mistakes that lead to overhead, while you are potentially suggesting something that would be regarded as a problem in the future. GPUs have 32-bit lanes, for very good reasons, even though some of the data is 16- or 8-bit, so why would you ignore that altogether?

I'm not claiming I have all the right answers here. In fact, my point is that nobody has them. Only hindsight is 20/20, so it's easy to identify past mistakes, but right now we're dealing with the very same kind of unknowns that the Intel engineers were facing when they made the past mistakes that make x86 less than optimal. So I'm just humbly asking to leave that bias behind so we can focus our discussion on the technical issue and hopefully help avoid mistakes.

Quote:

If your inner loop code happens to be processing 8-bit or 16-bit values, they just get expanded to 32-bit, just like on a GPU, and just like on a GPU you'll still get a ton of computing power.

Why go for a ton when you can have 4 tons? You can have 16 floats (or 32-bit ints) in a 512 bit register, but you can have 64 8-bit integers with the same register size if it supports 8-bit operations.

Because being 'greedy' like that seldom produces the best results in the long run. Yes, you can have 4x higher performance for 8-bit, in this time frame. But 20 years from now we'll have CPUs with more cores and wider vectors than memory bandwidth and power consumption allow to be active all at the same time. So computing resources are not the issue. Power and bandwidth are. Of course 8-bit vector instructions can save power for 8-bit vectorizable workloads, but you have to look at the bigger picture. What's the impact on scalar and 32-bit vector workloads? And why not take the best of both worlds with vector-within-vector instructions? What kind of use cases do you have that are worth risking the scalability that GPUs have?

Quote:

Trying to instead process 16 x 32-bit and 32 x 16-bit and 64 x 8-bit breaks down this concept of one loop iteration per lane. Switching the number of things you process in parallel requires lots of data swizzling instructions which rapidly defeats the benefits of processing more elements in the first place.

It is the Knights Corner/Xeon Phi that requires a lot of swizzling. It will convert a vector of 8-bit integers to floats when loading from memory, do all calculations on floats, and convert back to 8-bit integers when saving the results. This is a very inefficient way of using the vector register, it is slower, and it uses more power.

As the master of analyzing x86 architectures, could you try and measure the exact power consumption difference between using 8-bit and 32-bit vector instructions (same number of elements, i.e. MMX vs. AVX2)?

I don't think the difference will be all that big. And again, look at the bigger picture of what the impact would be to have 16-bit and 8-bit variants of all 512-bit vector instructions, as well as masking for these smaller elements. How much perf/Watt for 32-bit workloads are you willing to sacrifice for that? And I don't mean to sound like a broken record, but why can't vector-within-vector instructions be a near-perfect compromise?

Quote:

The only difference between AVX-512 and the GPU is that now this computing power will becomes part of the CPU cores. And that's the really revolutionary part. Using GPUs for generic applications has been notoriously difficult.

I agree that it is good to move the work from the GPU to the CPU. I have never used GPUs for generic applications because of the overhead and compatibility problems. But revolutionary? It is just an extension of register size from 256 to 512 bits. That was certainly expected. Masked/predicated instructions, more registers, and many new instructions. That is a big step forward, and thank you for that, but I wouldn't call it revolutionary :)  But if you are right that people will stop using the GPU then, yes, you might call that revolutionary.

I think you're vastly underestimating the possibilities. Almost any loop with independent iterations can be vectorized. That's most loops in compute-intensive applications, and the hotspots are always in the loops. None of today's mainstream compilers vectorize loops this way, so it's going to be a huge paradigm shift when suddenly 16 iterations are executed per instruction. AMD is putting a crazy amount of effort (relative to their resources) into trying to create a heterogeneous architecture in which the GPU can be reasonably used by generic applications. They do this because the gains are so huge. Unfortunately for them heterogeneous computing is inherently developer-unfriendly.

Mark my words, this paradigm shift will revolutionize the meaning of computing. It has revolutionized graphics computing within GPUs, but we have yet to see the full extent to which it can and will impact generic computing on the CPU. There will be new applications that are simply not commercially feasible today. You're right that it's 'only' a doubling from 256-bit to 512-bit, but how many applications do you know which use 256-bit SIMD vectorization of loops? AVX2 requires intrinsics, which takes a lot of hours and skilled engineers (read: expensive). AVX-512 makes it feasible to let the compiler do it for you. We'll basically go from scalar loops to 16x parallelized loops. I don't know any word other than revolutionary to describe that.
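To make that concrete, here is the kind of plain scalar loop that becomes 16-wide vector code under this model, conditional included (a sketch; no intrinsics involved):

    // Each iteration maps to one 32-bit lane; the if becomes a k-register
    // compare mask and the body a masked vector multiply-add.
    void scale_positive(float* x, const float* y, int n) {
        for (int i = 0; i < n; ++i) {
            if (y[i] > 0.0f)
                x[i] += 2.0f * y[i];
        }
        // the loop remainder can be a single masked iteration rather
        // than a scalar tail
    }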

Quote:

I agree it's a patchwork and not a pretty one, but technology doesn't have to be pretty (on the inside) to be commercially successful. x86 has been a huge success for multiple decades because it's an extendable patchwork.

I would say it's a success despite being very difficult to extend. If they had left just a few unused opcode bytes for future extensions, the coding would be much simpler today, and for example SSSE3 instructions would be one byte shorter. The ugly patches in the instruction coding scheme are not seen even by assembly programmers unless you study the details of how the bits are coded. I had to do this because I have made a disassembler, but others are happily unaware of how complicated it is.

You need a bit of economic perspective here. That's very little effort in the big picture. I've written an assembler as well, and I've fixed encoding bugs in LLVM. Every new extension caused me another week of work, tops. But that saves the millions of other developers who use this software from doing that same work. It's never a trivial task, but a 'cleaner' ISA would make no significant difference overall. And heck, LLVM's assembly encoding is in just a couple of files, while there are thousands of files for everything else within the project.

So ISA complexity or ugliness is pretty meaningless. It's merely a challenge for a handful of well-paid engineers who have great job security. Nobody cares. It's what the ISA achieves that really matters to the rest of the world. So I stand by my point: x86 is successful because it's extendable, no matter how challenging it is.

Quote:

x87 and MMX are still used quite a bit by legacy software, though. I don't think they're costly to implement; after all, they were introduced many years ago, so by now they only cost a tiny fraction of the transistor budget, and it keeps getting cheaper. But their opcode space could be quite valuable.

Some CPUs have an extra pipeline stage just for handling the rolling x87 register stack. An extra pipeline stage means longer branch misprediction delays, also for non-x87 code. The rolling register stack is a completely different paradigm from the rest of the design.

Exactly. This is an example of trying to support another paradigm, solely for the ease of implementing a legacy expression evaluator, which ended up not being worth it in the longer run by not looking at the longer term picture. You'd be making the same kind of mistake when adding 8-bit maskable lanes to AVX-512 for short-term reasons. It doesn't mesh well with the more valuable and scalable paradigm of executing multiple scalar iterations of a loop in parallel. If you want 8-bit elements, make them part of vector-within-vector instructions and leave the 32-bit lanes alone.

0 Kudos
bronxzv
New Contributor II
3,359 Views

c0d1f1ed wrote:

AVX2 requires intrinsics, which takes a lot of hours and skilled engineers (read: expensive). AVX-512 makes it feasible to let the compiler to do it for you.

I'll be glad to learn which feature(s) of AVX-512 missing from AVX2 make it feasible to use the auto-vectorizer with AVX-512 but not with AVX2.

0 Kudos
capens__nicolas
New Contributor I
2,655 Views

andysem wrote:

I'm sorry but I don't think I'm making a straw-man argument. I really got the impression you're making the point that 8/16-bit operations are not needed (in AVX-512). Perhaps these words made me think so:

Quote:

c0d1f1ed wrote:

So I see no compelling reason why AVX-512 would need instructions and mask registers for 8-bit or 16-bit fields.

In any case, no offense was meant on my side.

And none was taken. I just think it's "cheap" to use the 640k argument without detailed argumentation. I mean, anything someone thinks would be a waste and would hamper future extensions could be attacked using the 640k argument. But it's a bad argument. GPUs have scaled performance at an incredible rate, despite processing 8-bit values in 32-bit lanes. So that goes squarely against the 640k argument.

Look, with scalar code nobody expects 8-bit arithmetic to be faster than 32-bit. Why would it have to be faster with SIMD parallelism? Except for vector-within-vector, it makes the hardware considerably more complex to have many lane widths.

I don't have the data to judge which workloads (on smaller or wider units) are more widespread, but I don't think that GPU computing makes a strong case here. GPUs are known to have a narrower scope of applications than CPU instruction extensions. There are multiple reasons for that, and I suspect that architectural limitations are not the least of them.

GPUs have a narrow scope of applications because they have to be programmed heterogeneously, and because they have low single-threaded performance. AVX-512 has neither of those issues. 8-bit and 16-bit vector instructions that are not vector-within-vector are not going to help make the scope any wider. If you think otherwise, please sum up some applications that would benefit from them. And please don't use the 640k anti-argument.

I won't argue that support for 8/16-bit elements is free. I'm just saying that this support is needed for a certain (and quite significant, IMHO) amount of applications. You also said that ALUs are cheap, so it seems there shouldn't be much of a problem putting them into silicon to support smaller elements for the benefit of those applications.

The problem isn't the ALUs. The problem is the masks. I am suggesting vector-within-vector instructions, which adds a tiny amount of ALU complexity, but keeps the masking simple.

However, I could understand if support for smaller elements was too big a step to make for the first implementation of AVX-512, so that the support is added later, in AVX-512-2 or whatever its name will be. But in order for this to be possible, mask registers should also have a clear path of extension.

I asked Agner this same question: why would you want a different number of lanes for 32-bit, 16-bit and 8-bit elements? It breaks the paradigm of one loop iteration per lane, and it takes many shuffle instructions to switch between them. What's wrong with just keeping 8-bit data in 32-bit lanes, or using vector-within-vector instructions?
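A minimal sketch of the 8-bit-data-in-32-bit-lanes approach (the function name is illustrative):

    #include <immintrin.h>

    // Zero-extend 16 bytes into 16 dword lanes, compute at full lane width
    // (where the k masks apply naturally), then narrow back with saturation.
    __m128i add_bias(__m128i bytes, int bias) {
        __m512i w = _mm512_cvtepu8_epi32(bytes);
        w = _mm512_add_epi32(w, _mm512_set1_epi32(bias));
        return _mm512_cvtusepi32_epi8(w);   // unsigned saturating narrow
    }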

I don't have the background in designing CPUs, so my point of view is probably quite naive. I don't understand what you meant by "SIMD forces them to stay in lock-step with the elements (aka. strands/lanes) that do take those code paths." Why is it not practical to perform clock gating based on bits from xmm registers instead of k registers?

If you vectorize a loop which contains conditional statements, you need to execute all the paths that any of the elements of the vector are taking, and then blend the results together. But you're computing certain elements of the vectors that you're throwing away. The worst part of that is the wasted power.

The mask registers not only do the blending in the same instruction, they also allow clock-gating the lanes whose results are thrown away anyway. This isn't possible with xmm registers because you can't read that many operands from the same register file per cycle at acceptable power consumption. You'd also need forwarding paths from the low 128 bits to the entire 512-bit width (or more), from every output to every input, splitting the bits up to each 8/16/32/64-bit element. That's not desirable either. With dedicated mask registers for 32/64-bit elements only, it gets much simpler.

Quote:

c0d1f1ed wrote:

Yes, some code could use more mask registers than available, but how is this worse than when having zero mask registers?

My point was that there is need for the operations to save and restore the registers. If xmm registers are used as masks then there already are such operations and the point is moot.

So your concern is the need for extra instructions when using dedicated mask registers? That's really not an issue compared to the major difficulties if vector registers were to be used as compact masks.

Quote:

c0d1f1ed wrote:

Something being just a little faster than the scalar version does not justify breaking AVX-512's programming model. It aims to parallelize 32-bit code 16-fold. It seems that you're still stuck thinking in the old CPU vector instruction paradigms. The only way to understand AVX-512, and realize what you shouldn't try to force it to be, is to consider it the unification of GPU technology into the CPU cores.

"A little faster" was somewhat an order of magnitude in terms of data throughput, IIRC, so it was quite significant. Maybe I'm thinking old ways, but I apply the suggested architectural improvements to the cases I have at hand and see that it doesn't apply that well.

I'm sure that wasn't an order of magnitude compared to using half the vector width. With AVX-512 we have the opportunity for 16x parallelization of a lot more code, with the potential for 32x in the future. That's way more valuable than doubling the performance of the AVX2 code you already have and get to keep. I mean, it's a simple choice: do you want 2x more but only for 8-bit data processing, or do you want 16x for a ton of applications? And again, vector-within-vector instructions can help you get that 2x or even 4x on top of that for 16-bit and 8-bit data respectively. The AVX-512 foundation seems very suitable to be extended that way, without needing any changes to how the mask registers work.

0 Kudos
Reply