AFog0 · ‎10-11-2013

AVX512 is arguably the biggest step yet in the evolution of the x86 instruction set in terms of new instructions, new registers and new features. The first try was the Knights Corner instruction set. It had some problems and AVX512 is better, so I am quite happy that AVX512 seems to be replacing the Knights Corner instruction set. But there are still some shortsighted issues that are likely to cause problems for later extensions.

We have to learn from history. When the 64-bit mmx registers were replaced by the 128-bit xmm registers, nobody thought about preparing for the predictable next extension. The consequence of this lack of foresight is that we now have the complication of two versions of all xmm instructions and three states of the ymm register file. We have to issue a vzeroupper instruction before every call and return to ABI-compliant functions, or alternatively make two versions of all library functions, with and without VEX.

Such lack of foresight can be disastrous. Unfortunately, it looks like the AVX512 design is similarly lacking foresight. I want to point out two issues here that are particularly problematic:

AVX512 does not provide for clean extensions of the mask registers
The overloading of the register extension bits will mess up possible future expansions of the general purpose register space

First the new mask registers, k0 - k7. The manual says that these registers are 64 bits, yet there is no instruction to read or write more than 16 bits of a mask register. Thus, there is no way of saving and restoring a mask register that is compatible with the expected future extension to 64 bits. If it is decided to give some of the mask registers callee-save status, then there is no way of saving and restoring all 64 bits. We will be saving/restoring only 16 bits and zeroing the rest. Likewise, if an interrupt handler or device driver needs to use a mask register, it has no way of saving and restoring the full mask register short of saving the entire register file, which costs hundreds of clock cycles.
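The save/restore problem can be illustrated with a toy model. This is only a sketch under stated assumptions: `save_mask16` and `restore_mask16` are hypothetical stand-ins for a 16-bit mask move, and `caller_mask` stands for a value written by code built for a hypothetical wider extension.

```python
MASK16 = 0xFFFF


def save_mask16(k: int) -> int:
    """Model a 16-bit mask store: only the low 16 bits reach memory."""
    return k & MASK16


def restore_mask16(saved: int) -> int:
    """Model a 16-bit mask load: low 16 bits restored, upper bits zeroed."""
    return saved & MASK16


# Hypothetical 64-bit mask set up by a caller built for a future extension.
caller_mask = 0xDEADBEEF00001234

# A callee built today saves and restores the register with 16-bit moves.
after_call = restore_mask16(save_mask16(caller_mask))

print(hex(caller_mask), "->", hex(after_call))
```

In this model the upper 48 bits of the caller's mask are lost across the call, which is the callee-save problem described above.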

It is planned that the mask registers can grow to 64 bits, but not more, because they have to match the general purpose registers. Yet, we can predict already now that 64 bits will be insufficient within a few years. There seem to be plans to extend the vector registers to 1024 bits. Whether they should be extended further has perhaps not been decided yet (these extensions are certainly subject to diminishing returns). People are already now asking for an addition to AVX512 to support vector operations on 8-bit and 16-bit integers. A 1024-bit vector of 8-bit integers will require mask registers of 128 bits. There are apparently no plans for how the mask registers can be extended beyond 64 bits, so we will be needing another clumsy patch at that time.
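The arithmetic behind the 128-bit figure is one mask bit per vector element. A minimal sketch, where the function name `mask_bits` is only illustrative and the 1024-bit width is the speculative extension mentioned above:

```python
def mask_bits(vector_bits: int, element_bits: int) -> int:
    """One predicate bit is needed per element of the vector."""
    if vector_bits % element_bits != 0:
        raise ValueError("element width must divide the vector width")
    return vector_bits // element_bits


# Tabulate the mask width for current and speculative vector sizes.
for vector_bits in (512, 1024):
    for element_bits in (64, 32, 16, 8):
        needed = mask_bits(vector_bits, element_bits)
        fits = "fits" if needed <= 64 else "does not fit"
        print(f"{vector_bits}-bit vector, {element_bits}-bit elements: "
              f"{needed} mask bits ({fits} in a 64-bit register)")
```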

Let me suggest a simple solution to this problem: Drop the mask registers and allow 8 of the vector registers to be used as mask registers. Then we can be certain that the registers used for masks will never become too small because a mask will always need fewer bits than the vector it is masking. We have 32 vector registers now, so we can certainly afford to use a few of them as mask registers. I think, generally, that it is bad to have many different register types. It delays task switching, it makes the ABI more complicated, it makes compilers more complicated, and it fills up the already crowded opcode space with similar instructions for different register types. The new instructions for manipulating mask registers will not be needed when we use xmm registers for masks, because the xmm instructions provide most of this functionality already, and much more.

So let me propose: Drop the new mask registers and the instructions for manipulating them. Allow seven of the vector registers (e.g. xmm1 - xmm7 or xmm25 - xmm31) to be used as mask registers. All mask functionality will be the same as currently specified by AVX512. This will make future extensions problem-free and allow the synergy of using the same instructions for manipulating vectors and manipulating masks.

The second issue I want to point out relates to doubling the number of registers. AVX512 doubles the number of vector registers from 16 to 32 in 64-bit mode. It is natural to ask whether the number of general purpose registers can also be doubled. In fact, it can, though this will be a little complicated. I have posted a comment on Intel's blog with a possible technical solution. I am not convinced that more general purpose registers will give a significant improvement in performance, but it is quite possible that we will need more registers in the future, perhaps for purposes that don't exist today. We should keep this in mind and keep the possibility open for having 32 general purpose registers in a future extension. Unfortunately, AVX512 is messing up this possibility by overloading the register extension bits. The X bit is reused for extending the B bit, and the V' bit is reused for extending the X bit. This is a patch that fits only a very narrow purpose. It will be a mess if these bits are needed in future extensions for their original purpose. We need two more bits (B' and X') to make a clean extension of the register space. We can easily get one more bit by extending the 0x62 prefix byte into 0x60 and use bit 1 of the 60/62 prefix as e.g. register extension bit B'. The byte 0x60 is only vacant in 64-bit mode, but we don't need the register extension bit in 32-bit mode anyway. The bit that distinguishes AVX512 instructions from Knights Corner instructions can be used as the X' register extension bit. No CPU will ever be able to run both instruction sets, so we don't need this bit anyway.

There are other less attractive solutions in case the Knights Corner bit cannot be used, but anyway I think it is important to keep the possibility open for future extensions of the register space instead of messing up everything with short-sighted patches.

I will repeat what I have argued before, that instruction set extensions should be discussed in an open forum before they are implemented. This is the best way to prevent lapses and short-sighted decisions like these ones.

capens__nicolas · ‎11-08-2013

bronxzv wrote:

Quote:

c0d1f1ed wrote:
AVX2 requires intrinsics, which takes a lot of hours and skilled engineers (read: expensive). AVX-512 makes it feasible to let the compiler do it for you.

I'll be glad to learn which feature(s) of AVX-512 that are missing from AVX2 make it feasible to use the auto-vectorizer with AVX-512 but not AVX2

AVX2 is a great leap forward (vector-vector shift was long missing), but compilers still have trouble using it for auto-vectorization because not every scalar operation within a loop translates to a single equally fast vector instruction. In particular, AVX2 lacks a fast gather implementation, and loops with branches are unlikely to be vectorized due to having to execute multiple paths that burn full power and requiring blend instructions. Dedicated mask registers could make a marked difference. Think about the case where a vectorized loop has only one element to process. AVX-512 code wouldn't be so bad for this worst case.

It also doesn't hurt to have the potential for 16x parallelization of 32-bit code instead of 'only' 8x. On top of the other features it should make compilers a lot more confident that vectorizing a candidate loop will result in faster execution.

bronxzv · ‎11-09-2013

c0d1f1ed wrote:
AVX2 is a great leap forward (vector-vector shift was long missing), but compilers still have trouble using it for auto-vectorization

the auto vectorizer in the Intel compiler does a good job already with AVX2 IMHO so your comment about programmers being required to use intrinsics is bogus, if you were right it would mean that we would have to wait for AVX-512 enabled CPUs before using the autovectorizer or the CILK+ array notation, this is wrong and is potentially very misleading for newbies reading your post(s)

btw the latest Intel compiler features both AVX2 and AVX-512 targets, if you are right it should be easy to find plenty of examples that vectorize well for AVX-512 but not for AVX2, I'm afraid you'll not find a lot of examples besides the ones where scatter instructions are needed, on the other hand all the cases with 8-bit or 16-bit elements are typically vectorized with AVX2, in fact the best is to use a blend of AVX/AVX2/AVX-512 in the general case so it makes not much sense IMHO to say that one ISA is easier to auto-vectorize for than the other

c0d1f1ed wrote:
AVX2 lacks a fast gather implementation,

as you know this is implementation dependent, not ISA dependent, by the time CPUs with AVX-512 ship, AVX2 gather will be faster than it is in HSW, most probably at the same speed as AVX-512 since they will use common hardware for gather

anyway, here again, it is easy to test with current compilers and as a matter of fact today's Intel compiler (both autovectorizer and CILK+ array notation) generates vectorized code using AVX2 gather instructions

c0d1f1ed wrote:
and loops with branches are unlikely to be vectorized

they are typically vectorized using VBLENDVPS for branch elimination, btw a programmer will not be able to do much better using intrinsics, the fact that AVX-512 will allow a more power efficient implementation will be a welcomed enhancement but is not relevant to the fact that the current autovectorizers can do the job already for today's targets without the programmer having to use intrinsics, this leads already today to much faster and more energy efficient (J / work unit) code when compared to a scalar code equivalent
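The two approaches under discussion can be modelled in a few lines. This is a rough sketch, not compiler output: the loop body, the vector width of 8 and the `lane_ops` counter (a crude proxy for wasted work, not a power measurement) are all assumptions chosen for illustration.

```python
def scalar_loop(xs):
    """Reference loop with a branch in the body."""
    return [x * 2 if x > 0 else x - 1 for x in xs]


def blend_loop(xs, width=8):
    """Blend-style vectorization: both paths run on every lane, then select."""
    out, lane_ops = [], 0
    for i in range(0, len(xs), width):
        v = xs[i:i + width]
        mask = [x > 0 for x in v]
        then_v = [x * 2 for x in v]   # computed for all lanes
        else_v = [x - 1 for x in v]   # computed for all lanes
        lane_ops += 2 * len(v)
        out.extend(t if m else e for m, t, e in zip(mask, then_v, else_v))
    return out, lane_ops


def masked_loop(xs, width=8):
    """Mask-style vectorization: each path only touches its active lanes."""
    out, lane_ops = [], 0
    for i in range(0, len(xs), width):
        v = xs[i:i + width]
        mask = [x > 0 for x in v]
        result = list(v)
        for j, active in enumerate(mask):
            result[j] = v[j] * 2 if active else v[j] - 1
            lane_ops += 1             # one path evaluated per lane
        out.extend(result)
    return out, lane_ops


data = [3, -1, 0, 7, -5, 2, 9, -8, 4, -2]
blend_out, blend_ops = blend_loop(data)
masked_out, masked_ops = masked_loop(data)
print(blend_out == scalar_loop(data), blend_ops, masked_ops)
```

Both variants produce the scalar result; in this model they differ only in how many lane evaluations are thrown away, which matches the point that blending already works today while masking mainly helps efficiency.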

andysem · ‎11-09-2013

bronxzv wrote:

Quote:

andysem wrote:
By removing k registers you also remove all power consumption associated with them, don't you?

the k logical registers will most probably map to the same physical register file as the GPRs (*1) thus "removing" the k registers will not save power since the register files will actually be kept unchanged, on the other hand doing the logical operations on 512-bit registers (as defined in the OP's proposal if I got it right) instead of 8-bit/16-bit masks will arguably use more power in pure waste, as you probably know a classical use case with masks is to compute a series of masks with compare instructions, then to AND or OR them together before using only the final mask for actual masking, it's quite common to have more instructions for computing the masks than instructions using them so having the k logical registers of the smallest useful width is arguably a sensible choice to save power by moving around and computing up to 64x fewer bits (with packed doubles) than with full width zmm registers

*1: in the initial AVX-512 products, but maybe will map to the physical vector register file (as x87/xmm/ymm/zmm logical registers map to a common physical register file) in the future allowing for example 128-bit masks for AVX-1024 with 8-bit elements, it will be just a matter to define the new max width for k registers as 128-bit and to provide the necessary spill/fill instructions
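The use case described in the quote (several compares, mask logic, one final masked operation) can be sketched with plain integers standing in for k registers. The helper names `cmp_gt`, `cmp_lt` and `masked_add` are hypothetical, and the 8-lane vector models 8 packed doubles in a 512-bit register.

```python
def cmp_gt(v, threshold):
    """Compare each lane against a threshold; return one mask bit per lane."""
    k = 0
    for i, x in enumerate(v):
        if x > threshold:
            k |= 1 << i
    return k


def cmp_lt(v, threshold):
    """Same as cmp_gt but for the less-than predicate."""
    k = 0
    for i, x in enumerate(v):
        if x < threshold:
            k |= 1 << i
    return k


def masked_add(dst, a, b, k):
    """Write a[i] + b[i] where mask bit i is set, keep dst[i] elsewhere."""
    return [x + y if (k >> i) & 1 else d
            for i, (d, x, y) in enumerate(zip(dst, a, b))]


a = [float(i) for i in range(8)]   # 8 packed doubles = 512 bits of data
k1 = cmp_gt(a, 1.5)                # lanes 2..7
k2 = cmp_lt(a, 5.5)                # lanes 0..5
k = k1 & k2                        # mask logic touches 8 bits, not 512
result = masked_add([0.0] * 8, a, a, k)

vector_bits, mask_width = 512, 8
print(bin(k), result, vector_bits // mask_width)
```

Only the final `masked_add` touches vector data; the two compares feed an AND on 8-bit values, which is where the "up to 64x fewer bits" figure for packed doubles comes from.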

You may be right in that using xmm registers for masks could be more power consuming, although reusing GPR file would probably have a negative effect on register renaming. If GPR file is increased to accommodate the new registers then its power use is increased as well. And there is another thing to consider. In vectorized code the majority of instructions don't even touch general purpose registers (except for load and store operations and the parts that are not vectorized). I suppose, it allows GPR file to consume less power than it would if the file was used for k registers. I don't know how big the difference is (and if there is any), some educated research is needed to weigh all pros and cons.

andysem · ‎11-09-2013

c0d1f1ed wrote:

Look, with scalar code nobody expects 8-bit arithmetic to be faster than 32-bit. Why would it have to be faster with SIMD parallelism?

Because scalar code by definition processes data units sequentially and vectorised code does it in parallel. The size of the vector (in units) defines the performance gain from vectorization. With SSE2 I'm able to process 16 bytes or 8 words at once, and AVX2 doubles that amount, so there is an obvious gain. Surely, you may not reach 2x speedup in a real application, but at least the potential is there and the real speedup is possible. With AVX-512 I'm able to process 16 bytes or words, which is worse than AVX2 in case of bytes and the same in case of words. Additionally, I have to perform conversions from 8/16-bit elements to 32-bit and back, which also takes time. This is hardly an improvement, likely the opposite.

You suggest using vector-within-vector approach, and if I understand you correctly and reading AVX-512 instructions right that's something like what people used before MMX, with general purpose registers. These tricks can be useful, as long as the algorithm is simple enough and you know the "pseudo-elements" within the "sub-vector" won't interfere with each other as the calculation goes. As soon as this doesn't hold the approach becomes not feasible and you're back to the scalar (in case of AVX-512 - 32-bit unit) code. The granularity of mask registers only complicates this because you're not able to select units on byte or word level. Again, you have to resort to bit masks and logical operations, which in turn is tricky because of 32-bit only units. So no, I don't consider vector-within-vector approach as a suitable solution for the limitation AVX-512 makes.

Please, correct me if I misunderstood your vector-within-vector suggestion and you meant something different.

c0d1f1ed wrote:

GPUs have a narrow scope of applications because they have to be programmed heterogeneously, and because they have low single-threaded performance. AVX-512 has neither of those issues. 8-bit and 16-bit vector instructions that are not vector-within-vector, are not going to help make the scope even wider. If you think otherwise please sum up some applications that would benefit from them.

I'm currently interested in realtime multimedia processing, which includes video and audio processing (scaling, colorspace conversion, blending, mixing, etc.) and compression. I wrote quite a few algorithms for processing media for my employer, and for the most part these algorithms involve 8 and 16-bit operations on data. I generally try to avoid FP calculations because it's slower than integer/fixed point and I didn't find much use in 32-bit integer operations in my area. So my primary interest is 8 and 16-bit integer operations.

GPUs are probably better tailored for my tasks, but their use is not beneficial in my case for various reasons, technical and not. One of the reasons is too much overhead because of the need to transfer data between CPU and GPU memory. So I'm very much interested in increasing data processing performance on the CPU.

c0d1f1ed wrote:

The problem isn't the ALUs. The problem is the masks. I am suggesting vector-within-vector instructions, which adds a tiny amount of ALU complexity, but keeps the masking simple.

Sorry, but why is applying a 16-bit mask to 32-bit lanes less complex than applying a 64-bit mask to 8-bit lanes?

c0d1f1ed wrote:

I asked Agner this same question: why would you want a different number of lanes for 32-bit, 16-bit and 8-bit elements? It breaks the paradigm of one loop iteration per lane, and it takes many shuffle instructions to switch between them. What's wrong with just keeping 8-bit data in 32-bit lanes, or using vector-within-vector instructions?

I think I answered this above. I'll just add that it doesn't require any excessive amount of shuffle instructions. Probably, because the input and the output data are 8/16-bit in most of my cases.

c0d1f1ed wrote:

If you vectorize a loop which contains conditional statements, you need to execute all the paths that any of the elements of the vector are taking, and then blend the results together. But you're computing certain elements of the vectors that you're throwing away. The worst part of that is the wasted power.

The mask registers not only do the blending in the same instruction, they also allow to clock-gate the lanes whose results are thrown away anyway. This isn't possible with xmm registers because you can't read that many operands from the same register file per cycle at acceptable power consumption. You'd also need forwarding paths from the low 128-bit to the entire 512-bit width (or more), from every output to every input, splitting the bits up to each 8/16/32/64-bit element. That's not desirable either. With dedicated mask registers for 32/64-bit elements only, it gets much simpler.

I was thinking that sign bits of each lane would be used as a mask bit, so no need to forward bits between different lanes. But if it's not possible or reasonable to implement the xmm/ymm/zmm register file so that it is able to serve for masks as well then ok, there's no choice but to have the separate mask registers. But that brings us back to the original concern - the suggested set of operations on the mask registers is incomplete and their extension course is uncertain wrt 8/16-bit units and larger vectors.

c0d1f1ed wrote:

I'm sure that wasn't an order of magnitude compared to using half the vector width. With AVX-512 we have the opportunity for 16x parallelization of a lot more code, with the potential for 32x in the future. That's way more valuable than doubling the performance of the AVX2 code you already have and get to keep. I mean, it's a simple choice: do you want 2x more but only for 8-bit data processing, or do you want 16x for a ton of applications? And again, vector-within-vector instructions can help you get that 2x or even 4x on top of that for 16-bit and 8-bit data respectively. The AVX-512 foundation seems very suitable to be extended that way, without needing any changes to how the mask registers work.

It's not like AVX-512 is introducing operations on 32-bit elements. The operations existed since SSE2, and while they did not offer 16x speedup, like for bytes, 4x is also a big gain. And AVX-512 brings 2x speedup for 32-bit operations compared to AVX2, just as it could be for bytes and words. I'm not trying to make a choice here between 32-bit and 8-bit, and I don't see why such a choice should even exist. I want performance gains for all kinds of applications.

bronxzv · ‎11-09-2013

andysem wrote:
although reusing GPR file would probably have a negative effect on register renaming. If GPR file is increased to accommodate the new registers

the integer PRF has 168 entries in Haswell for example http://www.realworldtech.com/haswell-cpu/3/

I don't see why the k registers will need anything more since as you say vector code typically puts low pressure on the integer register file

[EDIT] on the other hand if the vector PRF was used for the masks in ZMM registers (as per OP proposal) it will be probably needed to add more entries (and a lot worse: to add ports to sustain a decent IPC with masks) to this (8x wider) structure to adapt to the heavy usage (think to an inner loop with most instructions using masks and a lot of FMA instructions), in the end there will be even more imbalance between vector and integer PRFs when executing vector code

Bernard · ‎11-09-2013

Hi bronxzv

Thanks for posting that link. There is a lot of valuable information.

BTW do you have any info about uops encoding (horizontal or vertical)?

bronxzv · ‎11-09-2013

iliyapolak wrote:

Hi bronxzv

Thanks for posting that link. There is a lot of valuable information.

BTW do you have any info about uops encoding (horizontal or vertical)?

I have no information about uops encoding and I suppose people with access to this information aren't allowed to disclose anything

Bernard · ‎11-09-2013

>>>I don't see why the k registers will need anything more since as you say vector code typically puts low pressure on the integer register file>>>

And also on the floating point register file (if at hardware level a distinction is made between registers which operate on FP vector code and those operating on integer vector code). Here I mean that for short sequences of Horner-like scheme code you will use at most 3-4 (architectural) registers per single term calculation. So there should not be a need to rename registers.

>>>I have no information about uops encoding and I suppose people with access to this information aren't allowed to disclose anything>>>

It seems that everything which is related to uops is closely kept as a secret.

capens__nicolas · ‎11-09-2013

andysem wrote:

Quote:

c0d1f1ed wrote:
Look, with scalar code nobody expects 8-bit arithmetic to be faster than 32-bit. Why would it have to be faster with SIMD parallelism?

Because scalar code by definition processes data units sequentially and vectorised code does it in parallel.

It is still SIMD parallelism if you have 8-bit or 16-bit data in 32-bit lanes. So it's faster over the entire width, but does not have to be per lane. Again, GPUs achieve tremendous performance despite this due to the high total width. So there isn't one way it has to be done "by definition". Processing tightly packed 8-bit values comes at a cost, especially if you want each of them maskable with a predicate bit or want to shuffle them over a great distance. So you may want to compromise some 8/16-bit performance to keep 32/64-bit processing scalable. But you can get the best of both worlds with vector-within-vector instructions:

You suggest using vector-within-vector approach, and if I understand you correctly and reading AVX-512 instructions right that's something like what people used before MMX, with general purpose registers. These tricks can be useful, as long as the algorithm is simple enough and you know the "pseudo-elements" within the "sub-vector" won't interfere with each other as the calculation goes. As soon as this doesn't hold the approach becomes not feasible and you're back to the scalar (in case of AVX-512 - 32-bit unit) code. The granularity of mask registers only complicates this because you're not able to select units on byte or word level. Again, you have to resort to bit masks and logical operations, which in turn is tricky because of 32-bit only units. So no, I don't consider vector-within-vector approach as a suitable solution for the limitation AVX-512 makes.

Please, correct me if I misunderstood your vector-within-vector suggestion and you meant something different.

Vector-within-vector means each SIMD lane executes a small vector operation independently from the other lanes. For instance if you have a loop where you add RGBA colors that have 8-bit components, this can be SIMD parallelized. The 'Single Instruction' part of SIMD is just a 4x8-bit vector operation in this case instead of a scalar operation. The whole instruction thus becomes a 16x4x8-bit vector-within-vector instruction in the case of AVX-512. The difference with a 64x8-bit instruction as you are requesting, is that with a vector-within-vector instruction the mask bits still predicate an entire 32-bit lane or in other words the 4x8-bit inner vectors, instead of each 8-bit element individually. This is perfectly fine, since your original loop contains 4x8-bit operations and any branching would happen at 32-bit granularity!
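To make the idea concrete, here is a software model of such an instruction. No instruction of this kind is confirmed for AVX-512 in this thread; `add_4x8` and `masked_add_16x4x8` are hypothetical names, and the byte-wise add uses a well-known carry-isolation trick rather than anything a real implementation is known to do.

```python
def add_4x8(a: int, b: int) -> int:
    """Add four packed 8-bit values held in a 32-bit lane.

    Each byte wraps around on its own; no carry crosses a byte boundary.
    """
    low = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F)   # add the low 7 bits per byte
    return (low ^ ((a ^ b) & 0x80808080)) & 0xFFFFFFFF


def masked_add_16x4x8(dst, a, b, k):
    """Model a 16 x (4 x 8-bit) add: one mask bit predicates a whole lane."""
    return [add_4x8(x, y) if (k >> i) & 1 else d
            for i, (d, x, y) in enumerate(zip(dst, a, b))]


# Sixteen 32-bit lanes, each holding one RGBA pixel with 8-bit components.
pixels_a = [0x01FF7F80] * 16
pixels_b = [0x01010180] * 16
untouched = [0] * 16
k = 0x00FF                      # only the low 8 lanes are active

out = masked_add_16x4x8(untouched, pixels_a, pixels_b, k)
print([hex(x) for x in out])
```

The mask stays 16 bits wide even though 64 byte-sized additions are performed, which is the property being argued for here.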

Quote:

c0d1f1ed wrote:
GPUs have a narrow scope of applications because they have to be programmed heterogeneously, and because they have low single-threaded performance. AVX-512 has neither of those issues. 8-bit and 16-bit vector instructions that are not vector-within-vector, are not going to help make the scope even wider. If you think otherwise please sum up some applications that would benefit from them.

I'm currently interested in realtime multimedia processing, which includes video and audio processing (scaling, colorspace conversion, blending, mixing, etc.) and compression. I wrote quite a few algorithms for processing media for my employer, and for the most part these algorithms involve 8 and 16-bit operations on data. I generally try to avoid FP calculations because it's slower than integer/fixed point and I didn't find much use in 32-bit integer operations in my area. So my primary interest is 8 and 16-bit integer operations.

Then I think vector-within-vector instructions would be right for you, offering the full power of 512-bit without the need for 8-bit masking granularity.

GPUs are probably better tailored for my tasks, but their use is not beneficial in my case for various reasons, technical and not. One of the reasons is too much overhead because of the need to transfer data between CPU and GPU memory. So I'm very much interested in increasing data processing performance on the CPU.

I agree. GPGPU is a minefield and even with AMD's HSA efforts there will still be too many variants which each have their own pitfalls. That's too hard for the average developer, or better yet compiler, to master. Unless your workload is 'embarrassingly parallel' the ROI for using the GPU doesn't add up, and with things like AVX-512 it will keep diminishing.

Quote:

c0d1f1ed wrote:
The problem isn't the ALUs. The problem is the masks. I am suggesting vector-within-vector instructions, which adds a tiny amount of ALU complexity, but keeps the masking simple.

Sorry, but why is applying a 16-bit mask to 32-bit lanes less complex than applying a 64-bit mask to 8-bit lanes?

First of all you shouldn't compare just two widths which only differ by 2x. This is about supporting 8/16/32/64-bit mask granularity or just 32/64-bit. In the case of AVX-512 you'd have to route the lower 8 mask bits to 8x64-bit, the lower 16 bits to 16x32-bit, the lower 32 bits to 32x16-bit and 64 bits to 64x8-bit. That's a total of 120 bits running 'horizontally' over a great distance (assuming the SIMD lanes run vertically), instead of just 24. And not just that, you need to route two signal bits instead of one to select which of these four should be used by the next instruction, based on the instruction type. Then there's three vector execution ports per port, with integer and float domains. So supporting four instead of two predication granularities adds various gate and wire delays that have to be taken into account which either make the design slower or consume more power.
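The 120 versus 24 figure is a straightforward sum, reproduced below as a sketch. The function names are illustrative, and the count follows the simplified routing model described in the paragraph above, not any real floorplan.

```python
def routed_mask_bits(vector_bits, element_widths):
    """Sum the mask bits routed for every supported element width."""
    return sum(vector_bits // width for width in element_widths)


def select_bits(num_granularities):
    """Signal bits needed to choose among the supported granularities."""
    return (num_granularities - 1).bit_length()


all_widths = (64, 32, 16, 8)    # 8 + 16 + 32 + 64 mask bits
wide_only = (64, 32)            # 8 + 16 mask bits

print(routed_mask_bits(512, all_widths), select_bits(len(all_widths)))
print(routed_mask_bits(512, wide_only), select_bits(len(wide_only)))
```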

GPUs use clever tricks to support both 32-bit and 64-bit lanes with a minimum of cross-lane communication, and I imagine AVX-512 implementations will use similar tricks. The physical layout can differ quite a bit from the logical layout (so forget everything I said about vertical and horizontal). But I'm sure that supporting 16-bit and 8-bit granularity of predication significantly complicates things.

Quote:

c0d1f1ed wrote:
I'm sure that wasn't an order of magnitude compared to using half the vector width. With AVX-512 we have the opportunity for 16x parallelization of a lot more code, with the potential for 32x in the future. That's way more valuable than doubling the performance of the AVX2 code you already have and get to keep. I mean, it's a simple choice: do you want 2x more but only for 8-bit data processing, or do you want 16x for a ton of applications? And again, vector-within-vector instructions can help you get that 2x or even 4x on top of that for 16-bit and 8-bit data respectively. The AVX-512 foundation seems very suitable to be extended that way, without needing any changes to how the mask registers work.

It's not like AVX-512 is introducing operations on 32-bit elements. The operations existed since SSE2, and while they did not offer 16x speedup, like for bytes, 4x is also a big gain. And AVX-512 brings 2x speedup for 32-bit operations compared to AVX2, just as it could be for bytes and words. I'm not trying to make a choice here between 32-bit and 8-bit, and I don't see why such a choice should even exist. I want performance gains for all kinds of applications.

Yes you get 4x32-bit with SSE2 and 8x32-bit with AVX2, but their use is more limited than AVX-512, even if it were restricted to 128-bit and 256-bit respectively. AVX-512 adds things that make it highly suitable for vectorizing generic loops. A fast gather operation is essential when you're doing any indexed addressing, predication masks keep branches efficient, broadcast allows to have scalar constants, etc. And as I said before, it took until AVX2 to get vector-vector shift. So anything before that was unable to easily vectorize loops containing a shift operation.

So it's important to realize that we'll suddenly see a whole lot more applications benefit from SIMD. And not just 4x. We'll get up to 16x. And on top of that TSX is helping multi-core performance. So it's a few things that in isolation are only evolutionary, but combined result in a relatively sudden revolutionary change in the CPU's capabilities. Mark my words, it will be a new era in computing. You no longer have to think is this code threadable or vectorizable and on what device should I run it. It's all just code, and with the help of the compiler the CPU will extract any kind of parallelism that's in there.

capens__nicolas · ‎11-09-2013

andysem wrote:
And there is another thing to consider. In vectorized code the majority of instructions don't even touch general purpose registers (except for load and store operations and the parts that are not vectorized). I suppose, it allows GPR file to consume less power than it would if the file was used for k registers.

Even just for load/store pointers and indices (e.g. the loop iterator), it amounts to a significant number of scalar register accesses for otherwise highly data-parallel code. So the scalar register file is always in use and from a performance/Watt perspective it would be more wasteful to not make use of the additional read ports. It's burning quite a lot of power anyway, so you might as well access 2-3 registers per cycle instead of ~1 for data-parallel code. It makes perfect sense to store the k registers in the scalar register file, and explains their 64-bit size.

And it's not just the register file itself. Renaming and scheduling are expensive stages too, so you want to reuse all of that.

andysem · ‎11-10-2013

c0d1f1ed wrote:

Quote:

You suggest using vector-within-vector approach, and if I understand you correctly and reading AVX-512 instructions right that's something like what people used before MMX, with general purpose registers. These tricks can be useful, as long as the algorithm is simple enough and you know the "pseudo-elements" within the "sub-vector" won't interfere with each other as the calculation goes. As soon as this doesn't hold the approach becomes not feasible and you're back to the scalar (in case of AVX-512 - 32-bit unit) code. The granularity of mask registers only complicates this because you're not able to select units on byte or word level. Again, you have to resort to bit masks and logical operations, which in turn is tricky because of 32-bit only units. So no, I don't consider vector-within-vector approach as a suitable solution for the limitation AVX-512 makes.
Please, correct me if I misunderstood your vector-within-vector suggestion and you meant something different.

Vector-within-vector means each SIMD lane executes a small vector operation independently from the other lanes.

I haven't seen such instructions in AVX-512, and I haven't come across any references to future extensions that introduce them. Do you have such references?

In any case, such instructions are much less useful than the true support for 8/16-bit elements, see below.

c0d1f1ed wrote:

For instance if you have a loop where you add RGBA colors that have 8-bit components, this can be SIMD parallelized. The 'Single Instruction' part of SIMD is just a 4x8-bit vector operation in this case instead of a scalar operation. The whole instruction thus becomes a 16x4x8-bit vector-within-vector instruction in the case of AVX-512. The difference with a 64x8-bit instruction as you are requesting, is that with a vector-within-vector instruction the mask bits still predicate an entire 32-bit lane or in other words the 4x8-bit inner vectors, instead of each 8-bit element individually. This is perfectly fine, since your original loop contains 4x8-bit operations and any branching would happen at 32-bit granularity!

RGBA is just a special case, which is typically met in image processing. In video processing you typically deal with some variation of YUV colorspace, which is stored in planar format. The algorithm processes each plane individually, and every byte in the plane corresponds to a pixel (or several pixels) of the image. You have to apply the mask to 8-bit elements of the vector.

In audio processing, 16-bit samples are dominant nowadays, and since samples are stored sequentially, this requires masking 16-bit elements of the vector. 32-bit masking might be ok for the interleaved stereo case, but again, this is just a special case.

In string processing, you hardly ever deal with any interleaved data or with elements larger than 8 bits. UTF16 is widespread on Windows, but that's an exception and still isn't 32-bit.

In general I'd say that any case with interleaved data streams is more an exception than a rule, so creating hardware instructions to target particularly these cases seems unwise to me. Native support for 8/16-bit elements, on the other hand, would be very welcome.

c0d1f1ed wrote:

Quote:

Quote:

Sorry, but why is applying a 16-bit mask to 32-bit lanes less complex than applying a 64-bit mask to 8-bit lanes?

First of all you shouldn't compare just two widths which only differ by 2x. This is about supporting 8/16/32/64-bit mask granularity or just 32/64-bit. In the case of AVX-512 you'd have to route the lower 8 mask bits to 8x64-bit, the lower 16 bits to 16x32-bit, the lower 32 bits to 32x16-bit and all 64 bits to 64x8-bit. That's a total of 120 bits running 'horizontally' over a great distance (assuming the SIMD lanes run vertically), instead of just 24. And not just that, you need to route two signal bits instead of one to select which of these four should be used by the next instruction, based on the instruction type. Then there are three vector execution ports per core, with integer and float domains. So supporting four instead of two predication granularities adds various gate and wire delays that have to be taken into account, which either make the design slower or consume more power.
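The 120 versus 24 figures follow from giving each supported element width its own set of predicate signals, one per element of a 512-bit vector. A small sketch of that count (the helper name `mask_signal_count` is made up here):

```c
/* Number of mask bits that must reach the lanes if every listed element
 * width gets its own predicate signal on a vector of vector_bits bits. */
static unsigned mask_signal_count(unsigned vector_bits,
                                  const unsigned *elem_bits, unsigned n)
{
    unsigned total = 0;
    for (unsigned i = 0; i < n; ++i)
        total += vector_bits / elem_bits[i];  /* one mask bit per element */
    return total;
}
```

For widths 8, 16, 32 and 64 this gives 64 + 32 + 16 + 8 = 120; for 32 and 64 alone it gives 16 + 8 = 24.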

GPUs use clever tricks to support both 32-bit and 64-bit lanes with a minimum of cross-lane communication, and I imagine AVX-512 implementations will use similar tricks. The physical layout can differ quite a bit from the logical layout (so forget everything I said about vertical and horizontal). But I'm sure that supporting 16-bit and 8-bit granularity of predication significantly complicates things.

If mask registers were designed with byte granularity in mind, all that complexity would be unnecessary. Just let k registers be 64-bit from the start, every bit would mask the corresponding byte in the vector. Larger-granularity masking would just use more than one bit per lane or just one (e.g. highest) bit to mask out the operation on the element. Comparison instructions would also set bits corresponding to the operation granularity (like pcmpgt/pcmpeq do now with xmm registers). No need for wiring different bits to different lanes at all.

Vector width extension then comes naturally with k registers width extension. I suppose, at that point the integer PRF wouldn't be enough to store the extended k registers, unless one k register is multiplexed from two physical registers in the file.

c0d1f1ed wrote:

Yes you get 4x32-bit with SSE2 and 8x32-bit with AVX2, but their use is more limited than AVX-512, even if it were restricted to 128-bit and 256-bit respectively. AVX-512 adds things that make it highly suitable for vectorizing generic loops. A fast gather operation is essential when you're doing any indexed addressing, predication masks keep branches efficient, broadcast allows you to have scalar constants, etc. And as I said before, it took until AVX2 to get vector-vector shift. So anything before that was unable to easily vectorize loops containing a shift operation.

So it's important to realize that we'll suddenly see a whole lot more applications benefit from SIMD. And not just 4x. We'll get up to 16x. And on top of that TSX is helping multi-core performance. So it's a few things that in isolation are only evolutionary, but combined result in a relatively sudden revolutionary change in the CPU's capabilities. Mark my words, it will be a new era in computing. You no longer have to think is this code threadable or vectorizable and on what device should I run it. It's all just code, and with the help of the compiler the CPU will extract any kind of parallelism that's in there.

While I agree that many features you refer to are very welcome, I wouldn't call AVX-512 a revolutionary extension and say that everything before it was not beneficial for 32-bit vectorization. Yes, AVX-512 opens new possibilities, and surely new applications and compilers will make use of it, eventually. But I also understand that most current (and near future) performance-critical code is tailored specifically for SSE/AVX, either by efforts of developers or compiler (for which I personally have little faith). The code that is not vectorized is either (a) not performance critical or (b) too complex to be vectorized. AVX-512 may help (b) but chances are high that the code will still be too complex for it as well, because otherwise it would have been rewritten for SSE/AVX already. So I can't agree with your 16x estimate.

In any case, my point was that you shouldn't restrict yourself to 32/64-bit only because the real world use cases are much more diverse. GPUs pull off the performance race because they are much more parallel than AVX2 or AVX-512, even though they are restricted to 32/64-bit. With shorter vectors, CPUs should be better suited for smaller data units which are actually commonly used in applications.

AFog0 · ‎11-10-2013

As I understand c0d1f1ed, your argument is that 32-bit granularity with clock-gating masks is a new paradigm and that 8-bit granularity is more costly in terms of routing clock gates, far distance shuffle instructions, and gather instructions. What you call vector-within-vector instructions is just a vector instruction with 8-bit or 16-bit elements, but masking with 32-bit granularity. I think it is quite cheap to extend the ALU so that it can handle carry at 8-bit granularity. Full-width shuffle and gather instructions with 8-bit granularity are not available with AVX2 or AVX512 and I doubt that they will ever be available. You have to use two-step shuffle for that.

Whether it is too expensive to mask with 8-bit or 16-bit granularity is difficult to argue when the people who know the hardware details seem to be gagged. Even if it is expensive to do low-granularity masking today, it may be cheap tomorrow, or they may have to do it anyway because of demand from SW people. My point is that the design should be extensible because we can't predict the future. The history of x86 instruction sets is full of examples of very shortsighted solutions with no concern for extensibility. They are making the same mistake again with AVX512 by making mask registers with no plan for extension beyond 64 bits.

bronxzv · ‎11-11-2013

[EDITED]

andysem wrote:
In any case, such instructions are much less useful than the true support for 8/16-bit elements, see below.

indeed, and btw Intel has never said it will not introduce full support (clean SoA support, not a clumsy AoS within SoA mess) for 8-bit and 16-bit elements in the future, at least scatter/gather support for 8-bit and 16-bit elements (case in point FP16 half-floats useful also for FP code) is much needed even for 16-way SIMD only

predicting the future of Intel's vector ISAs is a difficult art, as shown for example in this (not so old) c0d1f1ed's post : http://software.intel.com/en-us/comment/reply/277741/1460656

andysem wrote:

Just let k registers be 64-bit from the start,

k registers are already defined as 64-bit, it will be easy to add in due time the 32-bit/64-bit spill/fill instructions, if the ABI requires a callee to not modify some k registers you'll have to recompile it, it doesn't look like a significant drawback IMHO, unlike the OP makes it sound

andysem · ‎11-11-2013

bronxzv wrote:

Quote:

andysem wrote:
Just let k registers be 64-bit from the start,

k registers are already defined as 64-bit, it will be easy to add in due time the 32-bit/64-bit spill/fill instructions, if the ABI requires a callee to not modify some k registers you'll have to recompile it, it doesn't look like a significant drawback, unlike the OP makes it sound

What I meant is that all 64 bits could be used to simplify signal routing and add support for 8/16-bit vector elements. I.e. in case of 8-bit elements every bit of the mask is in effect, in case of 16-bit elements - bits 1, 3, 5 and so on, 32-bit - 3, 7, 11 and so on, 64-bit - 7, 15, 23 and so on.
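The bit positions listed above follow one rule: for an element of N bytes, the governing bit is the highest of the N mask bits that cover it. A minimal sketch of that rule, with a hypothetical helper name `sparse_mask_bit`:

```c
#include <assert.h>

/* Bit of a 64-bit byte-granular mask that governs element `index` when a
 * 512-bit vector is viewed as elements of elem_bytes bytes (1, 2, 4 or 8). */
static unsigned sparse_mask_bit(unsigned index, unsigned elem_bytes)
{
    assert(elem_bytes == 1 || elem_bytes == 2 ||
           elem_bytes == 4 || elem_bytes == 8);
    assert(index < 64 / elem_bytes);
    return index * elem_bytes + (elem_bytes - 1);
}
```

For 16-bit elements this yields bits 1, 3, 5, ...; for 32-bit elements 3, 7, 11, ...; and for 64-bit elements 7, 15, 23, ..., matching the sequences in the post.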

bronxzv · ‎11-11-2013

[EDITED]

andysem wrote:

Quote:

bronxzv wrote:
Quote:

andysem wrote:

Just let k registers be 64-bit from the start,

k registers are already defined as 64-bit, it will be easy to add in due time the 32-bit/64-bit spill/fill instructions, if the ABI requires a callee to not modify some k registers you'll have to recompile it, it doesn't look like a significant drawback, unlike the OP makes it sound

What I meant is that all 64 bits could be used to simplify signal routing and add support for 8/16-bit vector elements. I.e. in case of 8-bit elements every bit of the mask is in effect, in case of 16-bit elements - bits 1, 3, 5 and so on, 32-bit - 3, 7, 11 and so on, 64-bit - 7, 15, 23 and so on.

ah yes, I see what you mean, I can't comment on the simplified routing thing but as a programmer using only the LSBs looks far easier to grasp because it is consistent with the values returned by VMOVMSKPS/VMOVMSKPD/VPMOVMSKB etc., otherwise we will have 3 types of masks (the legacy SSE/AVX masks with the MSB of each element used as mask bit, the packed masks returned by VMOVMSKPS and the like, the new masks as per your definition)
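For readers less familiar with the packed masks mentioned here, the following is a portable emulation of what MOVMSKPS does on a 128-bit register: it turns the "MSB of each element" style of mask into a dense integer with one bit per element. The function name `movmsk_u32x4` is made up; real code would use the instruction or its intrinsic.

```c
#include <stdint.h>

/* Collect the most significant bit of each of the four 32-bit elements
 * into the low four bits of an integer (bit i comes from element i). */
static unsigned movmsk_u32x4(const uint32_t v[4])
{
    unsigned packed = 0;
    for (int i = 0; i < 4; ++i)
        packed |= (unsigned)(v[i] >> 31) << i;
    return packed;
}
```

The AVX-512 k registers use this dense layout directly, which is the consistency argument made above.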

capens__nicolas · ‎11-11-2013

andysem wrote:

c0d1f1ed wrote:
Vector-within-vector means each SIMD lane executes a small vector operation independently from the other lanes.

I haven't seen such instructions in AVX-512, and I haven't come across any references to future extensions that introduce them. Do you have such references?

I don't know of any that have been specified already. But my point is that AVX-512 can easily be extended to support them, without any changes to the predication masks. It's about thinking long term. In the shorter term, 32/64-bit operations are the most valuable use of those extra transistors so that's what we're getting first.

In any case, such instructions are much less useful than the true support for 8/16-bit elements, see below.

c0d1f1ed wrote:
For instance if you have a loop where you add RGBA colors that have 8-bit components, this can be SIMD parallelized. The 'Single Instruction' part of SIMD is just a 4x8-bit vector operation in this case instead of a scalar operation. The whole instruction thus becomes a 16x4x8-bit vector-within-vector instruction in the case of AVX-512. The difference with a 64x8-bit instruction as you are requesting, is that with a vector-within-vector instruction the mask bits still predicate an entire 32-bit lane or in other words the 4x8-bit inner vectors, instead of each 8-bit element individually. This is perfectly fine, since your original loop contains 4x8-bit operations and any branching would happen at 32-bit granularity!

RGBA is just a special case, which is typically met in image processing. In video processing you typically deal with some variation of YUV colorspace, which is stored in planar format. The algorithm processes each plane individually, and every byte in the plane corresponds to a pixel (or several pixels) of the image. You have to apply the mask to 8-bit elements of the vector.

In audio processing, 16-bit samples are dominant nowadays, and since samples are stored sequentially, this requires masking 16-bit elements of the vector. 32-bit masking might be ok for the interleaved stereo case, but again, this is just a special case.

In string processing, you hardly ever deal with any interleaved data or with elements larger than 8 bits. UTF16 is widespread on Windows, but that's an exception and still isn't 32-bit.

In general I'd say that any case with interleaved data streams is more an exception than a rule, so creating hardware instructions to target particularly these cases seems unwise to me. Native support for 8/16-bit elements, on the other hand, would be very welcome.

I know about all those use cases. But you haven't given me a compelling reason why these should be predicatable at an 8/16-bit granularity. So far we've been able to live without predication at all, by using blend instructions and logic operations. They work just fine. And with vector-within-vector instructions, there would be absolutely no difference with what has been available in the past.
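The blend-by-logic-operations idiom referred to here can be sketched on eight packed bytes in a 64-bit integer. This is a scalar stand-in for what pand/pandn/por or a blend instruction do on full vectors; the name `select_bytes` is hypothetical.

```c
#include <stdint.h>

/* Branch-free select on eight packed bytes: where a mask byte is 0xFF the
 * byte is taken from a, where it is 0x00 the byte is taken from b. */
static uint64_t select_bytes(uint64_t mask, uint64_t a, uint64_t b)
{
    return (a & mask) | (b & ~mask);
}
```

The mask here is the legacy full-width kind, as produced by pcmpeq/pcmpgt, so both sides of the "branch" are always computed and then merged.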

Predication masks are orthogonal to that. The only value they add is that the blend happens as part of the same instruction, and it can get clock gated per lane. In my opinion those features aren't very important to the use cases you've described. You need code with a significant number of branches with a fair bit of divergence, before this complexity starts to pay off. Parallelizable code with 8-bit or 16-bit values generally does not fall into that category. And even if it does, you still get the choice between using blend instructions or storing them in 32-bit lanes. Depending on the situation, one of these is perfectly acceptable. Again, we've gone without predication masks for ages, so it is not worth losing 32/64-bit scalability by demanding 8/16-bit predication granularity.

c0d1f1ed wrote:
First of all you shouldn't compare just two widths which only differ by 2x. This is about supporting 8/16/32/64-bit mask granularity or just 32/64-bit. In the case of AVX-512 you'd have to route the lower 8 mask bits to 8x64-bit, the lower 16 bits to 16x32-bit, the lower 32 bits to 32x16-bit and all 64 bits to 64x8-bit. That's a total of 120 bits running 'horizontally' over a great distance (assuming the SIMD lanes run vertically), instead of just 24. And not just that, you need to route two signal bits instead of one to select which of these four should be used by the next instruction, based on the instruction type. Then there are three vector execution ports per core, with integer and float domains. So supporting four instead of two predication granularities adds various gate and wire delays that have to be taken into account, which either make the design slower or consume more power.

GPUs use clever tricks to support both 32-bit and 64-bit lanes with a minimum of cross-lane communication, and I imagine AVX-512 implementations will use similar tricks. The physical layout can differ quite a bit from the logical layout (so forget everything I said about vertical and horizontal). But I'm sure that supporting 16-bit and 8-bit granularity of predication significantly complicates things.

If mask registers were designed with byte granularity in mind, all that complexity would be unnecessary. Just let k registers be 64-bit from the start, every bit would mask the corresponding byte in the vector. Larger-granularity masking would just use more than one bit per lane or just one (e.g. highest) bit to mask out the operation on the element. Comparison instructions would also set bits corresponding to the operation granularity (like pcmpgt/pcmpeq do now with xmm registers). No need for wiring different bits to different lanes at all.

That would be an excellent suggestion if not for the fact that AVX-512 would already use all 64 bits of the k registers. Extending to 1024-bit would get very messy. I don't think that's a smart move for a predication granularity that's not going to be of great value anyway.

Vector width extension then comes naturally with k registers width extension. I suppose, at that point the integer PRF wouldn't be enough to store the extended k registers, unless one k register is multiplexed from two physical registers in the file.

Then you need two register file accesses per predication mask (up to a total of six per cycle, without counting any pointer and index registers). Also, you'd easily run out of physical registers. This sounds like a costly hack to me, and again I don't think it's worth the effort.

c0d1f1ed wrote:
Yes you get 4x32-bit with SSE2 and 8x32-bit with AVX2, but their use is more limited than AVX-512, even if it were restricted to 128-bit and 256-bit respectively. AVX-512 adds things that make it highly suitable for vectorizing generic loops. A fast gather operation is essential when you're doing any indexed addressing, predication masks keep branches efficient, broadcast allows you to have scalar constants, etc. And as I said before, it took until AVX2 to get vector-vector shift. So anything before that was unable to easily vectorize loops containing a shift operation.

So it's important to realize that we'll suddenly see a whole lot more applications benefit from SIMD. And not just 4x. We'll get up to 16x. And on top of that TSX is helping multi-core performance. So it's a few things that in isolation are only evolutionary, but combined result in a relatively sudden revolutionary change in the CPU's capabilities. Mark my words, it will be a new era in computing. You no longer have to think is this code threadable or vectorizable and on what device should I run it. It's all just code, and with the help of the compiler the CPU will extract any kind of parallelism that's in there.

While I agree that many features you refer to are very welcome, I wouldn't call AVX-512 a revolutionary extension and say that everything before it was not beneficial for 32-bit vectorization. Yes, AVX-512 opens new possibilities, and surely new applications and compilers will make use of it, eventually. But I also understand that most current (and near future) performance-critical code is tailored specifically for SSE/AVX, either by efforts of developers or compiler (for which I personally have little faith). The code that is not vectorized is either (a) not performance critical or (b) too complex to be vectorized. AVX-512 may help (b) but chances are high that the code will still be too complex for it as well, because otherwise it would have been rewritten for SSE/AVX already. So I can't agree with your 16x estimate.

It's all about ROI. 4x isn't that compelling to most developers, considering they have to learn SSE2 intrinsics and all the quirks and limitations, leaving 2x at best in most cases. 8x peak gets more interesting but again most developers just won't leave the realm of their high-level language. 16x and the ability to have the compiler take care of it all, now that's hard to pass up on. Even if some performance is lost in the process, it's a pretty sure deal.

Note that AMD is betting the farm on HSA. The potential peak return is roughly the same as AVX-512, but the investment and risks are considerably higher. So don't underestimate what Intel is going to achieve here. AMD definitely thinks there's enough applications that would benefit from the raw processing power of the GPU's wide SIMD units. But heterogeneous is inherently more complex and has more overhead. So if HSA is supposed to be revolutionary, then I don't think AVX-512 should be regarded as anything less.

In any case, my point was that you shouldn't restrict yourself to 32/64-bit only because the real world use cases are much more diverse. GPUs pull off the performance race because they are much more parallel than AVX2 or AVX-512, even though they are restricted to 32/64-bit. With shorter vectors, CPUs should be better suited for smaller data units which are actually commonly used in applications.

GPUs are not more parallel than AVX-512. Kaveri, AMD's latest APU, can only do 740 GFLOPS in the GPU. A quad-core with AVX-512 would be capable of 768 GFLOPS at 3 GHz. But you could easily fit 6 or 8 cores on a die instead if you don't waste area on a GPU.
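The post does not say how the 768 GFLOPS figure is derived. One set of assumptions that reproduces it is two 512-bit FMA units per core, 16 single-precision lanes per unit, and two floating-point operations per fused multiply-add; these are assumptions for illustration, not a statement about any shipping CPU. A sketch of the arithmetic, with a made-up helper name:

```c
/* Peak single-precision GFLOPS, assuming every core retires fma_units
 * fused multiply-adds per cycle on `lanes` elements each. */
static double peak_gflops(unsigned cores, double ghz,
                          unsigned fma_units, unsigned lanes)
{
    const unsigned flops_per_fma = 2;  /* one multiply plus one add */
    return cores * ghz * fma_units * lanes * flops_per_fma;
}
```

Under those assumptions, `peak_gflops(4, 3.0, 2, 16)` gives 4 x 3 x 2 x 16 x 2 = 768, and six or eight cores at the same clock would give 1152 or 1536.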

Sure, the GPU is more "parallel" in the sense that it does more things in parallel, but it does them far more slowly. So in practice the compute density is about the same. GPU manufacturers claim slower but wider is more power efficient. Future CPUs might achieve the same thing by having two clusters of SIMD units which alternatingly execute AVX-1024 instructions on 512-bit units in two cycles, with each cluster dedicated to one thread.

So I don't think CPUs are any more or any less suited for small data. And even though such workloads are fairly common, I'm still not convinced that they would require 8/16-bit predication granularity.

andysem · ‎11-11-2013

bronxzv wrote:

Quote:

andysem wrote: What I meant is that all 64 bits could be used to simplify signal routing and add support for 8/16-bit vector elements. I.e. in case of 8-bit elements every bit of the mask is in effect, in case of 16-bit elements - bits 1, 3, 5 and so on, 32-bit - 3, 7, 11 and so on, 64-bit - 7, 15, 23 and so on.

ah yes, I see what you mean, I can't comment on the simplified routing thing but as a programmer using only the LSBs looks far easier to grasp because it is consistent with the values returned by VMOVMSKPS/VMOVMSKPD/VPMOVMSKB etc., otherwise we will have 3 types of masks (the legacy SSE/AVX masks with the MSB of each element used as mask bit, the packed masks returned by VMOVMSKPS and the like, the new masks as per your definition)

The masks returned by movmsk* instructions are actually very similar to those described in AVX-512, in k registers (i.e. every bit in the mask is effective), while the sparse masks I described are similar to the masks in xmm/ymm registers (i.e. only MSB in the group of bits is the effective one). I don't think there will be much confusion, when the concept is understood that way. The additional benefit of the sparse masks is that they become independent of the vector granularity. You can create the mask by a 32-bit operation and then use it in 8 or 16-bit context without any changes. The programmer's interface is also simplified, since there would be no need for __mmask8, __mmask16, etc. types but just __mmask64. The drawback is that movmsk* instructions won't be useful for mask construction, but given their poor performance and availability of the new cmp* instructions I don't think this would be an issue.
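The claim that a sparse mask is independent of the vector granularity can be illustrated with a scalar emulation. Both functions below are hypothetical: `cmpeq_u32_sparse` models a 32-bit compare that sets all four mask bits covering a matching element (as proposed earlier in the thread), and `element_enabled` models an instruction that consults only the highest covering bit.

```c
#include <stdint.h>

/* Hypothetical compare under the sparse-mask proposal: comparing 16 x 32-bit
 * elements sets all four mask bits that cover each matching element. */
static uint64_t cmpeq_u32_sparse(const uint32_t a[16], const uint32_t b[16])
{
    uint64_t mask = 0;
    for (int i = 0; i < 16; ++i)
        if (a[i] == b[i])
            mask |= (uint64_t)0xF << (4 * i);
    return mask;
}

/* Is element `index` enabled when the vector is viewed as elements of
 * elem_bytes bytes? Only the highest covering mask bit is consulted. */
static int element_enabled(uint64_t mask, unsigned index, unsigned elem_bytes)
{
    unsigned bit = index * elem_bytes + (elem_bytes - 1);
    return (int)((mask >> bit) & 1u);
}
```

A mask produced by the 32-bit compare then gives the same answer for the 32-bit element, for both 16-bit halves of it, and for each of its four bytes, without any conversion step.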

andysem · ‎11-12-2013

c0d1f1ed wrote:

I know about all those use cases. But you haven't given me a compelling reason why these should be predicatable at an 8/16-bit granularity. So far we've been able to live without predication at all, by using blend instructions and logic operations. They work just fine. And with vector-within-vector instructions, there would be absolutely no difference with what has been available in the past.

Predication masks are orthogonal to that. The only value they add is that the blend happens as part of the same instruction, and it can get clock gated per lane. In my opinion those features aren't very important to the use cases you've described. You need code with a significant number of branches with a fair bit of divergence, before this complexity starts to pay off. Parallelizable code with 8-bit or 16-bit values generally does not fall into that category. And even if it does, you still get the choice between using blend instructions or storing them in 32-bit lanes. Depending on the situation, one of these is perfectly acceptable. Again, we've gone without predication masks for ages, so it is not worth losing 32/64-bit scalability by demanding 8/16-bit predication granularity.

Provided that Intel adds vector-within-vector operations, that would make predication useless for a considerable range of applications. Don't you think this is a somewhat wasted investment? Predication is a general new feature, and limiting it to 32/64-bit only seems unreasonable to me.

Yes, we currently use blend and logical operations to solve branching cases, but what makes you think media and string algorithms wouldn't benefit from replacing them with predication? You described the benefits yourself. Depending on hardware implementation, I imagine there could even be some throughput gains, if the CPU is able to execute more instructions in parallel if more of the elements of the operands are masked out. IMHO, predication should replace blend operations almost entirely.

c0d1f1ed wrote:

Quote:

If mask registers were designed with byte granularity in mind, all that complexity would be unnecessary. Just let k registers be 64-bit from the start, every bit would mask the corresponding byte in the vector. Larger-granularity masking would just use more than one bit per lane or just one (e.g. highest) bit to mask out the operation on the element. Comparison instructions would also set bits corresponding to the operation granularity (like pcmpgt/pcmpeq do now with xmm registers). No need for wiring different bits to different lanes at all.

That would be an excellent suggestion if not for the fact that AVX-512 would already use all 64 bits of the k registers. Extending to 1024-bit would get very messy. I don't think that's a smart move for a predication granularity that's not going to be of great value anyway.

Quote:

Vector width extension then comes naturally with k registers width extension. I suppose, at that point the integer PRF wouldn't be enough to store the extended k registers, unless one k register is multiplexed from two physical registers in the file.

Then you need two register file accesses per predication mask (up to a total of six per cycle, without counting any pointer and index registers). Also, you'd easily run out of physical registers. This sounds like a costly hack to me, and again I don't think it's worth the effort.

Well, it's hard for me to judge how difficult such an extension would be, but it looks worthwhile to me. There are solutions for this problem, besides multiplexing k registers. An x86-128 IA32 extension, for example :-D. Seriously though, k registers could be moved to a separate register file, which could be dormant most of the time, so the power consumption is not increased much. At some point, I think, 128-bit registers may appear anyway, since the need for larger precision numbers is already present in scientific/math applications.

All in all, I'm not arguing that full support for 8/16-bit operations and predication comes without a cost. My opinion is that the demand for it is significant enough to justify the possible complication. Vectors won't continue to grow much after 512 bits (1024 - yes, 2048? don't know), so the amount of complication is limited and predictable.

bronxzv · ‎11-12-2013

andysem wrote:
The masks returned by movmsk* instructions are actually very similar to those described in AVX-512,

yes, so we have basically 2 types of mask, thus my comment "otherwise we will have 3 types of masks"

andysem wrote:
The programmer's interface is also simplified, since there would be no need for __mmask8, __mmask16, etc. types but just __mmask64.

well I suppose it's a matter of personal taste, I strongly (pun intended) prefer strong typing as offered by __m512, __m512i, __m512d etc.

andysem wrote:
drawback is that movmsk* instructions won't be useful for mask construction, but given their poor performance and availability of the new cmp* instructions I don't think this would be an issue.

movmsk* instructions aren't poor performance according to my experience, at least not on Intel's CPUs

andysem · ‎11-12-2013

bronxzv wrote:

Quote:

andysem wrote: The masks returned by movmsk* instructions are actually very similar to those described in AVX-512,

yes, so we have basically 2 types of mask, thus my comment "otherwise we will have 3 types of masks"

Well, since they have the same representation and semantics, I don't separate them. But ok.

bronxzv wrote:

Quote:

andysem wrote: The programmer's interface is also simplified, since there would be no need for __mmask8, __mmask16, etc. types but just __mmask64.

well I suppose it's a matter of personal taste, I strongly (pun intended) prefer strong typing as offered by __m512, __m512i, __m512d etc.

The parallel is not correct; a mask is always a mask. It's the number of bits that differs, not their semantics (i.e. integer vs FP). You do use __m512i to store different sized integers, after all.

bronxzv wrote:

Quote:

andysem wrote: drawback is that movmsk* instructions won't be useful for mask construction, but given their poor performance and availability of the new cmp* instructions I don't think this would be an issue.

movmsk* instructions aren't poor performance according to my experience, at least not on Intel's CPUs

movmsk* are slower than just cmp* instructions, especially on older CPUs. Of course, we don't have timings for AVX-512 yet, but I hope cmp* in AVX-512 will have performance comparable to the previous extensions.

bronxzv · ‎11-12-2013

andysem wrote:
movmsk* are slower than just cmp* instructions, especially on older CPUs.

they are 1 clock throughput / 2 clock latency since Conroe or even before IIRC

another advantage of the AVX-512 tight masks (vs your sparse masks proposal) that comes to mind is that they are directly suited to accessing look-up tables, it's doable to have a LUT with 16-bit indices, but obviously not with 64-bit indices
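As a sketch of the look-up table point, the following builds a table indexed directly by a 16-bit dense mask, where each entry holds the number of enabled lanes. The table contents and the names `lanes_enabled` and `build_lanes_enabled` are chosen purely for illustration; any per-mask data (shuffle indices, for example) could be stored the same way.

```c
#include <stdint.h>

/* Table indexed directly by a 16-bit dense mask: entry m holds the number
 * of enabled lanes in m. 65536 one-byte entries occupy 64 KiB. */
static uint8_t lanes_enabled[1u << 16];

static void build_lanes_enabled(void)
{
    lanes_enabled[0] = 0;
    for (uint32_t m = 1; m < (1u << 16); ++m)
        lanes_enabled[m] = (uint8_t)(lanes_enabled[m >> 1] + (m & 1u));
}
```

The same approach with a 64-bit sparse mask as the index would need 2^64 entries, so the mask would first have to be compressed back to a dense form.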

AVX-512 is a big step forward - but repeating past mistakes!