I'd like to make a few suggestions for additional SIMD instructions.
My main concern is the (automatic) vectorization of loops. SSE/AVX still miss a few instructions to make that something straightforward. Basically, it requires that every common operation (typically 32-bit) in a language like C has a vector equivalent.
A first missing instruction is a shift with independent shift amounts. SSE5 does have such an instrution (pshad). It would make it possible to vectorize a loop containing shift operations with non-constant shift amounts (the alternative code sequence is very long, while it should be fairly simple to implement in hardware).
Other missing operations are parallelscatter/gather. That's the vector equivalent of simple store/load operations. It would require 16 instructions to perform something like a gather operation with extract/insert instructions, for 8 x 32-bit elements. Compared to arithmetic operations that require just one instruction it's clear that thisquickly becomesa huge bottleneck.
Of course it can be argued that one should use Structure-of-Array data layouts and store/load entire vectors in one instruction, but that's simply not always possible. Automatic vectorizing compilers can't simply change the data layout. And even when the application developer is aware of the benefits of SoA some algorithms just require scatter/gather. Even for something like multiplying matrices a gather instruction would be of great help.
Other fields in which these instructions would be useful are multimedia, HPC, physics, ray-tracing, etc. Basically anything that uses a stream processing model with loops that can be parallelized with SIMD. My own field op expertise is 3D software rendering(for games, medical imaging, etc). In particular the gather instruction would help very considerably with texture sampling, especially with large filter kernels (anisotropic filtering).
I realize it's a huge challenge to implement a fast but generic gather instruction but it looks like a crucial step to increase SIMD performance. As vectors get wider and arithmetic performance goes up,gathering data elements from different memory locations is quickly becoming the bottleneck and preventing the processor from ever reaching its theoretical throughput.
Note that a first implementation doesn't necessarily have to be capable of loading 8 x 32-bit elements all in parallel. Even loading two or four elements per cycle would be a worthwhile improvement compared to the alternatives. Also, memory ordering doesn't have to be that strict. For stream processing the input and output typically doesn't overlap, and to synchronize with other threadsthe existingmemory fence instructions could be used to ensure that all vector elements have been read/written before any other memory operation executes.
I'd really love to discuss the possibility of adding these instructions to future AVX extensions. I believe they could revolutionize CPU performance in the next decade, by potentially increasing performance 8-fold for loops that previously processed just one stream element per iteration...
I found some more sorely missing instructions:
Conversion between 32-bit unsigned integer and single-precisionfloating-point (both ways). Currently this would require to first zero extend the integers to 64-bit, convert two vectors to double-precision floating-point, convert to single-precision floating-piont, and then pack them back into one vector. This defeats the purpose of (automatic) vectorization.It should be very easy to implement in hardware as a variation of cvtdq2ps/cvtps2dq.
Your suggestions are very insightful. Maybe we should add a way to collect, rate, and track this topic to a public forum. Do you think many developers would participate?
Some developers have asked that we make the instruction set completely orthogonal, so that gaps like the unsigned int to float conversion would not exist. There is a strong desire to enable vectorizing compilers (and I think that SSE4.1 and AVX both reflect this), but the process continues to involve a cost/benefit analysis. Given this, do you have an example of a common usage of the unsigned int to float conversion that, if accelerated, wouldrepresent a noticeable application-level performance improvement?
I think it would indeed be very useful to discuss this with a wider audience of developers. The decisions made now can have a very big influence on the future of high-performance computing on the PC platform. Multi-core and SIMD advancements are totally reshaping software development for these applications...
In my humble opinion it is therefore crucial that there is a focus on productivity and a lasting synergy between software and hardware. Complex things for which there is a limited transistor budget today could be cheap and straightforward to implement a few generations later. But it's important to have the software to hardware interface, the instructions, available as soon as possible. So yes, filling in the gaps of the vector instruction set to make it just as versatile as the scalar instruction set seems very important to me.
For the unsigned int to float conversion case I have a specific and a generic example. The specific example is the use of unsigned integers to represent values between 0.0 and 1.0. For 32-bit integers 00000000h represents 0.0 and FFFFFFFFh represents 1.0. This is very common in computer graphics as many values are strictly within the 0.0-1.0 range (color components, normalized screen coordinates, depth buffer data, etc.) while maximizing the resolution. Lots of (streaming) input data is encoded this way, using 8, 16 or 32 bits, or something in between. When unpacking the smaller data elements they often end up in the higher portion of the register elements (e.g. an 8-bit value representing 1.0 becomes FF000000h). Converting them to floating-point without having an unsigned conversion requires to first shift it back to the right. With 32-bit elements it's not an option to shift if no precision loss is allowed; they have to be converted to 64-bit first. So in all these cases an unsigned integer to float instruction would have a direct positive effect on performance. There are also specific applications in the DSP field.
The generic example is the automatic vectorization of loops. For whatever reason, the programmer may choose to use unsigned integers, even if the range of values could fit in a signed integer. He is unaware and quite frankly should not be aware of any performance implications this can have. So it is of great importance to vectorizing compilers that there is no discrimination in performance for all the popular data types (or at least no significant difference relative to the scalar case). A straightforward translation of scalar code to vector code should be guaranteed to be faster.
And I'm not sure if we should actually always question whether or not something already has many uses today, or whether the cost/benefit is positive. Instruction extensions last for decades but are added in small increments, and for software developers it's a chicken and egg problem. If they want a certain instruction but it's not avialable they come up with an adequate alternative. After a while they become so used to the alternative that they hardly feel a need for the instruction any more. In my opinion adequate alternatives are not good enough any more if we want to guarantee a thriving future for x86 CPUs. The instruction set is either complete or it's not, and I see little reason to keep making tiny additions if it is possible to add it all at once. Like I said, the first implementation doesn't need to have a lot of die space dedicated to it. Programmers first need the interface. For the case of gather/scatter the first implementation could even use microcode to sequentially load/store each element (this saves instruction space so it's already a win). If it turns out these instructions are indeed used a lot, multiple parallel load/store units could be used in later implementations. I realize that opcode space shouldn't be wasted either, but sooner or later we need these instructions anyway to make progress...
Sorry for the rant. I just think the many extensions we've seen from MMX to SSE4 have caused quite a few headaches for developers, which only led to very slow adoption. It took five generations of SIMD extensions to add the 'packusdw' instruction... and only recently do I start to see compilers using SSE2 by default and applications depending on it. I just hope AVX or its direct successor will be 'done right' and include the vector equivalent of all regular scalar operations sooner rather than later, and I look forward to see these instructions become faster and faster every generation.
Eric, may I suggest something?
Intel should create a new instruction submission page. People could suggest instruction, enter their rationale, and attach example source code.
You could then evaluate the performance of the new instruction in an emulator, and you could rate those suggestions by performance benefit and ease of implementation, while developers would rate them by usefullness.
Sort of a double voting system where you would get an insight on what the developers consider priority for inclusion, and at the same time the developers could see the actual benefit of those instructions they suggested based on your emulation tests.
What do you think?
- universal shuffling with variable selectors
- universal blending with variable selectors
- universal shift and rotate of bytes with variable counts
- parallel table lookup
It can also solve the problem that PALIGNR has a constant shift count. PALIGNR is needed for any operation on two arrays that are misaligned relative to each other. If the misalignment is not known at compile time then you need 16 branches for each of the possible shift counts of PALIGNR. This is currently done in the most efficient implementations of the memcpy function. PPERM can be used as a PALIGNR with variable shift count so that we don't need 16 branches.
Parrallel table lookup is also an important application. Table lookup is often used in optimized code, but this is an obstackle to vectorization. PPERM can do 16 parrallel lookups in a table of up to 32 entries. With YMM registers we can have 64 entries in the table, and we can only look forward to further extensions like ZMM or whatever providing still larger tables. Table lookup is used in encryption and text processing. Tables with 32-bit or 64-bit elements could be useful in math and many other applications.
The SSE5 PPERM uses the extra selector bits for optionally inverting or reversing bits or setting to zero. However, this does not fit the extension of register sizes to YMM. The vacant selector bits should be reserved for future extensions of XMM to YMM, ZMM, etc.
Finally, regarding the proposed forum for public discussion of new instructions. Such a forum would be a very good idea. The current situation where Intel and AMD are competing to invent new instructions and keeping their intentions secret is only producing mutually incompatible instructions, some of which are not very useful. A public forum would probably result in a large number of proposals, and it would be up to public discussion to decide which proposals are most useful.