Some time ago BOINC project RakeSearch released their software as an open source. I took it and I am trying to optimize it now. During my work I got an idea of "Array Registers". The idea is simple: registers are faster than memory, so small array placed in dedicated Array Register should be faster than one stored in memory. Intel CPUs already have XMM/YMM/ZMM registers, which looks like a natural candidate to become Array Registers. However they would have to support additional operations. At a minimum programs should be able to get and set single element at index which is non-const:
array_get(arr_reg, index): return arr_reg[index]; array_set(arr_reg, index, value): arr_reg[index] = value;
Various dedicated arithmetic and logic operations also would be helpful to speed up things, e.g. addition:
array_set(arr_reg, index, value): arr_reg[index] = arr_reg[index] + value;
Next step is to create vector version of this. "Get" is just a shuffle. With "Set" you have to deal with possible duplicated indices - when they are unique, this is shuffle too; when there are duplicates, you have data loss. Overall "Set" operation does not look very useful. Move helpful would be various arithmetic and logic operations, but of course they would be much harder to implement. Intel engineers probably discussed something like this a lot, as they decided to created AVX512CD ISA.
These new instructions would help HPC by allowing to speedup code which cannot be vectorized, There are also tons of various legacy code which depends on optimizations provided by compiler only, it also would benefit from this.
What you are essentially asking to do is to be able to perform scalar math amongst arbitrary lanes within zmm vector registers, including the location/lane within the destination register, while using variable selector(s).
Generally HPC applications process very large datasets as opposed to say 512 floats that can fit within the 32 x 16-wide registers. Unless you keep the entire working set within these registers, you are likely to have a zero gain, and possibly negative gain. I one can conceptualize an example that shows benefit, but in a general case, it would be hard to show benefit.
For small array to small array operations (those that both inputs and output can reside in registers), the latency of an isolated operation might demonstrate improvement, however in HPC programming, the issue is on throughput of vast quantities of these operations. Well coded programs attain better throughput by overlapping the memory and/or L3 and/or L2 and/or L1 fetch/store operations with expression evaluations. IOW it will not matter what the latency is as long as the computations fit within the fetch/store latencies of the encapsulating operations.
This said, there may be some use in identifying specific operations that are widely in use in HPC applications. An example:
Rotation of a 3d vector by use of matrix multiplication of 3x3 translation matrix (as well as its transposition),..
Rotation of a 3d vector by use of matrix multiplication of 4x4 (quaternion) translation matrix (as well as its transposition)..
The rotation matrix (3x3 or 4x4) could be loaded into 1 (float) or 2 (double) adjacent registers, which could be fixed, and the 3d vectors could be loaded into arbitrary registers (or for that matter be drawn from memory, but I suspect it be better fetching from memory to register in outer loop would be more efficient).
IOW the rotation matrix is loaded once, then an array of 3D vectors can be rotated using unaligned masked loads, new instruction matrix multiply, masked store. (memory fetch and store optimize by pre-loop and post-loop code).