Hello, I would like to make a suggestion.
Very often, otherwise well-vectorizable algorithms need to read from or write to memory addresses that are calculated per channel (reading from a table, sampling a texture, etc.).
When you get to this point, you are forced to make that part of the algorithm scalar: extract each channel in turn to a GP register, perform the memory operation, and then insert the result back into a vector register.
I don't think a single instruction that interprets each channel as an address and reads from or writes to several memory locations at once is feasible in hardware (though it would be extremely good), but at least we could have something that would ease the situation.
My suggestion is instructions for memory access that take the address directly from the SSE/AVX register:
loadd $(i + (j<<4)), %xmm0, %xmm1 - read 32-bit word from address specified in the i-th dword of xmm0 and store it in j-th quarter of xmm1
stored $(i + (j<<4)), %xmm0, %xmm1 - read 32-bit word from j-th quarter of xmm1 and store it to address specified in the i-th dword of xmm0
+ variants for 64-bit addresses and other data sizes (loadw, loadq, loaddq, storew, storeq, storedq), etc.
something like that, you get the idea :)
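For readers who have not written the scalar workaround described above, here is a minimal sketch of it in C with SSE2 intrinsics. The function name lookup4 and the array-based interface are made up for illustration, and the plain-C branch is only there so the sketch also builds where SSE2 is unavailable. It assumes every index is in range for the table.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Hypothetical helper: out[k] = table[idx[k]] for four 32-bit indices.
   The caller must guarantee that every index is valid for table. */
void lookup4(const int32_t *table, const int32_t idx[4], int32_t out[4])
{
    assert(table && idx && out);
#ifdef HAVE_SSE2
    __m128i v = _mm_loadu_si128((const __m128i *)idx);

    /* Extract each channel to a general-purpose register in turn. */
    int32_t i0 = _mm_cvtsi128_si32(v);
    int32_t i1 = _mm_cvtsi128_si32(_mm_srli_si128(v, 4));
    int32_t i2 = _mm_cvtsi128_si32(_mm_srli_si128(v, 8));
    int32_t i3 = _mm_cvtsi128_si32(_mm_srli_si128(v, 12));

    /* Four scalar loads, then rebuild the vector (arguments run from the
       highest lane to the lowest). */
    __m128i r = _mm_set_epi32(table[i3], table[i2], table[i1], table[i0]);
    _mm_storeu_si128((__m128i *)out, r);
#else
    /* Portable fallback with the same result. */
    for (int k = 0; k < 4; ++k)
        out[k] = table[idx[k]];
#endif
}
```

The extract and rebuild steps around the four loads are the overhead that the proposed instructions are meant to remove.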
Depending on your processor, you may have scatter/gather capability. These instructions use a small vector of indices (offsets from a common base address). You can also mask which indices are used. AVX2 has, as an example:
__m256i _mm256_mask_i32gather_epi32 (__m256i src, int const* base_addr, __m256i vindex, __m256i mask, const int scale)
CPUID Feature Flag: AVX2
Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
        mask[i+31] := 0
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0
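As a rough sketch of how the intrinsic above might be called from C: the wrapper name masked_gather8 and its array-based interface are illustrative only and not part of any library. The AVX2 path is compiled only when AVX2 is enabled (for example with -mavx2 on GCC or Clang). Otherwise a scalar loop, written to follow the description above, is used instead. Indices are assumed to be valid for every lane whose mask element has its top bit set.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Hypothetical wrapper: for k = 0..7,
   out[k] = base[idx[k]] if the top bit of mask[k] is set, else src[k]. */
void masked_gather8(const int32_t *base, const int32_t idx[8],
                    const int32_t mask[8], const int32_t src[8],
                    int32_t out[8])
{
    assert(base && idx && mask && src && out);
#ifdef __AVX2__
    __m256i vsrc  = _mm256_loadu_si256((const __m256i *)src);
    __m256i vidx  = _mm256_loadu_si256((const __m256i *)idx);
    __m256i vmask = _mm256_loadu_si256((const __m256i *)mask);

    /* scale = 4 because the indices count 32-bit elements, not bytes. */
    __m256i v = _mm256_mask_i32gather_epi32(vsrc, base, vidx, vmask, 4);
    _mm256_storeu_si256((__m256i *)out, v);
#else
    /* Scalar model of the same operation; a negative value means the
       top bit is set. */
    for (int k = 0; k < 8; ++k)
        out[k] = (mask[k] < 0) ? base[idx[k]] : src[k];
#endif
}
```

Setting a mask element to -1 selects that lane for gathering, and setting it to 0 keeps the corresponding value from src.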
Thank you for your reply!
Very interesting instructions, I didn't know about them.
I have a little trouble figuring out what exactly vm32x (the second operand) means in the documentation,
but it looks like some unique and very special case used only by these instructions.
I don't see variants for storing to memory, but reading is much more important.
I think these are quite nice instructions. Unfortunately, AVX2 is a bit too high a requirement for the time being :(
The Xeon Phi ISA has instructions with this sort of functionality, such as VGATHERDPD and VPSCATTERDQ.
These are described in the Xeon Phi ISA specification: http://software.intel.com/sites/default/files/forum/278102/327364001en.pdf
For insertion or extraction of a single element, these work directly. To gather or scatter multiple elements to/from a vector register the operation must be put in a loop (since the hardware can only extract or insert values from/to a single cache line in each execution of the instruction). The hardware is capable of gathering/scattering multiple elements from/to a single cache line, so it does not necessarily need to be executed once for each element.
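The loop described above can be modelled in plain C. This is only a simulation for illustration, not Xeon Phi code: the names gather_step and gather_loop are made up, the 64-byte line size and 16 lanes are stated as constants, and the choice of the first pending lane to pick the cache line is an assumption of this model (the real selection is up to the hardware).

```c
#include <assert.h>
#include <stdint.h>

#define LANES      16   /* 16 x 32-bit lanes, as in a 512-bit register */
#define LINE_BYTES 64   /* cache-line size assumed by this model */

/* One modelled execution of a masked gather: pick the first pending lane,
   service every pending lane whose element lies in the same cache line,
   and clear the mask bits of the serviced lanes. Returns the new mask. */
uint16_t gather_step(const int32_t *base, const int32_t idx[LANES],
                     uint16_t mask, int32_t dst[LANES])
{
    assert(base && idx && dst);
    int first = -1;
    for (int k = 0; k < LANES; ++k) {
        if ((mask >> k) & 1u) { first = k; break; }
    }
    if (first < 0)
        return 0;                       /* nothing left to gather */

    uintptr_t line = (uintptr_t)(base + idx[first]) / LINE_BYTES;
    for (int k = 0; k < LANES; ++k) {
        if (((mask >> k) & 1u) &&
            (uintptr_t)(base + idx[k]) / LINE_BYTES == line) {
            dst[k] = base[idx[k]];
            mask = (uint16_t)(mask & ~(1u << k));
        }
    }
    return mask;
}

/* Repeat the step until no mask bits remain; returns the number of steps. */
int gather_loop(const int32_t *base, const int32_t idx[LANES],
                uint16_t mask, int32_t dst[LANES])
{
    int steps = 0;
    while (mask != 0) {
        mask = gather_step(base, idx, mask, dst);
        ++steps;
    }
    return steps;
}
```

In this model the number of steps equals the number of distinct cache lines touched by the pending lanes, so indices that all fall in one line finish in a single step, while fully scattered indices take one step per element.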