We have some SSE code that is effectively trying to do 3D texturing from a volume dataset. Our datasets often have only 16 bits of information per voxel. Our inner loop calculates 8 indices into the volume data array and grabs 8 voxels based on those indices. I thought the new gather instructions would be ideal for this, but unfortunately they do not support loading 16-bit quantities.
The ideal way to implement this in the inner loop would be to use an unfortunately nonexistent 16-bit gather that grabs 8 16-bit quantities, sign- or zero-extends them to 32 bits, and slots them into the 8 32-bit lanes of a 256-bit register.
int16_t *volume_data_base_addr = ...;
indices = ...; // calculate 8 indices of 32-bits each
values = _mm256_i32gather_epi16(volume_data_base_addr, indices, 2);
... operate on the values ...
Unfortunately, _mm256_i32gather_epi16 does not exist. So my question is: what would be another way to approach this? I have thought of 3 strategies:
1) Use _mm256_i32gather_epi32
We would fetch an extra 16 bits that we do not want for each voxel, and then either mask off the extra bits or ignore them. For the loop above, we would gather the values like this:
values = _mm256_i32gather_epi32((const int *)volume_data_base_addr, indices, 2);
... mask off the 16 bits we do not want with _mm256_and_si256
... or ignore the extra bits by using epi16 instructions
Since the set of epi16-capable instructions is not complete, we would probably have to mask off the extra 16 bits and use epi32 instructions. I am guessing this approach might not be that bad from a performance standpoint, since the extra 16 bits will probably be in the L1 cache anyway, so they will not really add to memory bandwidth.
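To make strategy 1 concrete, here is a minimal sketch of what I have in mind. The function names are made up for illustration, and it assumes little-endian x86 plus at least 2 bytes of padding after the last voxel, because each lane reads 4 bytes starting at the voxel's address. The results are written to a plain array only to keep the example easy to check; in a real inner loop the values would stay in a register.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: gather 8 voxels as zero-extended 32-bit values.
   Assumes the buffer is padded by at least 2 bytes past the last voxel. */
__attribute__((target("avx2")))
static void gather8_u16(const uint16_t *volume, const int32_t idx[8],
                        int32_t out[8])
{
    __m256i indices = _mm256_loadu_si256((const __m256i *)idx);
    /* scale = 2, so each index is multiplied by sizeof(uint16_t) */
    __m256i raw = _mm256_i32gather_epi32((const int *)volume, indices, 2);
    /* the wanted voxel is in the low 16 bits of each lane; clear the rest */
    __m256i vals = _mm256_and_si256(raw, _mm256_set1_epi32(0xFFFF));
    _mm256_storeu_si256((__m256i *)out, vals);
}

/* Hypothetical helper: same gather, but sign-extend each 16-bit voxel. */
__attribute__((target("avx2")))
static void gather8_s16(const int16_t *volume, const int32_t idx[8],
                        int32_t out[8])
{
    __m256i indices = _mm256_loadu_si256((const __m256i *)idx);
    __m256i raw = _mm256_i32gather_epi32((const int *)volume, indices, 2);
    /* shift the voxel to the top of the lane, then arithmetic-shift back */
    __m256i vals = _mm256_srai_epi32(_mm256_slli_epi32(raw, 16), 16);
    _mm256_storeu_si256((__m256i *)out, vals);
}
```

The mask or shift pair costs one or two extra instructions per 8 voxels, which is probably small next to the gather itself.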
2) Use non-gather scalar loads
This is the way we did it before gather was available. We would use AVX to calculate 8 indices in parallel, but then just extract the 8 indices and do some scalar ops on the result. In theory we could repack the results of the 8 scalar loads into a ymm register, but in many cases that did not actually increase performance because our calculations were so simple.
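Roughly, strategy 2 looks like the sketch below. The function name is hypothetical, and the load of the index array is only a stand-in for our real vectorized index calculation. It includes the optional repack step; when the later math is trivial we would just consume the scalars directly.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: fetch 8 voxels with scalar loads, then repack. */
__attribute__((target("avx2")))
static void load8_s16_scalar(const int16_t *volume, const int32_t idx_in[8],
                             int32_t out[8])
{
    /* stand-in for the vectorized index calculation */
    __m256i indices = _mm256_loadu_si256((const __m256i *)idx_in);

    /* spill the 8 indices to memory so scalar code can use them */
    int32_t idx[8];
    _mm256_storeu_si256((__m256i *)idx, indices);

    /* 8 scalar loads; converting int16_t to int sign-extends each voxel.
       _mm256_setr_epi32 places the first argument in the lowest lane. */
    __m256i vals = _mm256_setr_epi32(
        volume[idx[0]], volume[idx[1]], volume[idx[2]], volume[idx[3]],
        volume[idx[4]], volume[idx[5]], volume[idx[6]], volume[idx[7]]);

    _mm256_storeu_si256((__m256i *)out, vals);
}
```

Unlike the 32-bit gather, this never reads past the end of the volume, so no padding is needed.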
3) Turn our 16-bit volumes into 32-bit float volumes.
The existing gather instruction would work perfectly. We could do all our calculations in floating point, which seems to be more efficient for AVX2 in many cases. The downside is that our volumes are very large (gigavoxels), so this would waste a lot of memory. It would also presumably require twice the memory bandwidth, which is probably the killer.
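For comparison, strategy 3 would reduce to something like this sketch (hypothetical function name, and it assumes the volume has already been converted to float). For scale, one gigavoxel takes about 2 GB at 16 bits per voxel and about 4 GB at 32 bits per voxel.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: gather 8 voxels from a volume stored as floats. */
__attribute__((target("avx2")))
static void gather8_f32(const float *volume_f32, const int32_t idx[8],
                        float out[8])
{
    __m256i indices = _mm256_loadu_si256((const __m256i *)idx);
    /* scale = 4, so each index is multiplied by sizeof(float) */
    __m256 vals = _mm256_i32gather_ps(volume_f32, indices, 4);
    _mm256_storeu_ps(out, vals);
}
```

No masking or extension is needed here, and every lane reads exactly one element, so there is no over-read at the end of the buffer.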
The latency, from my measurements with each index referencing a different cache line, is 15 clks for the XMM variants (in 64-bit) and 19 clks for the YMM variants (in 64-bit). There really is no advantage in latency to using the GATHER instructions vs. doing it yourself with regular instructions. There might be an advantage in instruction density, decode and tokens (which would depend upon how these are microcoded by Intel). As Agner says, I'd look at changing how your data is stored. Maybe you can work with the instructions you have at hand then.