mem address directly from SSE/AVX register

Luchezar_B_ · ‎11-14-2013

Hello, I would like to make a suggestion

Very often [otherwise well vectorizible] algorithms require reading/writing from/to mem addresses which are calculated per-channel (reading from table, sampling a texture, etc.).
When you get to this, you are forced to make that part of the algorithm scalar by extracting each channel in turn to a GP register, performing the memory operation and then inserting the result back to a vector register.
I don't think a single instruction that interprets each channel as an address and reads/writes to different memory locations at once is hardware feasible (though it would be extremely good) but at least we could have something that would ease the situation.

my suggestion is instructions for memory access that get the address directly from the sse/avx register:

loadd $(i + (j<<4)), %xmm0, %xmm1 - read 32-bit word from address specified in the i-th dword of xmm0 and store it in j-th quarter of xmm1
stored $(i + (j<<4)), %xmm0, %xmm1 - read 32-bit word from j-th quarter of xmm1 and store it to address specified in the i-th dword of xmm0
+ variants for 64-bit addresses and other data sizes (loadw, loadq, loaddq, storew, storeq, storedq), etc.

something like that, you get the idea :)

jimdempseyatthecove · ‎11-14-2013

Depending on your processor you may have scatter/gather capability. These instructions use a small vector of indicies (off a common base address). You can also mask which indicies are used. AVX2 has as an example:

__m256i _mm256_mask_i32gather_epi32 (__m256i src, int const* base_addr, __m256i vindex, __m256i mask, const int scale)
#include "immintrin.h"
Instruction: vpgatherdd
CPUID Feature Flag: AVX2
Description
Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
Operation
FOR j := 0 to 7
    i := j*32
    IF mask[i+31]
        dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
        mask[i+31] := 0
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
mask[MAX:256] := 0
dst[MAX:256] := 0

Jim Dempsey

Luchezar_B_ · ‎11-14-2013

Thank you for your reply!

Very interesting instructions, i didn't know about them.
I have a little trouble to figure out what exactly does this vm32x mean in the documentation (the second operand)
but it looks like it is some unique and very special case only for these instructions.
I don't see variants for storing to memory but the reading is much more important.

i think these are quite nice instructions. unfortunately AVX2 is a bit too high requirement for the time being :(

McCalpinJohn · ‎11-14-2013

The Xeon Phi ISA has instructions with this sort of functionality, such as VGATHERDPD and VSCATTERDQ.

These are described in the Xeon Phi ISA specification: http://software.intel.com/sites/default/files/forum/278102/327364001en.pdf

For insertion or extraction of a single element, these work directly. To gather or scatter multiple elements to/from a vector register the operation must be put in a loop (since the hardware can only extract or insert values from/to a single cache line in each execution of the instruction). The hardware is capable of gathering/scattering multiple elements from/to a single cache line, so it does not necessarily need to be executed once for each element.