It is not unusual to allocate contiguous memory for an N-dimensional array. For a FORTRAN 2D array such as Array(1:nX, 1:nY), you can construct a reference to Array(1:nX, y), which is a contiguous block of memory, as well as a reference to Array(x, 1:nY), which creates an array descriptor whose stride is not one element but nX * element size. C/C++ can provide an analogous capability using a class.
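As a rough illustration of the two kinds of reference described above, here is a minimal C sketch. The `strided_view` type and the `view_fixed_y`, `view_fixed_x` and `view_get` helpers are hypothetical names invented for this example, and the indices are 0-based rather than FORTRAN's 1-based.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor for a 1-D view into a larger array. */
typedef struct {
    float *base;    /* address of the first element of the view */
    size_t count;   /* number of elements in the view */
    size_t stride;  /* distance between consecutive elements, in elements */
} strided_view;

/* Column-major (FORTRAN-order) storage: element (x, y) lives at a[x + y*nX]. */

/* Analogue of Array(1:nX, y): a contiguous block, stride 1. */
static strided_view view_fixed_y(float *a, size_t nX, size_t nY, size_t y) {
    assert(y < nY);
    strided_view v = { a + y * nX, nX, 1 };
    return v;
}

/* Analogue of Array(x, 1:nY): consecutive elements are nX elements apart. */
static strided_view view_fixed_x(float *a, size_t nX, size_t nY, size_t x) {
    assert(x < nX);
    strided_view v = { a + x, nY, nX };
    return v;
}

/* Read element i of a view through its stride. */
static float view_get(strided_view v, size_t i) {
    assert(i < v.count);
    return v.base[i * v.stride];
}
```

The second view is the case that would need a gather (or a strided load) to be read eight elements at a time.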
Could you consider extending the AVX instruction set to have an alternate form of scatter/gather that takes a stride as opposed to the current table of indices?
Current form:
FOR j = 0 to 7
i= j * 32;
IF MASK[31+i] THEN
MASK[i +31:i]= 0xFFFFFFFF; // extend from most significant bit
ELSE
MASK[i +31:i]= 0;
FI;
ENDFOR
FOR j =0 to 7
i= j * 32;
DATA_ADDR= BASE_ADDR + (SignExtend(VINDEX1[i+31:i])*SCALE) + DISP;
IF MASK[31+i] THEN
DEST[i +31:i]= FETCH_32BITS(DATA_ADDR); // a fault exits the loop
FI;
MASK[i +31:i]= 0;
ENDFOR
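For readers who prefer C to the pseudocode, the current form can be modelled in scalar code roughly as follows. This is only a sketch of the semantics, not how the hardware works: `gather8_model` is a made-up name, the mask is kept as eight 32-bit lanes, and fault handling is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of an 8 x 32-bit gather with 32-bit indices.
   A lane is loaded only if the top bit of mask[j] is set;
   each mask lane is cleared once it has been processed. */
static void gather8_model(float dest[8], const void *base_addr,
                          const int32_t vindex[8], int scale, int64_t disp,
                          uint32_t mask[8]) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    for (int j = 0; j < 8; ++j) {
        if (mask[j] & 0x80000000u) {
            /* DATA_ADDR = BASE_ADDR + SignExtend(VINDEX[j]) * SCALE + DISP */
            const char *addr = (const char *)base_addr
                             + (int64_t)vindex[j] * scale + disp;
            memcpy(&dest[j], addr, sizeof(float));
        }
        mask[j] = 0;
    }
}
```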
Proposed alternate form:
FOR j = 0 to 7
i= j * 32;
IF MASK[31+i] THEN
MASK[i +31:i]= 0xFFFFFFFF; // extend from most significant bit
ELSE
MASK[i +31:i]= 0;
FI;
ENDFOR
FOR j = 0 to 7
i= j * 32;
DATA_ADDR= BASE_ADDR + (SignExtend(VINDEX1 * j)*SCALE) + DISP;
IF MASK[31+i] THEN
DEST[i +31:i]= FETCH_32BITS(DATA_ADDR); // a fault exits the loop
FI;
MASK[i +31:i]= 0;
ENDFOR
Here VINDEX1 contains the stride (as opposed to the table of indices).
It is the programmer's/compiler's responsibility to place in BASE_ADDR the base address of the small vector, in other words the address of the 0th element of the 8 floats.
This would provide for more efficient code in scatter/gather.
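The proposed alternate form can be modelled in scalar C along the same lines. Again this is only a sketch of the intended semantics: `strided_gather8_model` is a hypothetical name, the stride is passed as a plain scalar for simplicity, and faults are not modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of the proposed strided gather of 8 x 32-bit elements.
   Lane j reads from base_addr + (stride * j) * scale + disp,
   so no table of indices is needed. */
static void strided_gather8_model(float dest[8], const void *base_addr,
                                  int32_t stride, int scale, int64_t disp,
                                  uint32_t mask[8]) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    for (int j = 0; j < 8; ++j) {
        if (mask[j] & 0x80000000u) {
            const char *addr = (const char *)base_addr
                             + (int64_t)stride * j * scale + disp;
            memcpy(&dest[j], addr, sizeof(float));
        }
        mask[j] = 0;  /* lane done, clear its mask as in the pseudocode */
    }
}
```

With scale = 4 and stride = nX, this would read eight consecutive elements of Array(x, 1:nY) starting at whatever element BASE_ADDR points to.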
Jim Dempsey
Strided loads will be easy to achieve with the planned instructions.
Using an 8 x FP32 gather as an example:
VGATHERDPS result, (base,indices,scale), mask
indices = 7*stride | 6*stride | 5*stride | 4*stride | 3*stride | 2*stride | stride | 0
scale = 4
The indices will typically be constant, so the only thing to do will be a 256-bit move from the hot data to the indices register, typically outside the critical loop and so without impact on performance.
indices = 31|30|..|2|1|0
scale = stride
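To make the first variant concrete, the index vector can be built once outside the loop as sketched below. `make_stride_indices` and `lane_byte_offset` are hypothetical helpers for illustration only. Note that the second variant (indices 0, 1, 2, ... with scale = stride) relies on the scale field being able to hold the stride, and VSIB addressing only encodes scales of 1, 2, 4 or 8.

```c
#include <assert.h>
#include <stdint.h>

/* Fill idx with 0, stride, 2*stride, ..., 7*stride (stride in elements).
   This is done once, outside the hot loop. */
static void make_stride_indices(int32_t idx[8], int32_t stride) {
    for (int j = 0; j < 8; ++j) {
        idx[j] = j * stride;
    }
}

/* Byte offset from the base address that lane j of the gather would use. */
static int64_t lane_byte_offset(const int32_t idx[8], int j, int scale) {
    assert(j >= 0 && j < 8);
    return (int64_t)idx[j] * scale;
}
```

With scale = 4 (the size of an FP32 element), lane j then addresses element j * stride of the array. The resulting index vector is what one would pass to a gather such as the AVX2 intrinsic `_mm256_i32gather_ps(base, indices, 4)`.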
I don't think additional instructions would help. You need to realize that the circuitry to support word or byte gather would be very complex, meaning there's less area for other features, it would consume a considerable amount of power, and it would be slow. So personally I'd much rather just get efficient implementations of the proposed dword and qword gather instructions. More often than not you can reorder your data and perform some in-register permutations if you have to operate on word or byte size elements anyway.
In other words, dword and qword gather performance should not be compromised for the sake of programming convenience!
Likewise, strided loads are redundant when you have fully generic gather instructions. I don't think they can be supported in hardware any more efficiently than in software.