- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
It is not unusual to allocate contiguous memory for an N-dimensional array. In FORTRAN 2D array for example Array(1:nX, 1:nY) you can construct a reference to Array(1:nX, y) which is a contiguous block of memory, as well as construct a reference Array(x, 1:nY) which creates an array descriptor with a stride other than one element of nX*element size. C/C++ can have analogous capability using a class.
Could you consider extending the AVX instruction set to have an alternate form of scatter/gather that takes a stride as opposed to the current table of indices?
Current form:
FOR j = 0 to 7
i= j * 32;
IF MASK[31+i] THEN
MASK[i +31:i] 0xFFFFFFFF; // extend from most significant bit
ELSE
MASK[i +31:i] 0;
FI;
ENDFOR
FOR j =0 to 7
i= j * 32;
DATA_ADDR= BASE_ADDR + (SignExtend(VINDEX1[i+31:i])*SCALE) + DISP;
IF MASK[31+i] THEN
DEST[i +31:i] FETCH_32BITS(DATA_ADDR); // a fault exits the loop
FI;
MASK[i +31:i] 0;
ENDFOR
Proposed alternate form:
FOR j = 0 to 7
i= j * 32;
IF MASK[31+i] THEN
MASK[i +31:i]= 0xFFFFFFFF; // extend from most significant bit
ELSE
MASK[i +31:i]= 0;
FI;
ENDFOR
FOR j = 0 to 7
i= j * 32;
DATA_ADDR= BASE_ADDR + (SignExtend(VINDEX1 * j)*SCALE) + DISP;
IF MASK[31+i] THEN
DEST[i +31:i]= FETCH_32BITS(DATA_ADDR); // a fault exits the loop
FI;
MASK[i +31:i]= 0;
ENDFOR
Where VINDEX1 now contains the stride (as opposed to address of table)
And it is the programmer/compiler responsibility to insert into BASE_ADDR the base address of the small vector iow the 0th element of the 8 floats.
This would provide for more efficient code in scatter/gather
Jim Dempsey
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
strided loads will be achieved easily with the planned instructions
using 8 x FP32 gather as example
VGATHERDPS result, (base,indices,scale), mask
indices = 7*stride | 6*stride | 5*stride | 4*stride | 3*stride | 2*stride | stride | 0
scale = 4
the indiceswill be typicallyconstant so the only thing to dowill be a256-bit movefrom the hot data to theindices register, typically out of critical loop so without impact on performance
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
However once there are scatter/gather instructions acting on bytes or words (see my proposal), such command will be needed and useful since the needed offsets might not fit into bytes resp. words.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
if the range for the scale is increasedthe current versatile instruction will do the trick also for 8-bit indices, a typical coding pattern will then be :
for FP32 :
indices = 7|6|5|4|3|2|1|0
scale = stride*4
for bytes:
indices = 31|30|..|2|1|0
scale = stride
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The strides might be much greater not fitting in bytes or words.
If e.g. you want to gather a vertical line from an image which is line-wise stored as a 2d-array the stride will be the line size as was the original question. The image might contain an 8 bit value per pixel (gray-scale or indexes into a palette).
I'm well aware that byte/word scatter/gather is not planned - yet.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
yes, sure, that's why I suggested to increase the rangefor scales (max scale = 8 at the moment), if the scale is any 32-bit value for example you will have the same wide address range than with a fixed 32-bit stride (also with 8-bit indices) butenjoyingthe versatiliity offered by the vector indexed addressing
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
A fixed factor (like 1,2,4,8) does not help, hence we need a variable one.
The indexes need not be ascending from 0 up but could be stored in the corresponding elements of an SSE/AVX register.
What we need is a scatter/gather command like this:
gather.word result, indexes, base_addr, stride
result: SSE/AVX register. Could optionally be equal to indexes
indexes: SSE/AVX register containing the indexes (signed or unsigned)
base_addr: effective address
stride: general purpose register such as ecx or rax
Example:
gather.word ymm0,ymm1,[rsi],ecx
Functionality (words):
for all i do
result.word := word ptr [EA + indexes.word*stride]
The stride parameter would be an extension to my original proposal of byte/word scatter/gather.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
However, in looking at the proposed method, the stride is uniform for all elements. Where element can be byte, word, dword/float, qword/double, ... (any element size supported by SSE/AVX) Further, if the stride were sourced from a GP register (as opposed to SSE/AVX register (table)) this would free up one SSE/AVX register and the time to load it.
for all i do
if mask.byte
result.byte = byte ptr [EA + i*stride*scale(1) + disp]
if GP fault exit loop
mask.byte = 0
fi
rof
// repete sequence for other element sizes
for all i do
...
result.word =word ptr [EA + i*stride*scale(2) + disp]
...
for all i do
...
result.dword =dword ptr [EA + i*stride*scale(4) + disp]
...
for all i do
...
result.qword =qword ptr [EA + i*stride*scale(8) + disp]
...
For ease in firmware if you unroll the loop
strideScale = stride * scale
ea = EA + disp
result.dword[0] = qword ptr [ea]
ea = ea + strideScale
result.dword[1] = qword ptr [ea]
ea = ea + strideScale
result.dword[2] = qword ptr [ea]
...
however you want to view it.
Essentially the index is scaled once (no change in SIB decoding), the result though is then incrimentally added to the effective address for each potential gather/scatter. This same format could be extended down to bit level and up to double double (REAL(16)).
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Of course EA can be anything a EA always is allowed to, e.g. something like rbx*8+rdi+12345 - and of course even if bytes are accessed...
Your newly introduced mask seems nice but must be specified; using the highest bit of the "indexes" register is not the best idea. What have you thought?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I don't think additional instructions would help. You need to realize that the circuitry to support word or byte gather would be very complex, meaning there's less area for other features, it would consume a considerable amount of power, and it would be slow. So personally I'd much rather just get efficient implementations of the proposed dword and qword gather instructions. More often than not you can reorder your data and perform some in-register permutations if you have to operate on word or byte size elements anyway.
In other words, dword and qword gather performance should not be compromised for the sake of programming convenience!
Likewise, strided loads are redundant when you have fully generic gather instructions. I don't think it can be supported in hardware any more efficiently than in software.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page