Which AVX memory access pattern is better?

sun__lei · ‎10-28-2019

For example, there are an array A. it’s length is length_A. Using AVX gather(_mm256_i32gather_i32) function to read array A. There are two memory access pattern.

1.

mm256 register = (A[0], A[1],….A[7])

mm256 register = (A[8], A[9],….A[15]),,,and so on

2.

stride = length_a /8;

mm256 register = (A[0], A[stride+0],….A[7*stride+0])

mm256 register = (A[1], A[stride+1],….A[7*stride+1]),,,and so on

which is better when length_A is very large?

jimdempseyatthecove · ‎10-29-2019

Why not test both ways, and report back.

Your actual program is likely much more complex than simply streaming through memory performing loads. One simply cannot look at optimizing two statements without looking at the whole of the problem.

The fastest car on a drag strip is not necessarily the fastest car in a formula 1 race, or fastest car on a hill climb.

Jim Dempsey

andysem · ‎10-29-2019

Ignoring all other things, in general, gathering instructions are much more expensive than regular loads. You have to have serious reasons to use gather approach - e.g. when regular vector loads are not possible or using them would require too much data shuffling that would outweigh gather overhead.

sun__lei · ‎10-29-2019

andysem wrote:
Ignoring all other things, in general, gathering instructions are much more expensive than regular loads. You have to have serious reasons to use gather approach - e.g. when regular vector loads are not possible or using them would require too much data shuffling that would outweigh gather overhead.

So if the memory addresses hit in the cache, will the gather instruction be better than regular load?

andysem · ‎10-30-2019

sun, lei wrote:
So if the memory addresses hit in the cache, will the gather instruction be better than regular load?

Again, all other aspects aside, gather instructions are more expensive. Whether that overhead would be offset by specifics of your case, is another question.

jimdempseyatthecove · ‎10-30-2019

Consider your methods 1 and 2 in post #1

Under the best of cases where the entire array A can fit in L1 cache, and upon the 2nd iteration of a loop, the entire array in fact lies within L1 cache, then:

1.

mm256 register = (A[0], A[1],….A[7]); // 4 clocks to fetch from L1
mm256 register = (A[8], A[9],….A[15]); // +4 clocks to fetch from L1
,,,and so on

2.

stride = length_a /8;
mm256 register = (A[0], A[stride+0],….A[7*stride+0]); // 4 * 8 = 32 clocks to fetch from L1 (8 fetches from L1)

mm256 register = (A[1], A[stride+1],….A[7*stride+1]); // +(4 * 8 = 32) clocks to fetch from L1
,,,and so on

*** As stated earlier the ",,,and so on" may have a significant affect on the choice of best method.

Structuring for gather/scatter should only be considered when there is no other choice. e.g. processing a column from multiple rows in an array or collecting arbitrary elements in an array. Note, these gather (or scatter) operations do not save latencies for fetch or store of the data, but instead they may potentially save some instruction fetch/execute time. However, due to instruction pipelining, the fetch latency between lanes can be used to recover any/most/all fetch/execute time of multiple instructions.

Additionally. Each core processor has a limited number of TLBs (Translation Look-aside Buffers). These are a different type of cache (not Ln) that assist in the mapping of Virtual Memory space to Physical Memory space via the process page tables. Depending on CPU model between 32 and 512 TLB's may be available. A TLB hit is between 0.5 to 1 clock cycle, and a TLB miss may be between 10 and 100 additional cycles. Depending upon the size of your array, those two instructions alone could consume 8 of those TLBs. How many more TLBs will your loop require should you sparsely access your data?

Jim Dempsey