sun__lei · ‎10-28-2019

For example, suppose there is an array A of length length_A, and it is read with the AVX2 gather intrinsic (_mm256_i32gather_epi32). There are two possible memory access patterns.

1.

mm256 register = (A[0], A[1], ..., A[7])

mm256 register = (A[8], A[9], ..., A[15]), and so on

2.

stride = length_A / 8;

mm256 register = (A[0], A[stride+0], ..., A[7*stride+0])

mm256 register = (A[1], A[stride+1], ..., A[7*stride+1]), and so on

Which is better when length_A is very large?
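To make the two patterns concrete, here is a minimal sketch of how each step's index vector could be built. The function names `gather_pattern1` and `gather_pattern2`, the `AVX2_FN` macro, and the assumption that A holds 32-bit integers with length_A a multiple of 8 are all illustrative and not from the original post.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* GCC/Clang-specific: lets these functions build without -mavx2.
   The CPU must still support AVX2 at run time. */
#define AVX2_FN __attribute__((target("avx2")))

/* Pattern 1, step i: out = (A[8*i], A[8*i+1], ..., A[8*i+7]). */
AVX2_FN void gather_pattern1(const int32_t *A, int i, int32_t out[8])
{
    const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i idx = _mm256_add_epi32(_mm256_set1_epi32(8 * i), lane);
    /* scale = 4 because each index counts 4-byte elements */
    __m256i v = _mm256_i32gather_epi32((const int *)A, idx, 4);
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Pattern 2, step i: out = (A[i], A[stride+i], ..., A[7*stride+i]),
   with stride = length_A / 8. */
AVX2_FN void gather_pattern2(const int32_t *A, int length_A, int i,
                             int32_t out[8])
{
    assert(length_A % 8 == 0); /* assumption of this sketch */
    const int stride = length_A / 8;
    const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i idx = _mm256_add_epi32(
        _mm256_mullo_epi32(lane, _mm256_set1_epi32(stride)),
        _mm256_set1_epi32(i));
    __m256i v = _mm256_i32gather_epi32((const int *)A, idx, 4);
    _mm256_storeu_si256((__m256i *)out, v);
}
```

In pattern 1 the eight elements are adjacent in memory, while in pattern 2 they are stride elements apart, so for a large length_A each lane falls in a different cache line.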

McCalpinJohn · ‎10-28-2019

Pattern (1) looks like normal contiguous storage, so it should not need gather instructions. Ordinary loads should take 0.5 cycles each (two per cycle) on AVX2-capable processors (assuming data in cache).
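As a rough illustration of that point, one step of pattern (1) can be written with a single unaligned 256-bit load instead of a gather. The name `load_pattern1`, the `AVX2_FN` macro, and the 32-bit integer element type are assumptions made for this sketch.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* GCC/Clang-specific: builds without -mavx2; the CPU must still
   support AVX at run time. */
#define AVX2_FN __attribute__((target("avx2")))

/* Pattern 1, step i, with an ordinary load:
   out = (A[8*i], A[8*i+1], ..., A[8*i+7]). */
AVX2_FN void load_pattern1(const int32_t *A, int length_A, int i,
                           int32_t out[8])
{
    assert(i >= 0 && 8 * i + 8 <= length_A); /* stay inside the array */
    __m256i v = _mm256_loadu_si256((const __m256i *)(A + 8 * i));
    _mm256_storeu_si256((__m256i *)out, v);
}
```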

According to https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gather&expand=2980 the "_mm256_i32gather_epi32" intrinsic requires ~10 cycles on Haswell, ~6 cycles on Broadwell, and ~5 cycles on Skylake. Those are presumably "best case" values for data in cache. Agner Fog's instruction tables show similar values, with 12 cycles for Haswell, 7 cycles for Broadwell, and 5 cycles on Skylake (client or server).

sun__lei · ‎10-28-2019

Thanks for your reply. I want to know about the case where the data is not in the cache and has to come from main memory. Will AVX gather improve performance for random memory reads, compared with using ordinary scalar loads to read the same memory locations?
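The two variants being compared could look roughly like the sketch below, which reads the same indexed locations once with scalar loads and once with gathers. The names `sum_scalar` and `sum_gather`, the `AVX2_FN` macro, and the use of a sum as the workload are assumptions for illustration. This is only a starting point for a benchmark and makes no claim about which variant is faster.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* GCC/Clang-specific: builds without -mavx2; the CPU must still
   support AVX2 at run time. */
#define AVX2_FN __attribute__((target("avx2")))

/* Scalar baseline: one ordinary load per index. */
int32_t sum_scalar(const int32_t *A, const int32_t *idx, int n)
{
    assert(A && idx);
    int32_t sum = 0;
    for (int k = 0; k < n; k++)
        sum += A[idx[k]];
    return sum;
}

/* Same reads, eight indices at a time with one AVX2 gather. */
AVX2_FN int32_t sum_gather(const int32_t *A, const int32_t *idx, int n)
{
    assert(A && idx);
    __m256i acc = _mm256_setzero_si256();
    int k = 0;
    for (; k + 8 <= n; k += 8) {
        __m256i vidx = _mm256_loadu_si256((const __m256i *)(idx + k));
        __m256i v = _mm256_i32gather_epi32((const int *)A, vidx, 4);
        acc = _mm256_add_epi32(acc, v);
    }
    /* Horizontal sum of the eight accumulator lanes. */
    int32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    int32_t sum = 0;
    for (int j = 0; j < 8; j++)
        sum += lanes[j];
    /* Scalar tail for the last n % 8 indices. */
    for (; k < n; k++)
        sum += A[idx[k]];
    return sum;
}
```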

McCalpinJohn · ‎10-29-2019

I have not tested the AVX2 gather instructions for main memory access. For indices that have a large random component, it is typically better to run a loop with software prefetches, but these usually require some manual tuning. None of the Intel processors have a native capability of transposing data across SIMD registers, but if the data is coming from main memory the overhead of transposing the data in registers may be tolerable.
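A loop with software prefetches might be sketched as follows. The function name `sum_with_prefetch`, the `dist` parameter (how many iterations ahead to prefetch), and the choice of the `_MM_HINT_T0` hint are assumptions for this example. The best distance depends on the machine and needs the manual tuning mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_T0 */

/* Sum A[idx[k]] for k in [0, n), prefetching the element that will be
   needed dist iterations later. dist is a tuning knob, not a fixed value. */
int64_t sum_with_prefetch(const int32_t *A, const int32_t *idx,
                          size_t n, size_t dist)
{
    assert(A && idx);
    int64_t sum = 0;
    for (size_t k = 0; k < n; k++) {
        if (k + dist < n)
            _mm_prefetch((const char *)&A[idx[k + dist]], _MM_HINT_T0);
        sum += A[idx[k]];
    }
    return sum;
}
```

A call such as `sum_with_prefetch(A, idx, n, 16)` would prefetch 16 iterations ahead. The index array itself is read sequentially, so the hardware prefetchers are likely to handle it without help.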

Which AVX memory access pattern is better?