For example, there is an array A whose length is length_A. I use the AVX2 gather intrinsic (_mm256_i32gather_epi32) to read array A. There are two possible memory access patterns.
1.
mm256 register = (A[0], A[1], ..., A[7])
mm256 register = (A[8], A[9], ..., A[15]), and so on
2.
stride = length_A / 8;
mm256 register = (A[0], A[stride+0], ..., A[7*stride+0])
mm256 register = (A[1], A[stride+1], ..., A[7*stride+1]), and so on
Which pattern is better when length_A is very large?
Pattern (1) is ordinary contiguous storage, so it should not need gather instructions at all. Ordinary loads should take 0.5 cycles each on AVX2-capable processors (assuming data in cache).
According to https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gather&expand=2980 the "_mm256_i32gather_epi32" intrinsic requires ~10 cycles on Haswell, ~6 cycles on Broadwell, and ~5 cycles on Skylake. Those are presumably "best case" values for data in cache. Agner Fog's instruction tables show similar values, with 12 cycles for Haswell, 7 cycles for Broadwell, and 5 cycles on Skylake (client or server).
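To make the two patterns from the question concrete, here is a minimal sketch in C. It assumes length_A is a multiple of 8 and that all indices fit in a signed 32-bit integer (the gather takes 32-bit indices); the function names sum_pattern1, sum_pattern2 and hsum_epi32 are made up for illustration, and the `target("avx2")` attribute assumes GCC or Clang. Pattern (1) is written with plain vector loads, and pattern (2) with a gather whose index vector is (0, stride, ..., 7*stride).

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Add up the eight 32-bit lanes of v (wraps modulo 2^32). */
__attribute__((target("avx2")))
static uint32_t hsum_epi32(__m256i v)
{
    uint32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, v);
    uint32_t s = 0;
    for (int k = 0; k < 8; k++)
        s += lanes[k];
    return s;
}

/* Pattern (1): (A[i], ..., A[i+7]) is contiguous, so a plain load is enough. */
__attribute__((target("avx2")))
uint32_t sum_pattern1(const int32_t *A, size_t length_A)
{
    assert(length_A % 8 == 0);
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < length_A; i += 8)
        acc = _mm256_add_epi32(acc, _mm256_loadu_si256((const __m256i *)(A + i)));
    return hsum_epi32(acc);
}

/* Pattern (2): lane k reads A[k*stride + i], which requires a gather. */
__attribute__((target("avx2")))
uint32_t sum_pattern2(const int32_t *A, size_t length_A)
{
    /* Assumption: indices up to 7*stride must fit in a signed 32-bit lane. */
    assert(length_A % 8 == 0 && length_A <= (size_t)INT32_MAX);
    const int32_t stride = (int32_t)(length_A / 8);
    const __m256i idx = _mm256_mullo_epi32(_mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7),
                                           _mm256_set1_epi32(stride));
    __m256i acc = _mm256_setzero_si256();
    for (int32_t i = 0; i < stride; i++)
        acc = _mm256_add_epi32(acc, _mm256_i32gather_epi32((const int *)(A + i), idx, 4));
    return hsum_epi32(acc);
}
```

Both functions visit every element exactly once, so they return the same sum; only the order of the memory accesses differs. Which one is faster for a very large array would have to be measured on the target machine.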
McCalpin, John (Blackbelt) wrote: Pattern (1) looks like normal contiguous storage, so it should not use gather instructions? Ordinary loads should take 0.5 cycles each on AVX2-capable processors (assuming data in cache).
According to https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gathe... the "_mm256_i32gather_epi32" instruction requires ~10 cycles on Haswell, ~6 cycles on Broadwell, and ~5 cycles on Skylake. Those are presumably "best case" values for data in cache. Agner Fog's instruction tables show similar values, with 12 cycles for Haswell, 7 cycles for Broadwell, and 5 cycles on Skylake (client or server).
Thanks for your reply. I am interested in the case where the data is not in the cache and has to come from main memory. Specifically, I want to know whether AVX gather improves performance for random memory reads compared with ordinary scalar loads of the same memory locations.
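The two variants being compared here could look roughly like the sketch below. It only shows that both read the same locations and produce the same values; it makes no claim about which is faster from main memory. The names read_scalar and read_gather are hypothetical, n is assumed to be a multiple of 8, every index is assumed to be a valid non-negative offset into A, and the `target("avx2")` attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar baseline: out[j] = A[idx[j]], one load per element. */
void read_scalar(const int32_t *A, const int32_t *idx, size_t n, int32_t *out)
{
    for (size_t j = 0; j < n; j++)
        out[j] = A[idx[j]];
}

/* Gather version: eight indices are consumed per gather instruction. */
__attribute__((target("avx2")))
void read_gather(const int32_t *A, const int32_t *idx, size_t n, int32_t *out)
{
    assert(n % 8 == 0);  /* assumption: no remainder loop in this sketch */
    for (size_t j = 0; j < n; j += 8) {
        __m256i vi = _mm256_loadu_si256((const __m256i *)(idx + j));
        __m256i v  = _mm256_i32gather_epi32((const int *)A, vi, 4);
        _mm256_storeu_si256((__m256i *)(out + j), v);
    }
}
```

Timing both functions over an index array much larger than the last-level cache would be one way to answer the question for a specific processor.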
I have not tested the AVX2 gather instructions for main memory access. For indices that have a large random component, it is typically better to run a loop with software prefetches, but these usually require some manual tuning. None of the Intel processors have a native capability of transposing data across SIMD registers, but if the data is coming from main memory the overhead of transposing the data in registers may be tolerable.
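A loop with software prefetches, as suggested above, might be sketched as follows. This is only an illustration under stated assumptions: the function name sum_indexed_prefetch is made up, the example sums the gathered values just to have something to compute, and PREFETCH_DISTANCE = 16 is an arbitrary starting value that would need the manual tuning mentioned above. `_mm_prefetch` with `_MM_HINT_T0` is the standard SSE prefetch intrinsic.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed tuning parameter: how many iterations ahead to prefetch. */
enum { PREFETCH_DISTANCE = 16 };

/* Sum A[idx[j]] for j in [0, n), prefetching the element needed
   PREFETCH_DISTANCE iterations later so its cache miss overlaps current work. */
int64_t sum_indexed_prefetch(const int32_t *A, const int32_t *idx, size_t n)
{
    assert(A != NULL && idx != NULL);
    int64_t sum = 0;
    for (size_t j = 0; j < n; j++) {
        if (j + PREFETCH_DISTANCE < n)
            _mm_prefetch((const char *)&A[idx[j + PREFETCH_DISTANCE]], _MM_HINT_T0);
        sum += A[idx[j]];
    }
    return sum;
}
```

The index array itself is read sequentially, so the hardware prefetchers should normally handle it; only the random accesses into A are prefetched explicitly here.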