What type of speedup is AVX expected to have over SSE for vectorized addition? I would expect a 1.5x to 1.7x speedup.
There are more details in this Stack Overflow post if someone wants the points:
https://stackoverflow.com/questions/47115510/avx-vs-sse-expect-to-see-a-larger-speedup
-AG
#include <immintrin.h>

// add int vectors via AVX
__attribute__((noinline))
void add_iv_avx(__m256i *restrict a, __m256i *restrict b, __m256i *restrict out, int N) {
    __m256i *x = __builtin_assume_aligned(a, 32);
    __m256i *y = __builtin_assume_aligned(b, 32);
    __m256i *z = __builtin_assume_aligned(out, 32);
    const int loops = N / 8; // 8 is the number of int32 in a __m256i
    for (int i = 0; i < loops; i++) {
        _mm256_store_si256(&z[i], _mm256_add_epi32(x[i], y[i]));
    }
}
// add int vectors via SSE; https://en.wikipedia.org/wiki/Restrict
__attribute__((noinline))
void add_iv_sse(__m128i *restrict a, __m128i *restrict b, __m128i *restrict out, int N) {
    __m128i *x = __builtin_assume_aligned(a, 16);
    __m128i *y = __builtin_assume_aligned(b, 16);
    __m128i *z = __builtin_assume_aligned(out, 16);
    const int loops = N / 4; // 4 is the number of int32 in a __m128i
    for (int i = 0; i < loops; i++) {
        //out[i] = _mm_add_epi32(a[i], b[i]); // this also works
        _mm_storeu_si128(&z[i], _mm_add_epi32(x[i], y[i]));
    }
}
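For completeness, a timing harness along these lines could be used to compare the two routines. This is only a sketch; the buffer size, repetition count, and the use of _mm_malloc for 32-byte alignment are assumptions, not part of the original post (compile with AVX2 enabled, e.g. -mavx2):

#include <immintrin.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical harness (not from the original post): times both add routines
   on the same 32-byte-aligned buffers and prints the observed speedup. */
int main(void) {
    const int N = 1 << 20;   // number of int32 elements (assumed)
    const int reps = 1000;   // repetitions so the timing is measurable (assumed)
    int *a = _mm_malloc(N * sizeof(int), 32);
    int *b = _mm_malloc(N * sizeof(int), 32);
    int *c = _mm_malloc(N * sizeof(int), 32);
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        add_iv_sse((__m128i *)a, (__m128i *)b, (__m128i *)c, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sse_s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        add_iv_avx((__m256i *)a, (__m256i *)b, (__m256i *)c, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double avx_s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;

    printf("SSE: %.3f s  AVX: %.3f s  speedup: %.2fx\n", sse_s, avx_s, sse_s / avx_s);
    _mm_free(a); _mm_free(b); _mm_free(c);
    return 0;
}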
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
>>What type of speedup is AVX expected to have over SSE for vectorized addition?
>>_mm256_store_si256(&z, _mm256_add_epi32(x, y));
That statement is bottlenecked by memory bandwidth (and/or cache fetch latency).
You have a load, a load, an add, and a store.
Try something more register-based and compute-intensive. It doesn't have to be a real test, just something to illustrate adds via registers.
const int loops = N / 8; // 8 is the number of int32 in a __m256i
for (int i = 0; i < loops; i++) {
    __m256i xReg = x[i];
    __m256i yReg = y[i];
    __m256i tempReg = _mm256_add_epi32(xReg, yReg);                    // add once
    tempReg = _mm256_add_epi32(tempReg, _mm256_add_epi32(xReg, yReg)); // add 2nd and 3rd
    tempReg = _mm256_add_epi32(tempReg, _mm256_add_epi32(xReg, yReg)); // add 4th and 5th
    _mm256_store_si256(&z[i], tempReg);
}
The above is a silly example of having 5 register-to-register instructions between two loads and one store. Do the same for the 128-bit version and compare the difference.
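A 128-bit counterpart might look like the sketch below (assumptions: the pointers x, y, z and the element count N come from the SSE routine earlier in the thread):

const int loops = N / 4; // 4 is the number of int32 in a __m128i
for (int i = 0; i < loops; i++) {
    __m128i xReg = x[i];
    __m128i yReg = y[i];
    __m128i tempReg = _mm_add_epi32(xReg, yReg);                 // add once
    tempReg = _mm_add_epi32(tempReg, _mm_add_epi32(xReg, yReg)); // add 2nd and 3rd
    tempReg = _mm_add_epi32(tempReg, _mm_add_epi32(xReg, yReg)); // add 4th and 5th
    _mm_store_si128(&z[i], tempReg);
}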
Jim Dempsey
A convenient trick to eliminate the memory bottleneck in these sorts of tests is to compile your code to assembly language, then modify the indexing on the array references (in the assembly code) to eliminate the offsets. Recompile using the .s files. With this change, every memory reference will go to the base of the corresponding array, which will be in the cache after the first time through the loop.
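In source terms, the effect of that edit is roughly the loop body below, shown only to illustrate the idea; making this change in the C source itself is not equivalent, because the compiler is free to hoist or eliminate the now-constant references, which is why the edit is made in the generated .s file instead:

for (int i = 0; i < loops; i++) {
    // every iteration now touches only the first vector of each array,
    // so after the first pass the data is served from L1 cache
    _mm256_store_si256(&z[0], _mm256_add_epi32(x[0], y[0]));
}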
That is fine when you want to determine how well the code can run... provided you get the data into L1 cache.
If you want to measure the instruction latency and/or throughput, then you need to construct a different test.
.OR.
You can rely on the work of others: http://www.agner.org/optimize/instruction_tables.pdf
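As a rough illustration of what such a test could look like (a sketch, not from the thread; the function name, iteration count, and single-dependency-chain structure are assumptions), one way to expose instruction latency rather than memory bandwidth is to chain the adds so that each one depends on the previous result:

#include <immintrin.h>

// Latency-oriented sketch: one long dependency chain of vector adds.
// Each add must wait for the previous result, so the loop time is
// dominated by instruction latency rather than memory traffic.
__m256i latency_chain(long iters) {
    __m256i acc = _mm256_set1_epi32(1);
    const __m256i one = _mm256_set1_epi32(1);
    for (long i = 0; i < iters; i++) {
        acc = _mm256_add_epi32(acc, one);
    }
    return acc; // return the value so the compiler cannot discard the loop
}
// For a throughput test, use several independent accumulators in the loop
// instead, so the adds can issue in parallel.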
Jim Dempsey
