What type of speedup is AVX expected to have over SSE for vectorized addition? I would expect a 1.5x to 1.7x speedup.
There are more details in this Stack Overflow post if someone wants the points:
https://stackoverflow.com/questions/47115510/avx-vs-sse-expect-to-see-a-larger-speedup
-AG
#include <immintrin.h>

// add int vectors via AVX
__attribute__((noinline))
void add_iv_avx(__m256i *restrict a, __m256i *restrict b, __m256i *restrict out, int N) {
    __m256i *x = __builtin_assume_aligned(a, 32);
    __m256i *y = __builtin_assume_aligned(b, 32);
    __m256i *z = __builtin_assume_aligned(out, 32);
    const int loops = N / 8; // 8 is the number of int32 in a __m256i
    for (int i = 0; i < loops; i++) {
        _mm256_store_si256(&z[i], _mm256_add_epi32(x[i], y[i]));
    }
}
// add int vectors via SSE; https://en.wikipedia.org/wiki/Restrict
__attribute__((noinline))
void add_iv_sse(__m128i *restrict a, __m128i *restrict b, __m128i *restrict out, int N) {
    __m128i *x = __builtin_assume_aligned(a, 16);
    __m128i *y = __builtin_assume_aligned(b, 16);
    __m128i *z = __builtin_assume_aligned(out, 16);
    const int loops = N / 4; // 4 is the number of int32 in a __m128i
    for (int i = 0; i < loops; i++) {
        //out[i] = _mm_add_epi32(a[i], b[i]); // this also works
        _mm_storeu_si128(&z[i], _mm_add_epi32(x[i], y[i]));
    }
}
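For completeness, a timing harness along these lines could be used to compare the two routines. This is only a sketch; the buffer size, repetition count, and the use of _mm_malloc for 32-byte alignment are assumptions, not part of the original post (compile with AVX2 enabled, e.g. -mavx2):

#include <immintrin.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical harness (not from the original post): times both add routines
   on the same 32-byte-aligned buffers and prints the observed speedup. */
int main(void) {
    const int N = 1 << 20;   // number of int32 elements (assumed)
    const int reps = 1000;   // repetitions so the timing is measurable (assumed)
    int *a = _mm_malloc(N * sizeof(int), 32);
    int *b = _mm_malloc(N * sizeof(int), 32);
    int *c = _mm_malloc(N * sizeof(int), 32);
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        add_iv_sse((__m128i *)a, (__m128i *)b, (__m128i *)c, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sse_s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        add_iv_avx((__m256i *)a, (__m256i *)b, (__m256i *)c, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double avx_s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;

    printf("SSE: %.3f s  AVX: %.3f s  speedup: %.2fx\n", sse_s, avx_s, sse_s / avx_s);
    _mm_free(a); _mm_free(b); _mm_free(c);
    return 0;
}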
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
>>What type of speedup is AVX expected to have over SSE for vectorized addition?
>>_mm256_store_si256(&z, _mm256_add_epi32(x, y));
That statement is bottlenecked by memory bandwidth (and/or cache fetch latency).
You have a load, a load, an add, and a store.
Try something more register-based and compute-intensive. It doesn't have to be a real test, just something to illustrate adds via registers.
const int loops = N / 8; // 8 is the number of int32 in a __m256i
for (int i = 0; i < loops; i++) {
    __m256i xReg = x[i];
    __m256i yReg = y[i];
    __m256i tempReg = _mm256_add_epi32(xReg, yReg);                    // add once
    tempReg = _mm256_add_epi32(tempReg, _mm256_add_epi32(xReg, yReg)); // add 2nd and 3rd
    tempReg = _mm256_add_epi32(tempReg, _mm256_add_epi32(xReg, yReg)); // add 4th and 5th
    _mm256_store_si256(&z[i], tempReg);
}
The above is a silly example of having 5 register-to-register instructions between two loads and one store. Do the same for the 128-bit version and compare the difference.
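A 128-bit counterpart might look like the sketch below (assumptions: the pointers x, y, z and the element count N come from the SSE routine earlier in the thread):

const int loops = N / 4; // 4 is the number of int32 in a __m128i
for (int i = 0; i < loops; i++) {
    __m128i xReg = x[i];
    __m128i yReg = y[i];
    __m128i tempReg = _mm_add_epi32(xReg, yReg);                 // add once
    tempReg = _mm_add_epi32(tempReg, _mm_add_epi32(xReg, yReg)); // add 2nd and 3rd
    tempReg = _mm_add_epi32(tempReg, _mm_add_epi32(xReg, yReg)); // add 4th and 5th
    _mm_store_si128(&z[i], tempReg);
}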
Jim Dempsey
A convenient trick to eliminate the memory bottleneck in these sorts of tests is to compile your code to assembly language, then modify the indexing on the array references (in the assembly code) to eliminate the offsets. Recompile using the .s files. With this change, every memory reference will go to the base of the corresponding array, which will be in the cache after the first time through the loop.
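In source terms, the effect of that edit is roughly the loop body below, shown only to illustrate the idea; making this change in the C source itself is not equivalent, because the compiler is free to hoist or eliminate the now-constant references, which is why the edit is made in the generated .s file instead:

for (int i = 0; i < loops; i++) {
    // every iteration now touches only the first vector of each array,
    // so after the first pass the data is served from L1 cache
    _mm256_store_si256(&z[0], _mm256_add_epi32(x[0], y[0]));
}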
That is fine when you want to determine how well the code can run... provided you get the data into L1 cache.
If you want to measure the instruction latency and/or throughput, then you need to construct a different test.
.OR.
You can rely on the work of others: http://www.agner.org/optimize/instruction_tables.pdf
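As a rough illustration of what such a test could look like (a sketch, not from the thread; the function name, iteration count, and single-dependency-chain structure are assumptions), one way to expose instruction latency rather than memory bandwidth is to chain the adds so that each one depends on the previous result:

#include <immintrin.h>

// Latency-oriented sketch: one long dependency chain of vector adds.
// Each add must wait for the previous result, so the loop time is
// dominated by instruction latency rather than memory traffic.
__m256i latency_chain(long iters) {
    __m256i acc = _mm256_set1_epi32(1);
    const __m256i one = _mm256_set1_epi32(1);
    for (long i = 0; i < iters; i++) {
        acc = _mm256_add_epi32(acc, one);
    }
    return acc; // return the value so the compiler cannot discard the loop
}
// For a throughput test, use several independent accumulators in the loop
// instead, so the adds can issue in parallel.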
Jim Dempsey
