## AVX vs. SSE: expect to see a larger speedup

Employee

What kind of speedup should AVX deliver over SSE for vectorized addition? I would expect a 1.5x to 1.7x speedup.
There are more details in this Stack Overflow post, if someone wants points.

    #include <immintrin.h>

    // add int vectors via AVX (_mm256_add_epi32 requires AVX2)
    __attribute__((noinline))
    void add_iv_avx(__m256i *restrict a, __m256i *restrict b, __m256i *restrict out, int N) {
        __m256i *x = __builtin_assume_aligned(a, 32);
        __m256i *y = __builtin_assume_aligned(b, 32);
        __m256i *z = __builtin_assume_aligned(out, 32);

        const int loops = N / 8; // 8 is number of int32 in __m256i
        for(int i=0; i < loops; i++) {
            _mm256_store_si256(&z[i], _mm256_add_epi32(x[i], y[i]));
        }
    }

    // add int vectors via SSE; https://en.wikipedia.org/wiki/Restrict
    __attribute__((noinline))
    void add_iv_sse(__m128i *restrict a, __m128i *restrict b, __m128i *restrict out, int N) {
        __m128i *x = __builtin_assume_aligned(a, 16);
        __m128i *y = __builtin_assume_aligned(b, 16);
        __m128i *z = __builtin_assume_aligned(out, 16);

        const int loops = N / 4; // 4 is number of int32 in __m128i
        for(int i=0; i < loops; i++) {
            //out[i] = _mm_add_epi32(a[i], b[i]); // this also works
            _mm_store_si128(&z[i], _mm_add_epi32(x[i], y[i]));
        }
    }

3 Replies
Honored Contributor III

>>What type of speedup is AVX expected to have over SSE for vectorized addition?

That statement is bottlenecked by memory bandwidth and/or cache fetch latency, not by the add itself.
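To put rough numbers on that, here is a minimal sketch of the arithmetic. The helper names are hypothetical and not from the thread. It only counts loop iterations and bytes moved for `z[i] = x[i] + y[i]`, assuming two vector loads and one vector store per iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: vector iterations needed to add n int32 values when
   one register holds vec_bytes bytes (16 for SSE, 32 for AVX2). */
size_t vector_iterations(size_t n, size_t vec_bytes) {
    size_t lanes = vec_bytes / sizeof(int32_t);
    assert(lanes > 0 && n % lanes == 0);
    return n / lanes;
}

/* Hypothetical helper: bytes moved by the loop, assuming two loads and
   one store per iteration, each vec_bytes wide. */
size_t bytes_moved(size_t n, size_t vec_bytes) {
    return vector_iterations(n, vec_bytes) * 3 * vec_bytes;
}
```

For N = 1024, `vector_iterations` gives 256 for SSE and 128 for AVX, but `bytes_moved` gives 12288 in both cases. The wider registers halve the instruction count without changing the traffic, so a loop limited by memory cannot get close to 2x.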

Try something more register-based and compute-intensive. It doesn't have to be a real test, just something to illustrate adds via registers.

```
const int loops = N / 8; // 8 is number of int32 in __m256i
for(int i=0; i < loops; i++) {
    __m256i xReg = x[i];
    __m256i yReg = y[i];
    // ... register-to-register operations on xReg and yReg, result in tempReg ...
    _mm256_store_si256(&z[i], tempReg);
}
```

The above is a silly example of having five register-to-register instructions between two loads and one store. Do the same for the 128-bit version and compare the difference.
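A self-contained version of that experiment might look like the sketch below. Everything in it is an assumption for illustration: the five operations are an arbitrary pick, not the ones from the post, and the function names are hypothetical. The `target` attribute is GCC/Clang-specific and is used so the file builds without `-mavx2`. Timing is left out; you would wrap repeated calls in your own timer.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Scalar reference for one lane; unsigned math gives well-defined wraparound. */
int32_t kernel_scalar(int32_t xs, int32_t ys) {
    uint32_t x = (uint32_t)xs, y = (uint32_t)ys;
    uint32_t t = x + y;
    t ^= x;
    t -= y;
    t <<= 1;
    t &= x;
    return (int32_t)t;
}

/* 256-bit version: two loads, five register-to-register ops, one store. */
__attribute__((target("avx2")))
void kernel_avx2(const int32_t *x, const int32_t *y, int32_t *z, int n) {
    assert(n % 8 == 0); /* 8 int32 lanes per __m256i */
    for (int i = 0; i < n; i += 8) {
        __m256i xReg = _mm256_loadu_si256((const __m256i *)(x + i));
        __m256i yReg = _mm256_loadu_si256((const __m256i *)(y + i));
        __m256i t = _mm256_add_epi32(xReg, yReg);
        t = _mm256_xor_si256(t, xReg);
        t = _mm256_sub_epi32(t, yReg);
        t = _mm256_slli_epi32(t, 1);
        t = _mm256_and_si256(t, xReg);
        _mm256_storeu_si256((__m256i *)(z + i), t);
    }
}

/* 128-bit version of the same chain. */
__attribute__((target("sse2")))
void kernel_sse2(const int32_t *x, const int32_t *y, int32_t *z, int n) {
    assert(n % 4 == 0); /* 4 int32 lanes per __m128i */
    for (int i = 0; i < n; i += 4) {
        __m128i xReg = _mm_loadu_si128((const __m128i *)(x + i));
        __m128i yReg = _mm_loadu_si128((const __m128i *)(y + i));
        __m128i t = _mm_add_epi32(xReg, yReg);
        t = _mm_xor_si128(t, xReg);
        t = _mm_sub_epi32(t, yReg);
        t = _mm_slli_epi32(t, 1);
        t = _mm_and_si128(t, xReg);
        _mm_storeu_si128((__m128i *)(z + i), t);
    }
}
```

Check the generated assembly before trusting any timing: an optimizing compiler is free to simplify or reorder the chain, and the loop may still end up limited by the loads and stores if the arrays do not fit in cache.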

Jim Dempsey

Honored Contributor III

A convenient trick to eliminate the memory bottleneck in these sorts of tests is to compile your code to assembly language, then modify the indexing on the array references (in the assembly code) to eliminate the offsets. Recompile using the modified .s files. With this change, every memory reference goes to the base of the corresponding array, which will be in the cache after the first pass through the loop.
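If editing the assembly is not convenient, a similar effect can be approximated at source level. Note this is a different technique from the one just described: instead of forcing every access to the array base, it keeps the arrays small enough to stay cache-resident and runs the kernel over them many times. The sketch below assumes a 32 KiB L1 data cache; all names are hypothetical, and `add_scalar` is only a stand-in for whichever kernel you want to test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*add_fn)(const int32_t *, const int32_t *, int32_t *, size_t);

/* Stand-in kernel; replace with the SSE or AVX routine under test. */
__attribute__((noinline))
void add_scalar(const int32_t *a, const int32_t *b, int32_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = a[i] + b[i];
}

/* 3 arrays * 1024 elements * 4 bytes = 12 KiB, assumed to fit in L1. */
enum { L1_ELEMS = 1024 };

/* Runs the kernel `passes` times over the same small arrays, so every pass
   after the first should hit in cache. Returns the element count processed. */
size_t run_l1_resident(add_fn kernel, size_t passes) {
    static int32_t a[L1_ELEMS] __attribute__((aligned(32)));
    static int32_t b[L1_ELEMS] __attribute__((aligned(32)));
    static int32_t out[L1_ELEMS] __attribute__((aligned(32)));
    assert(passes > 0);
    for (size_t i = 0; i < L1_ELEMS; i++) {
        a[i] = (int32_t)i;
        b[i] = 2 * (int32_t)i;
    }
    for (size_t p = 0; p < passes; p++) {
        kernel(a, b, out, L1_ELEMS); /* time this loop with your own timer */
    }
    assert(out[L1_ELEMS - 1] == 3 * (L1_ELEMS - 1)); /* sanity check */
    return passes * L1_ELEMS;
}
```

Dividing the elapsed time of the pass loop by the returned element count gives a per-element cost with most of the memory traffic taken out of the picture.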

Honored Contributor III

That is fine when you want to determine how well the code can run... provided you get the data into L1 cache.
If you want to measure the instruction latency and/or throughput, then you need to construct a different test.

.OR.

You can rely on the work of others: http://www.agner.org/optimize/instruction_tables.pdf

Jim Dempsey