
Migration of Legacy 128 bit SSE (AVX instructions for 128 bit integer operations)

Suppose there is a 128-bit (xmm) SIMD intrinsic, _mm_add_epi16 (adds eight 16-bit integers). The AVX programming reference mentions that a performance gain can be achieved if legacy 128-bit instructions are processed in AVX mode (VEX.128).

2.8.2 Using AVX 128-bit Instructions Instead of Legacy SSE instructions

Applications using AVX and FMA should migrate legacy 128-bit SIMD instructions to their 128-bit AVX equivalents. AVX supplies the full complement of 128-bit SIMD instructions except for AES and PCLMULQDQ.

Now, the syntax of the add intrinsic is PADDB: __m128i _mm_add_epi8 (__m128i a, __m128i b). But it is mentioned that the AVX instruction is VPADDB.

How can the AVX version of this integer intrinsic (VPADDB) be used? Is there a separate AVX intrinsic for it?

Is this the way to do it?

__m256i data1, data2, data3;

_mm256_zeroall ();
_mm256_zeroupper ();

data3 = _mm_add_epi8 ((__m128i) data1, (__m128i) data2);

Will this perform better than using the legacy 128-bit registers and intrinsics?


3 Replies

If you are in a context where you would require frequent zeroupper, it would seem that sticking with SSE-128 might give better performance. Even if that is not the case, the advantage of AVX-128 over equivalent SSE-128 comes mainly when it economizes the number of micro-ops executed without increasing the number of data transfers from cache.

Adding to what Tim said: in addition to instruction length, another benefit of AVX comes from the third operand. For example, if you have to compute a = b + c, AVX needs fewer instructions.
Traditional SSE:
movaps a, b
addps a, c

AVX:
vaddps a, b, c

When you compile your code with the /arch:AVX switch, the compiler knows that it has to emit the AVX form of the instruction for an intrinsic, provided the intrinsic shares the same name and semantics for AVX and SSE. If you are working in integer space and are not using 256-bit registers, you don't need to declare your variables as __m256i.
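To make that concrete, here is a minimal sketch of the all-128-bit case. The function name add_bytes_128 is just a hypothetical helper for illustration. The source is plain SSE2 intrinsics with __m128i; the only thing that changes is the compiler switch (/arch:AVX on Windows, or -mavx with GCC/Clang), in which case the add should come out as vpaddb instead of paddb. You can confirm this by looking at the generated assembly.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_add_epi8 */
#include <stdint.h>

/* Hypothetical helper: adds 16 packed 8-bit integers (wrapping on overflow).
   dst, a and b each point to at least 16 bytes; no alignment is assumed. */
void add_bytes_128(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* Same intrinsic for SSE and AVX builds; the compiler picks the
       legacy or VEX encoding based on the target switch. */
    __m128i vsum = _mm_add_epi8(va, vb);

    _mm_storeu_si128((__m128i *)dst, vsum);
}
```

Note there is no __m256i, no cast, and no zeroupper/zeroall call anywhere in this version.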

You don't need to use VZEROALL; this instruction is mainly for the OS and other specific tasks. Applications should use VZEROUPPER only.
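For the case where you really do use 256-bit registers, a rough sketch of where VZEROUPPER would go is below. The helper name add_floats_256 and the AVX_TARGET macro are my own assumptions for illustration. It uses floats because AVX without AVX2 has no 256-bit integer add. The idea is to clear the upper halves once, at the end of the 256-bit work, before control returns to code that may still be legacy SSE. Many compilers insert this for you, so treat the explicit call as illustrative.

```c
#include <immintrin.h>  /* AVX intrinsics: __m256, _mm256_add_ps, _mm256_zeroupper */

/* Assumption: lets GCC/Clang compile this one function for AVX even when
   the file is built without -mavx. Not needed with -mavx or /arch:AVX. */
#if defined(__GNUC__) && !defined(__AVX__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical helper: adds 8 packed floats using 256-bit registers.
   dst, a and b each point to at least 8 floats; no alignment is assumed. */
AVX_TARGET
void add_floats_256(float *dst, const float *a, const float *b)
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(dst, _mm256_add_ps(va, vb));

    /* Zero the upper 128 bits of the YMM registers before returning,
       so later legacy SSE code does not pay a transition penalty. */
    _mm256_zeroupper();
}
```

The call sits outside any inner loop on purpose, which ties in with Tim's point: if you would need it frequently, the transitions may cost more than AVX-128 gains.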


If you use your old 128-bit intrinsics code with the /arch:AVX compiler switch, the instructions will be generated in the VEX form (3 operand). If your code is all 128-bit, you don't need zeroupper/zeroall.