SIMD operations on int8 (byte)

GHui · ‎06-14-2017

I heard about int8 and FP16 from someone, but I don't know what they are.

TimP · ‎06-16-2017

Your web search engine will give plenty of useful answers. We can't guess what you might ask if you were to be specific. Intel platforms which support such data formats will widen them temporarily when performing arithmetic.

GHui · ‎06-19-2017

Do the PMU counters record them? If Intel platforms widen them, do they make use of SSE or AVX? And do they (int8, FP16) calculate much faster?

TimP · ‎06-20-2017

If there is a speedup, it would come from the savings in memory bandwidth. A limited set of int8 operations is available in some version of SSE and in AVX2.

McCalpinJohn · ‎06-21-2017

SIMD operations on int8 (byte) variables are supported by MMX, SSE2, AVX, AVX2, and AVX512BW (not shipping yet).

There is pretty good support for addition/subtraction on packed byte operands:

signed or unsigned add/subtract with wraparound (the same instruction serves both, since the result bits are identical),
signed add/subtract with saturation, and
unsigned add/subtract with saturation.
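To make the difference between the three flavors concrete, here is a minimal scalar sketch in C of what a single byte lane computes in each case. The function names are made up for illustration; a SIMD instruction applies the same rule to every byte of the register at once (with SSE2 intrinsics, the corresponding calls would be `_mm_add_epi8`, `_mm_adds_epi8`, and `_mm_adds_epu8`).

```c
#include <stdint.h>

/* Wraparound add: the result is taken modulo 256. The resulting bit
   pattern is the same whether the bytes are viewed as signed or unsigned. */
uint8_t add_wrap_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}

/* Signed saturating add: the result is clamped to [-128, 127]. */
int8_t add_sat_i8(int8_t a, int8_t b) {
    int sum = (int)a + (int)b;
    if (sum > INT8_MAX) return INT8_MAX;
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}

/* Unsigned saturating add: the result is clamped to [0, 255]. */
uint8_t add_sat_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + (unsigned)b;
    return sum > UINT8_MAX ? (uint8_t)UINT8_MAX : (uint8_t)sum;
}
```

For example, 200 + 100 gives 44 with wraparound but 255 with unsigned saturation, which is usually what you want for things like pixel values.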

Bitwise logical operations don't require special versions for byte variables -- you just need to pick a SIMD boolean operation with the right register size. The same applies for loads and stores, of course.

Element-wise selection operations (e.g., MIN/MAX) are supported for vectors of byte variables by SSE, SSE4_1, AVX, AVX2, and AVX512BW, while the bytewise SIMD "compare" operations (e.g., compare for equal, compare for greater than) are supported by MMX, SSE2, AVX, AVX2, and AVX512BW. There are additional AVX512BW instructions for converting the output of compare instructions between bit mask and SIMD register formats.

Shuffle operations on byte variables are supported by SSSE3, AVX, AVX2, and AVX512BW.

Blend operations on byte variables are supported by SSE4_1, AVX, and AVX2. The special cases of selecting the maximum or minimum byte values in each position of two SIMD values are supported by SSE (unsigned only), SSE4_1, AVX, AVX2, and AVX512BW.
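As a rough illustration of how compare and blend fit together, the scalar C sketch below models one byte lane of each. The names `cmpgt_i8`, `blendv_i8`, and `max_i8` are hypothetical helpers, not library functions. It assumes the usual convention that a byte compare writes all ones (0xFF) for "true" and zero for "false", and that a variable byte blend picks its second input wherever the top bit of the mask byte is set.

```c
#include <stdint.h>

/* Signed "greater than" compare for one lane:
   0xFF where a > b, 0x00 otherwise. */
uint8_t cmpgt_i8(int8_t a, int8_t b) {
    return (a > b) ? 0xFF : 0x00;
}

/* Variable blend for one lane: choose b when the mask's
   most significant bit is set, otherwise choose a. */
int8_t blendv_i8(int8_t a, int8_t b, uint8_t mask) {
    return (mask & 0x80) ? b : a;
}

/* Signed max built from compare + blend, showing why MIN/MAX
   can be seen as a special case of blending. */
int8_t max_i8(int8_t a, int8_t b) {
    return blendv_i8(a, b, cmpgt_i8(b, a));
}
```

The dedicated MIN/MAX instructions do the same job in a single operation, so the compare-and-blend form is mainly useful when the selection condition is something other than a plain maximum or minimum.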

Support for multiplication is trickier, since multiplying two 1-byte variables produces a 2-byte result. There is an instruction (PMADDUBSW) that multiplies the unsigned bytes of one vector by the corresponding signed bytes of another, adds each adjacent pair of 16-bit products, and saturates the sums to a vector of signed 16-bit words. This is supported in SSSE3, AVX, AVX2, and AVX512BW. There is also a specialized instruction to compute the (rounded) average of the corresponding unsigned bytes in two SIMD registers (SSE, SSE2, AVX, AVX2, AVX512BW).
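The arithmetic of these two operations is easier to see in scalar form. The sketch below is a reference model in plain C, with made-up function names. It assumes the PMADDUBSW behavior of unsigned-by-signed byte products summed in adjacent pairs and saturated to signed 16 bits, and the PAVGB behavior of rounding the average up via (a + b + 1) >> 1.

```c
#include <stdint.h>

/* One 16-bit output lane of a PMADDUBSW-style multiply-add:
   u0*s0 + u1*s1, where u0/u1 are unsigned bytes and s0/s1 are
   signed bytes, saturated to the signed 16-bit range. */
int16_t maddubs_lane(uint8_t u0, int8_t s0, uint8_t u1, int8_t s1) {
    int32_t sum = (int32_t)u0 * s0 + (int32_t)u1 * s1;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Rounded average of two unsigned bytes, PAVGB-style.
   The intermediate sum is 9 bits wide, so it cannot overflow here. */
uint8_t avg_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}
```

Note that the pairwise sum can saturate: 255*127 + 255*127 is 64770, which does not fit in a signed 16-bit word and is clamped to 32767.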

There are a number of specialized operations available for SIMD vectors of byte variables as well. Some examples include:

PSIGNB -- changes sign of destination byte if source byte is negative, zeros destination byte if source byte is zero. (SSSE3, AVX, AVX2)
PABSB -- returns absolute value of each (signed) input byte in SIMD register. (SSSE3, AVX, AVX2, AVX512BW)
PSADBW -- computes differences of unsigned bytes in two SIMD registers, then horizontally adds the absolute values of those differences, returning one 16-bit result for each group of 8 bytes. (SSE, SSE2, AVX, AVX2, AVX512BW)
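For readers who find the one-line descriptions terse, here is a scalar C sketch of what each of the three instructions above computes. The function names are hypothetical, and the first two functions model a single byte lane, while `sad_u8x8` models one 8-byte group of PSADBW.

```c
#include <stdint.h>

/* PSIGNB-style lane: negate dst if src is negative, zero it if
   src is zero, pass it through if src is positive.
   (For dst == -128 the narrowing conversion of +128 is
   implementation-defined in C; typical compilers wrap it to -128.) */
int8_t sign_i8(int8_t dst, int8_t src) {
    if (src < 0) return (int8_t)(-dst);
    if (src == 0) return 0;
    return dst;
}

/* PABSB-style lane: absolute value of a signed byte. The result is
   returned as unsigned so that |-128| = 128 is representable. */
uint8_t abs_i8(int8_t x) {
    return (uint8_t)(x < 0 ? -x : x);
}

/* PSADBW-style sum of absolute differences over one group of
   8 unsigned bytes. The maximum is 8 * 255 = 2040, so 16 bits suffice. */
uint16_t sad_u8x8(const uint8_t a[8], const uint8_t b[8]) {
    unsigned sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (a[i] > b[i]) ? (unsigned)(a[i] - b[i])
                             : (unsigned)(b[i] - a[i]);
    return (uint16_t)sum;
}
```

Sum of absolute differences is the classic building block for motion estimation in video encoders, which is a large part of why such a specialized instruction exists.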

My mind boggles at the number of transistors that are required to implement these infrequently-used instructions, but that is part of what makes this field continually challenging....

What is int8 and FP16?