_mm256_shuffle_epi8

Ravi_K_ · ‎05-25-2015

Hi,

I am going through the documentation for _mm256_shuffle_epi8:
https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/index.htm#GUID-0E477F94-9588-4A78-9381-0E2D08ED8E04.htm

The pseudocode there covers only 16 bytes:

...
for (i = 0; i < 16; i++){
    if (b[i] & 0x80){
        r[i] = 0;
    }
    else {
        r[i] = a[b[i] & 0x0F];
    }
}
...

Is there an updated document that explains the 32-byte case?

Thanks.

Vladimir_Sedach · ‎05-25-2015

Ravi,

Download:
https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf

and find the intrinsic (Ctrl-F3).

_mm256_shuffle_epi8() permutes the high-order 128 bits using only the high-order 128 bits of both parameters.
The method is the same as for the low-order 128 bits, so bytes never cross between the two 128-bit halves.
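
To make the per-lane behaviour concrete, here is a minimal scalar sketch in plain C of how the 32-byte shuffle can be read: the 16-byte pseudocode from the question, applied independently to each 128-bit half. The function name model_shuffle_epi8_256 is made up for illustration and is not an Intel API; the sketch models the semantics only and uses no intrinsics.

```c
#include <stdint.h>

/* Scalar model of a 256-bit byte shuffle that works within 128-bit lanes.
   Hypothetical helper for illustration only; r must not alias a or b. */
void model_shuffle_epi8_256(uint8_t r[32], const uint8_t a[32], const uint8_t b[32])
{
    for (int lane = 0; lane < 2; lane++) {
        int base = lane * 16;               /* 0 for the low lane, 16 for the high lane */
        for (int i = 0; i < 16; i++) {
            uint8_t sel = b[base + i];
            if (sel & 0x80)
                r[base + i] = 0;            /* bit 7 set: result byte is zeroed */
            else
                r[base + i] = a[base + (sel & 0x0F)];  /* index stays inside the same lane */
        }
    }
}
```

Only the low 4 bits of each selector are used as an index, so a selector such as 0x1F in the high lane picks byte 15 of the high lane, not byte 31 of the whole vector counted from the low lane.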

Ravi_K_ · ‎05-26-2015

Vladimir,

Thanks for the reference. What I want to achieve with _mm256_shuffle_epi8 is to reverse the byte order across the whole register, i.e. swap

0 - 31
1 - 30
2 - 29
...
31 - 0

I tried _mm256_shuffle_epi8 but could not get it working. From your explanation, I think I am using it for the wrong purpose. Any suggestions on which intrinsics I should look at?

Thanks,
Ravi

Vladimir_Sedach · ‎05-26-2015

Ravi,

I'm using code like this:
// "sign" allows comparing unsigned numbers with _mm256_cmpgt_epi32
// after _mm256_shuffle_epi8 we have 15, 14,...0, 31, 30,...16
// after _mm256_permute2f128_si256 we have 31, 30,...16, 15, 14,...0

   __m256i   ff = _mm256_set1_epi32(-1);
   __m256i   idx = _mm256_setr_epi8(
       15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0,
       15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
   __m256i   sign = _mm256_set1_epi32(0x80000000);
   __m256i   v0, v1;
   __m256i   eq, gt0, gt1;

   v0 = _mm256_loadu_si256((__m256i *)a);
   v1 = _mm256_loadu_si256((__m256i *)b);

   eq = _mm256_cmpeq_epi32(v0, v1);
   if (!_mm256_testc_si256(eq, ff))   //not equal
   {
       v0 = _mm256_shuffle_epi8(v0, idx);
       v1 = _mm256_shuffle_epi8(v1, idx);

       v0 = _mm256_xor_si256(v0, sign);
       v1 = _mm256_xor_si256(v1, sign);

       v0 = _mm256_permute2f128_si256(v0, v0, 0x01);
       v1 = _mm256_permute2f128_si256(v1, v1, 0x01);

       gt0 = _mm256_cmpgt_epi32(v0, v1);
       gt1 = _mm256_cmpgt_epi32(v1, v0);

       return _mm256_movemask_ps(_mm256_castsi256_ps(gt0)) - _mm256_movemask_ps(_mm256_castsi256_ps(gt1));
   }
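
For the byte reversal asked about above, the relevant part of this snippet is the pair of steps "reverse inside each 128-bit lane with _mm256_shuffle_epi8 and idx" followed by "swap the two lanes with _mm256_permute2f128_si256(v, v, 0x01)". The following is a rough scalar sketch in plain C of why those two steps together should give a full 32-byte reversal. All function names here are hypothetical, and the code only models the data movement; it does not call the intrinsics.

```c
#include <stdint.h>
#include <string.h>

/* Step 1 (models the shuffle with idx = 15, 14, ..., 0 in each lane):
   reverse the bytes inside each 16-byte lane. r must not alias v. */
static void model_reverse_in_lanes(uint8_t r[32], const uint8_t v[32])
{
    for (int lane = 0; lane < 2; lane++) {
        int base = lane * 16;
        for (int i = 0; i < 16; i++)
            r[base + i] = v[base + 15 - i];
    }
}

/* Step 2 (models the lane permute with control 0x01 and both inputs equal):
   exchange the low and high 16-byte lanes. r must not alias v. */
static void model_swap_lanes(uint8_t r[32], const uint8_t v[32])
{
    memcpy(r, v + 16, 16);
    memcpy(r + 16, v, 16);
}

/* Both steps combined: r[i] ends up holding v[31 - i]. */
void model_reverse32(uint8_t r[32], const uint8_t v[32])
{
    uint8_t t[32];
    model_reverse_in_lanes(t, v);
    model_swap_lanes(r, t);
}
```

After step 1 the buffer holds 15, 14,...0, 31, 30,...16, and after step 2 it holds 31, 30,...16, 15, 14,...0, which matches the comments at the top of the snippet.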