gilrgrgmail_com
Beginner
334 Views

_mm_unpackhi_epi8 and _mm_unpacklo_epi8 to convert 16 signed chars into 2 signed short vectors

I am using _mm_unpacklo_epi16 and _mm_unpackhi_epi16 with a vector of 0s as the second argument to convert signed/unsigned short vectors into 2 signed/unsigned integer vectors, i.e.:

__m128i lowVec  = _mm_unpacklo_epi16(vecA, vec0);
__m128i highVec = _mm_unpackhi_epi16(vecA,vec0);

The same approach works fine for converting a vector of 16 unsigned chars into 2 unsigned short vectors using _mm_unpacklo_epi8 and _mm_unpackhi_epi8. Yet when the input vector holds 16 signed chars, the short values in the 2 result vectors are all 127 + the original values.

I found a way to overcome this by adding 127 before the unpack and subtracting 127 immediately after it, yet this is very inelegant.

Another way was to use _mm_cvtepi8_epi16 and shift operations to get the wanted values, but this was less elegant than the previous add/sub and the performance was worse.

According to the documentation of _mm_unpacklo_epi8 and _mm_unpackhi_epi8, there was not supposed to be any problem with signed chars...

 

6 Replies
gilgil
Beginner
334 Views

Regarding the 2nd approach :

    __m128i vecLow = _mm_cvtepi8_epi16(vec1);
    __m128i vecHigh = _mm_cvtepi8_epi16(_mm_srli_si128(vec1, 8));

The performance problem is caused by the latency of _mm_srli_si128.

Brandon_H_Intel
Employee
334 Views

A more fleshed-out test case would help me understand exactly how you're setting and reading your vectors. For example, the following appears to work OK for me with positive numbers (I'd have some understandable problems with negative numbers, but I don't think your complaint is about that).

#include <iostream>
#include <emmintrin.h>

int main() {
   short x[8] = {0,1,2,3,4,5,6,7};
   __m128i x128, x0, lowResult, hiResult;

   x0 = _mm_setzero_si128();
   // _mm_set_epi16 takes its arguments from the highest element down to the lowest
   x128 = _mm_set_epi16(x[7],x[6],x[5],x[4],x[3],x[2],x[1],x[0]);

   // Interleaving with zeros zero-extends each 16-bit element to 32 bits
   lowResult = _mm_unpacklo_epi16(x128, x0);
   hiResult = _mm_unpackhi_epi16(x128, x0);

   std::cout << "Values are: \n";
   // m128i_i32 is the Microsoft-specific member view of __m128i
   for(int i = 0; i < 4; ++i)
      std::cout << "hi = " << hiResult.m128i_i32[i] << std::endl << "low = " << lowResult.m128i_i32[i] << std::endl;
   return 0;
}

 

I just compiled this with icl 15.0.2 using the default debug 32-bit configuration in Microsoft Visual Studio 2013*. When I run it, I get:

 

Values are:
hi = 4
low = 0
hi = 5
low = 1
hi = 6
low = 2
hi = 7
low = 3
Press any key to continue . . .

gilgil
Beginner
334 Views

Hi 

_mm_unpacklo_epi16 and _mm_unpackhi_epi16 work fine with both signed and unsigned shorts. The problem is with signed chars:

    __m128i vec1 = _mm_set_epi8(-11,-15,-34,-37,121,-98,45,-77,-40,-88,90,32,-14,66,53,-60);

Using _mm_unpacklo_epi8 and _mm_unpackhi_epi8 on vec1 yields unsigned shorts...
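
For reference, here is a minimal sketch of what interleaving with a zero vector does to the values above. The helper name unpack_lo_with_zero and the demo function are hypothetical, not from the thread, and the input is written as an array from the lowest element upward (the reverse of the _mm_set_epi8 argument order). It assumes an SSE2-capable x86 target.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the thread): widens the low 8 bytes of src
// by interleaving them with a zero vector, as in the original question.
std::array<std::int16_t, 8> unpack_lo_with_zero(const std::int8_t* src) {
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    const __m128i zero = _mm_setzero_si128();
    // Each 16-bit lane gets the source byte in its low half and 0x00 in its
    // high half, so the byte is zero-extended, not sign-extended.
    const __m128i wide = _mm_unpacklo_epi8(v, zero);
    std::array<std::int16_t, 8> out{};
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), wide);
    return out;
}

// Same values as vec1 in the post, listed from element 0 upward.
void demo_zero_extension() {
    const std::int8_t vec1[16] = {-60, 53, 66, -14, 32, 90, -88, -40,
                                  -77, 45, -98, 121, -37, -34, -15, -11};
    const auto lo = unpack_lo_with_zero(vec1);
    assert(lo[0] == 196);  // -60 is byte 0xC4, which reads as 196 when zero-extended
    assert(lo[1] == 53);   // non-negative values pass through unchanged
    assert(lo[3] == 242);  // -14 is byte 0xF2
}
```

In this sketch every negative input comes out as 256 + the original value, while non-negative inputs are unchanged.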

Brandon_H_Intel
Employee
334 Views

Hi gilgil,

I've had a developer review this, and his response is that the unpack intrinsics you're trying to use should not be expected to sign extend (nor are they documented to do so). They simply interleave with vec0, which effectively zero extends the numbers. He also recommends that, for your addition/subtraction workaround to work, you use 128 rather than 127, as using 127 will give you a wrong answer when the input value is exactly -128 (the answer will be +128, not -128).

If you have followup questions here, I think it would be really helpful to see a compilable/runnable example from you to ensure that we're on the same page.
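
The 128-versus-127 point can be checked with a small sketch of the add/unpack/subtract workaround. This is only one possible way to write it: the helper widen_lo_biased and its bias parameter are hypothetical names introduced for illustration, and it assumes an SSE2-capable x86 target where converting 128 to char wraps to 0x80.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: add a bias in 8 bits (wrapping), zero-extend by
// unpacking with zero, then subtract the same bias in 16 bits.
std::array<std::int16_t, 8> widen_lo_biased(const std::int8_t* src, int bias) {
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    const __m128i zero = _mm_setzero_si128();
    const __m128i bias8 = _mm_set1_epi8(static_cast<char>(bias));
    const __m128i bias16 = _mm_set1_epi16(static_cast<short>(bias));
    const __m128i biased = _mm_add_epi8(v, bias8);         // wraps modulo 256
    const __m128i wide = _mm_unpacklo_epi8(biased, zero);  // zero-extends
    const __m128i result = _mm_sub_epi16(wide, bias16);    // undo the bias
    std::array<std::int16_t, 8> out{};
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), result);
    return out;
}

void demo_bias() {
    const std::int8_t in[16] = {-128, -1, 0, 127};  // remaining elements are 0
    const auto with128 = widen_lo_biased(in, 128);
    const auto with127 = widen_lo_biased(in, 127);
    // A bias of 128 maps [-128, 127] onto [0, 255] with no wrap-around.
    assert(with128[0] == -128 && with128[1] == -1);
    assert(with128[2] == 0 && with128[3] == 127);
    // A bias of 127 wraps for -128 only: -128 + 127 = -1 becomes byte 0xFF (255).
    assert(with127[0] == 128);
    assert(with127[1] == -1 && with127[3] == 127);
}
```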

gilrgrgmail_com
Beginner
334 Views


Thanks for the reply.

I tested it some more and came up with a much better way to sign extend with the unpack functions:

_mm_unpacklo/hi_epi16/8(vec0, vecA); // Instead of _mm_unpacklo/hi_epi16/8(vecA, vec0)

I then apply an arithmetic right shift of 16/8 bits to the result, at the widened element size:

_mm_srai_epi32/16(_mm_unpacklo/hi_epi16/8(vec0, vecA), 16/8);

This way the code is much more elegant and the performance penalty is minimized. 
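
Spelled out for the signed char case, the idea above might look like the following sketch. The helper name widen_by_shift is hypothetical, and an SSE2-capable x86 target is assumed. The short-to-int case would presumably follow the same pattern with _mm_unpacklo/hi_epi16 and _mm_srai_epi32 by 16.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: sign-extends 16 signed chars to 16 shorts by unpacking
// with zero as the FIRST operand and then shifting right arithmetically.
std::array<std::int16_t, 16> widen_by_shift(const std::int8_t* src) {
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    const __m128i zero = _mm_setzero_si128();
    // With zero first, each source byte lands in the UPPER half of its 16-bit
    // lane; the arithmetic shift by 8 moves it down and replicates the sign bit.
    const __m128i lo = _mm_srai_epi16(_mm_unpacklo_epi8(zero, v), 8);
    const __m128i hi = _mm_srai_epi16(_mm_unpackhi_epi8(zero, v), 8);
    std::array<std::int16_t, 16> out{};
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), lo);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data() + 8), hi);
    return out;
}

void demo_widen_by_shift() {
    // Values of vec1 from earlier in the thread, listed from element 0 upward.
    const std::int8_t vec1[16] = {-60, 53, 66, -14, 32, 90, -88, -40,
                                  -77, 45, -98, 121, -37, -34, -15, -11};
    const auto out = widen_by_shift(vec1);
    for (int i = 0; i < 16; ++i) {
        assert(out[i] == vec1[i]);  // every element keeps its signed value
    }
}
```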

 

 

 

JWong19
Beginner
334 Views

I'd write it this way instead (one instruction fewer)...

__m128i vecAL = _mm_unpacklo_epi8(vecA, _mm_cmplt_epi8(vecA, vec0));
__m128i vecAH = _mm_unpackhi_epi8(vecA, _mm_cmplt_epi8(vecA, vec0));

__m128i vecBL = _mm_unpacklo_epi16(vecB, _mm_cmplt_epi16(vecB, vec0));
__m128i vecBH = _mm_unpackhi_epi16(vecB, _mm_cmplt_epi16(vecB, vec0));
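
As a self-contained sketch of the char case above: _mm_cmplt_epi8(vecA, vec0) yields 0xFF for every negative byte and 0x00 otherwise, which is exactly the upper byte a sign-extended short needs. The helper name widen_by_sign_mask is hypothetical, and an SSE2-capable x86 target is assumed.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: sign-extends 16 signed chars to 16 shorts by
// interleaving each byte with a computed sign mask instead of zero.
std::array<std::int16_t, 16> widen_by_sign_mask(const std::int8_t* src) {
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    const __m128i zero = _mm_setzero_si128();
    // 0xFF where the byte is negative, 0x00 elsewhere; computed once and
    // reused for both halves.
    const __m128i sign = _mm_cmplt_epi8(v, zero);
    const __m128i lo = _mm_unpacklo_epi8(v, sign);
    const __m128i hi = _mm_unpackhi_epi8(v, sign);
    std::array<std::int16_t, 16> out{};
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), lo);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data() + 8), hi);
    return out;
}

void demo_widen_by_sign_mask() {
    // Values of vec1 from earlier in the thread, listed from element 0 upward.
    const std::int8_t vec1[16] = {-60, 53, 66, -14, 32, 90, -88, -40,
                                  -77, 45, -98, 121, -37, -34, -15, -11};
    const auto out = widen_by_sign_mask(vec1);
    for (int i = 0; i < 16; ++i) {
        assert(out[i] == vec1[i]);  // every element keeps its signed value
    }
}
```

Per pair of output vectors this is one compare plus two unpacks, versus two unpacks plus two shifts for the shift-based version.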

 
