division sse2 intrinsic

Smart_Lubobya · ‎05-27-2010

How would I apply an SSE2 intrinsic to a division such as:
y=x/4
I tried _mm_div_epi16(x,4), but it did not work. In the manual I have seen _mm_div_pd(a,b), but this is for SSE and not SSE2. Please help.

matthieu_darbois · ‎05-27-2010

Hi,

There is no integer division instruction in SSEx, hence no intrinsic. However, for a division by a power of two, you can use the shift intrinsics: _mm_srai_epi16 for signed integers and _mm_srli_epi16 for unsigned integers. (For negative values the arithmetic shift rounds toward negative infinity, whereas C integer division truncates toward zero.)
For other constant divisors, you might be able to use a multiplication followed by a shift in order to achieve an integer division.
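A minimal sketch of both suggestions, assuming 16-bit lanes. The helper names (div4_epu16, div4_epi16, div10_epu16) are made up for illustration. The signed version adds a bias of 3 to negative lanes before shifting, so that it truncates toward zero like C's x / 4. The constant 0xCCCD with a total shift of 19 is the usual reciprocal for unsigned 16-bit division by 10.

```c
#include <emmintrin.h> /* SSE2 */

/* Unsigned 16-bit lanes: x / 4 is a logical right shift by 2. */
static __m128i div4_epu16(__m128i x)
{
    return _mm_srli_epi16(x, 2);
}

/* Signed 16-bit lanes: add 3 to negative lanes first, so that the
   arithmetic shift truncates toward zero like C's x / 4. */
static __m128i div4_epi16(__m128i x)
{
    __m128i sign = _mm_srai_epi16(x, 15);    /* 0xFFFF in negative lanes, else 0 */
    __m128i bias = _mm_srli_epi16(sign, 14); /* 3 in negative lanes, else 0 */
    return _mm_srai_epi16(_mm_add_epi16(x, bias), 2);
}

/* Unsigned 16-bit lanes: x / 10 == (x * 0xCCCD) >> 19. */
static __m128i div10_epu16(__m128i x)
{
    /* _mm_mulhi_epu16 keeps the high 16 bits, i.e. (x * 0xCCCD) >> 16 */
    __m128i hi = _mm_mulhi_epu16(x, _mm_set1_epi16((short)0xCCCD));
    return _mm_srli_epi16(hi, 3);            /* remaining shift: 19 - 16 */
}
```

The same multiply-and-shift pattern works for other constant divisors, but the multiplier and shift count have to be derived (and checked) for each divisor and lane width.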

Regards,
Matthieu

TimP · ‎05-27-2010

All SSE2 implementations include the SSE intrinsics. However, _mm_div_pd is an SSE2 intrinsic, not SSE, for those compilers which still make such a distinction (no longer including Intel C++). For special cases such as this (floating-point division by a power of two), multiplication gives exactly the same answer with far better performance. Most compilers can auto-vectorize this with _mm_mul_pd; Intel C++ performs this optimization only when -prec-div is not set, so, unfortunately, it doesn't happen with standard-compliant options, and the multiplication should be written out in the source:
y = x*.25
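Written with intrinsics for double-precision data, this might look like the sketch below. The function name quarter_pd is hypothetical. Because 0.25 is a power of two, the product and the quotient x/4 are both the correctly rounded value of the same real number, so they agree bit for bit.

```c
#include <emmintrin.h> /* SSE2: __m128d, _mm_mul_pd */

/* Divide two packed doubles by 4 using a multiplication by 0.25. */
static __m128d quarter_pd(__m128d x)
{
    return _mm_mul_pd(x, _mm_set1_pd(0.25));
}
```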

Smart_Lubobya · ‎07-22-2010

Thanks for the replies. Just one more question: suppose I want to divide two variables, such as y=x/q. How do I achieve this in SSE2?

Brijender_B_Intel · ‎07-22-2010

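You may definitely want to do 4 divisions in one go with an xmm register. Note that the integers have to be converted to float (a cast only reinterprets the bits) and the quotient truncated back to integer:

y = _mm_cvttps_epi32(_mm_div_ps(_mm_cvtepi32_ps(x), _mm_cvtepi32_ps(q)));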

neni · ‎07-22-2010

FP division and integer division don't always give the same result (especially with numbers > 2^24, which single precision cannot represent exactly).
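One way to sidestep the precision problem for 32-bit integers is to go through double precision, which has a 53-bit mantissa and therefore holds every 32-bit integer exactly. The sketch below is only an illustration; the name div_epi32_via_pd is hypothetical, and it assumes that no lane of q is zero and that no lane computes INT_MIN / -1.

```c
#include <emmintrin.h> /* SSE2 */

/* Divide four signed 32-bit lanes of x by the matching lanes of q,
   truncating toward zero like C's integer division. */
static __m128i div_epi32_via_pd(__m128i x, __m128i q)
{
    /* _mm_cvtepi32_pd converts the two low 32-bit lanes to doubles */
    __m128d xlo = _mm_cvtepi32_pd(x);
    __m128d qlo = _mm_cvtepi32_pd(q);
    /* shift the register right by 8 bytes to reach lanes 2 and 3 */
    __m128d xhi = _mm_cvtepi32_pd(_mm_srli_si128(x, 8));
    __m128d qhi = _mm_cvtepi32_pd(_mm_srli_si128(q, 8));

    /* _mm_cvttpd_epi32 truncates and leaves the results in the low 64 bits */
    __m128i rlo = _mm_cvttpd_epi32(_mm_div_pd(xlo, qlo));
    __m128i rhi = _mm_cvttpd_epi32(_mm_div_pd(xhi, qhi));

    /* glue the two halves back into one register */
    return _mm_unpacklo_epi64(rlo, rhi);
}
```

This costs two packed divides per four results, so it trades speed for exactness.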

Gaiger_Chen · ‎11-29-2010

Hi,

I have a similar question:

I would like to divide 8-bit (char) values by a power of 2,

but there is no bit shift or divide intrinsic for 8-bit data arrays in SSE1/SSE2.

How should I do this?

thank you.

Taronyu · ‎11-30-2010

Perhaps you'll have a look at MMX, chances are it might support that kind of operation.

Nicolae_P_Intel · ‎11-30-2010

If the power of 2 is constant, you could try the following in SSE (in pseudocode):

input = mm_set_epi8();
input_4_shift_a = mm_unpacklo_epi8(input, 0);
input_4_shift_b = mm_unpackhi_epi8(input, 0);

mm_srli_epi16(input_4_shift_a, count) // count from 2^count
mm_srli_epi16(input_4_shift_b, count) // count from 2^count

Repeat the above for the next 16 chars.

Smart_Lubobya · ‎11-30-2010

Hi

Following your instructions, I wrote this code:

/*
*
*
*/

#include <malloc.h>    /* _aligned_malloc, _aligned_free (MSVC) */
#include <emmintrin.h> /*SSE2*/

#define MALLOC_ALIGN_16BYTE(_size) _aligned_malloc( _size, 16)
#define FREE_ALIGN_16BYTE(ptr) _aligned_free(ptr)

int main(void)
{
    int n = 32;
    char *input, *sseOut;

    input = (char*)MALLOC_ALIGN_16BYTE(n*sizeof(char));
    sseOut = (char*)MALLOC_ALIGN_16BYTE(n*sizeof(char));

    for(int i = 0; i< n;i++){
        input[i] = i;
    }/*for i*/

    __m128i *pInput, *pOutput;
    __m128i zero;

    pInput = (__m128i*)input;
    pOutput =(__m128i*)sseOut;

    zero = _mm_set_epi32(0, 0, 0, 0);

    int m = n/16; /* each __m128i holds 16 chars */

    __m128i temp1, temp2 ;
    __m128i out1, out2;
    for(int i = 0; i< m;i++){

        /* zero-extend the low and high 8 chars to 16-bit lanes */
        temp1 = _mm_unpacklo_epi8(pInput[i], zero);
        temp2 = _mm_unpackhi_epi8(pInput[i], zero);

        /* divide each 16-bit lane by 2 */
        out1 = _mm_srli_epi16(temp1 , 1);
        out2 = _mm_srli_epi16(temp2 , 1);

    }/*for i*/

    FREE_ALIGN_16BYTE(input);
    FREE_ALIGN_16BYTE(sseOut);

    return 0;

}/*main*/

/*
*
*/

I found that I could not restore the original order.

That is, for the first sixteen values:

temp1 = 0 , 0 , 0, 0, 1, 0 , 1, 0, 2, 0, 2, 0 ,3, 0, 3, 0
temp2 = 4 , 0 , 4, 0, 5, 0 , 5, 0, 6, 0, 6, 0 ,7, 0, 7, 0

I would like the merged output to be:

0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7

Which instruction should I use to achieve this?

Thank you very much.

Nicolae_P_Intel · ‎12-02-2010

if you want the temp1 to look like this

0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6,6, 7, 7

the following should do it

temp1 = _mm_unpacklo_epi8(pInput[i], pInput[i]);

Coming back to the original problem (applying the division to 8 of your input chars at once): to bring your result back to a char array you will need 2 byte shuffles (_mm_shuffle_epi8) and 1 byte blend (_mm_blendv_epi8).

I did not check which SSE version is required for those ops.

You will need to double the size of your sseOut buffer to be able to play around with the data after unpacklo/unpackhi.

Nicolae_P_Intel · ‎12-03-2010

an update to my last post

- you do not need a blend but an or
- you do not need to allocate twice as much data for sseOut

Here is an excerpt of non-optimized code that will do that:

temp1 = _mm_unpacklo_epi8(pInput[0], zero);
temp2 = _mm_unpackhi_epi8(pInput[0], zero);

out1 = _mm_srli_epi16(temp1 , 1);
out2 = _mm_srli_epi16(temp2 , 1);

__m128i shufMaskLo = _mm_set_epi8(0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,14,12,10,8,6,4,2,0);
__m128i shufMaskHi = _mm_set_epi8(14,12,10,8,6,4,2,0,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF);

temp1 = _mm_shuffle_epi8(out1, shufMaskLo);
temp2 = _mm_shuffle_epi8(out2, shufMaskHi);

pOutput[0] = _mm_or_si128(temp1,temp2);
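Note that _mm_shuffle_epi8 is an SSSE3 intrinsic (tmmintrin.h), so the excerpt above needs more than SSE2. A possible SSE2-only variant, given here as a sketch, replaces the two shuffles and the OR with a single _mm_packus_epi16, which narrows two registers of 16-bit lanes back into one register of bytes. The function name shr_epu8_unpack is hypothetical, and the bytes are treated as unsigned.

```c
#include <emmintrin.h> /* SSE2 */

/* Shift each of the 16 unsigned bytes in v right by count (0..7),
   i.e. divide each byte by 2^count. */
static __m128i shr_epu8_unpack(__m128i v, int count)
{
    const __m128i zero = _mm_setzero_si128();
    /* zero-extend the low and high 8 bytes to 16-bit lanes and shift */
    __m128i lo = _mm_srli_epi16(_mm_unpacklo_epi8(v, zero), count);
    __m128i hi = _mm_srli_epi16(_mm_unpackhi_epi8(v, zero), count);
    /* narrow back to bytes in the original order; values are 0..255,
       so the unsigned saturation never changes them */
    return _mm_packus_epi16(lo, hi);
}
```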

Nicolae_P_Intel · ‎12-03-2010

I would stay away from using the divider in this case. A shift to the right by 2 (way faster) would suffice, right? Please see my previous post for an example of a right shift.

Gaiger_Chen · ‎12-20-2010

Hi:

I have found the "standard" solution to it:

that is, conversion to int (4 bytes).

char *pSrc, *pDst are the input and output arrays.

__m128i treat;

treat = _mm_cvtsi32_si128(*(const int*)pSrc); /* load 4 chars */

treat = _mm_unpacklo_epi8( treat, _mm_setzero_si128());
treat = _mm_unpacklo_epi16(treat, _mm_setzero_si128());

/*that is 4-byte integer now !!*/

/*do what you want to do here*/

treat = _mm_packs_epi32(treat, _mm_setzero_si128());
treat = _mm_packus_epi16(treat, _mm_setzero_si128()); /* unsigned saturation keeps 128..255 intact */

*(int*)pDst = _mm_cvtsi128_si32( treat); /* store 4 chars */

Please refer to The Software Optimization Cookbook.
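As an illustration of what the "do what you want" step could be, here is a sketch that divides 4 unsigned chars by 4 variable divisors through single-precision float. The names widen4_epu8 and div_u8x4 are hypothetical, and every divisor is assumed to be nonzero. For operands no larger than 255 the float quotient truncates to the exact integer result, because a non-integer quotient is at least 1/255 away from the next integer, far more than the float rounding error.

```c
#include <emmintrin.h> /* SSE2 */
#include <string.h>    /* memcpy */

/* Zero-extend 4 unsigned chars to four 32-bit lanes. */
static __m128i widen4_epu8(const unsigned char *p)
{
    int bits;
    memcpy(&bits, p, 4); /* load 4 bytes without alignment or aliasing issues */
    __m128i v = _mm_cvtsi32_si128(bits);
    v = _mm_unpacklo_epi8(v, _mm_setzero_si128());
    return _mm_unpacklo_epi16(v, _mm_setzero_si128());
}

/* dst[i] = src[i] / q[i] for i = 0..3; every q[i] must be nonzero. */
static void div_u8x4(const unsigned char *src, const unsigned char *q,
                     unsigned char *dst)
{
    __m128 x = _mm_cvtepi32_ps(widen4_epu8(src));
    __m128 d = _mm_cvtepi32_ps(widen4_epu8(q));
    __m128i r = _mm_cvttps_epi32(_mm_div_ps(x, d)); /* truncate toward zero */
    r = _mm_packs_epi32(r, _mm_setzero_si128());    /* 32-bit -> 16-bit */
    r = _mm_packus_epi16(r, _mm_setzero_si128());   /* 16-bit -> unsigned 8-bit */
    int bits = _mm_cvtsi128_si32(r);
    memcpy(dst, &bits, 4);
}
```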

Thomas_W_Intel · ‎12-31-2010

May I suggest an alternative solution that should be faster?

Use a 4-byte shift instruction, but without converting to 4-Byte integers.

When you apply a 4-byte shift, the 1-byte values are shifted the same way. The only problem is that zeros are shifted in at the 4-byte level rather than at the byte level, so the low bits of the neighboring byte end up in the top bits of each byte. This can be fixed with an "and" instruction using an appropriate mask:

__m128i a = _mm_set_epi8(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16);
__m128i tmp = _mm_srai_epi32 (a, count);
__m128i res = _mm_and_si128(tmp, _mm_set1_epi8((1 << (8-count)) - 1));
(untested code)

This way, you use only 2-3 instructions.
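A compilable sketch of this idea, assuming the bytes are unsigned and count is between 0 and 7. The function name shr_epu8_masked is hypothetical. It uses the logical shift _mm_srli_epi32; since the mask clears the top count bits of every byte anyway, the arithmetic shift would give the same result.

```c
#include <emmintrin.h> /* SSE2 */

/* Shift each of the 16 unsigned bytes in v right by count (0..7)
   using one 32-bit shift and one AND. */
static __m128i shr_epu8_masked(__m128i v, int count)
{
    /* bits from the neighboring byte leak into the top of each byte */
    __m128i shifted = _mm_srli_epi32(v, count);
    /* keep only the low (8 - count) bits of every byte */
    __m128i mask = _mm_set1_epi8((char)(0xFF >> count));
    return _mm_and_si128(shifted, mask);
}
```

When count is a compile-time constant, the mask is a constant too and can stay in a register across loop iterations.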

mrphantuan · ‎01-04-2011

If the power of 2 is constant, you could try the following in SSE (in pseudocode):

input = mm_set_epi8();
input_4_shift_a = mm_unpacklo_epi8(input, 0);
input_4_shift_b = mm_unpackhi_epi8(input, 0);

mm_srli_epi16(input_4_shift_a, count) // count from 2^count
mm_srli_epi16(input_4_shift_b, count) // count from 2^count

Repeat the above for the next 16 chars.


Thomas_W_Intel · ‎01-07-2011

Quoting mrphantuan

If the power of 2 is constant, you could try the following in SSE (in pseudocode):
input = mm_set_epi8();
input_4_shift_a = mm_unpacklo_epi8(input, 0);
input_4_shift_b = mm_unpackhi_epi8(input, 0);

mm_srli_epi16(input_4_shift_a, count) // count from 2^count
mm_srli_epi16(input_4_shift_b, count) // count from 2^count

Repeat the above for the next 16 chars.

However, this requires 6 instructions (2 unpack, 2 shift, 2 pack) in order to process 16 values. If you use

__m128i tmp = _mm_srai_epi32 (a, count);
__m128i res = _mm_and_si128(tmp, _mm_set1_epi8((1 << (8-count)) - 1));

you need only 3 instructions (shift, load, and), because "_mm_set1_epi8((1 << (8-count)) - 1)" can be evaluated at compile time when count is a constant. If this operation is used in a tight loop and the result of _mm_set1_epi8((1 << (8-count)) - 1) can be kept in a register, you effectively need only 2 instructions per loop trip.