Intel® ISA Extensions

division sse2 intrinsic

Smart_Lubobya
Beginner
3,334 Views
How would I apply an SSE2 intrinsic to a division such as:
y = x/4
I tried _mm_div_epi16(x, 4), but it did not work. In the manual I have seen _mm_div_pd(a, b), but this is for SSE and not SSE2. Please help.
0 Kudos
16 Replies
matthieu_darbois
New Contributor III
3,334 Views
Hi,

There is no integer division instruction in SSEx, hence no intrinsic. However, for a division by a power of two, you can use the shift intrinsics: _mm_srai_epi16 for signed integers and _mm_srli_epi16 for unsigned integers.
For other constant divisors, you may be able to use a multiplication followed by a shift to achieve an integer division.

Regards,
Matthieu
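
The two suggestions above can be sketched in C using SSE2 only. This is a minimal example under stated assumptions, not code from the thread: the helper names div_pow2_epu16, div_pow2_epi16 and div10_epu16 are made up for illustration. One caveat is easy to miss: an arithmetic right shift rounds toward negative infinity, while C integer division truncates toward zero, so the signed helper adds a bias of 2^k - 1 to negative lanes before shifting. The constant 52429 is ceil(2^19 / 10), which gives exact quotients for every 16-bit unsigned input.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Unsigned x / 2^k for eight uint16_t lanes (hypothetical helper). */
static __m128i div_pow2_epu16(__m128i x, int k)
{
    assert(k >= 0 && k < 16);
    return _mm_srl_epi16(x, _mm_cvtsi32_si128(k));
}

/* Signed x / 2^k for eight int16_t lanes, truncating toward zero like C. */
static __m128i div_pow2_epi16(__m128i x, int k)
{
    assert(k >= 1 && k < 16);
    /* sign is 0xFFFF in negative lanes, 0 elsewhere */
    __m128i sign = _mm_srai_epi16(x, 15);
    /* bias is 2^k - 1 in negative lanes, 0 elsewhere */
    __m128i bias = _mm_srl_epi16(sign, _mm_cvtsi32_si128(16 - k));
    return _mm_sra_epi16(_mm_add_epi16(x, bias), _mm_cvtsi32_si128(k));
}

/* Unsigned x / 10 for eight uint16_t lanes: (x * 52429) >> 19. */
static __m128i div10_epu16(__m128i x)
{
    /* _mm_mulhi_epu16 keeps the upper 16 bits of each 32-bit product (>> 16) */
    __m128i hi = _mm_mulhi_epu16(x, _mm_set1_epi16((short)52429));
    return _mm_srli_epi16(hi, 3); /* remaining >> 3 */
}
```

For y = x/4 on signed 16-bit data this would be called as div_pow2_epi16(x, 2). Other constant divisors need their own multiplier and shift, which should be verified over the full input range before use.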
0 Kudos
TimP
Honored Contributor III
3,334 Views
All SSE2 implementations include the SSE intrinsics. However, _mm_div_pd is an SSE2 intrinsic, not SSE, for those compilers which still make such a distinction (Intel C++ no longer does). For special cases such as this one, multiplication gives exactly the same answer with far better performance. Most compilers are able to auto-vectorize this with _mm_mul_pd; Intel C++ performs the optimization only when -prec-div is not set, so, unfortunately, it doesn't happen with standard-compliant options, and the multiplication should be written explicitly in the source:
y = x*.25
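
As a hedged illustration of this point (not TimP's own code): because 0.25 is a power of two, x * 0.25 and x / 4.0 are both the correctly rounded value of the same exact quotient, so the results are bit-identical. The function name quarter_pd is hypothetical, and for brevity the sketch assumes an even element count.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2: __m128d, _mm_mul_pd */
#include <stddef.h>

/* y[i] = x[i] / 4, computed as a multiplication by 0.25, two doubles at a time. */
static void quarter_pd(const double *x, double *y, size_t n)
{
    assert(n % 2 == 0); /* assumption: no scalar tail handling in this sketch */
    const __m128d quarter = _mm_set1_pd(0.25);
    for (size_t i = 0; i < n; i += 2) {
        __m128d v = _mm_loadu_pd(x + i);
        _mm_storeu_pd(y + i, _mm_mul_pd(v, quarter));
    }
}
```

The same substitution is not exact for divisors such as 3 or 10, whose reciprocals are not representable in binary floating point.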
0 Kudos
Smart_Lubobya
Beginner
3,334 Views
Thanks for the replies. Just one more question: suppose I want to divide two variables, such as y = x/q. How do I achieve this in SSE2?
0 Kudos
Brijender_B_Intel
3,334 Views
You may want to do 4 divisions in one go with an xmm register, converting the 32-bit integers to float and back:

y = _mm_cvttps_epi32(_mm_div_ps(_mm_cvtepi32_ps(x), _mm_cvtepi32_ps(q)));

0 Kudos
neni
New Contributor II
3,334 Views
Floating-point division and integer division do not always give the same result, especially for numbers above 2^24, which single precision cannot represent exactly.
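
One way around the 2^24 limit, sketched here as an assumption rather than something proposed in the thread, is to go through double precision, which represents every 32-bit integer exactly. Through the single-precision path, 16777217 / 1 would come back as 16777216. The names div_epi32_via_pd and div_i32 are hypothetical; a zero divisor and INT32_MIN / -1 are not handled and produce 0x80000000 in that lane.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Four signed 32-bit divisions, truncating toward zero like C. */
static __m128i div_epi32_via_pd(__m128i x, __m128i q)
{
    /* lanes 0 and 1 as doubles */
    __m128d xlo = _mm_cvtepi32_pd(x);
    __m128d qlo = _mm_cvtepi32_pd(q);
    /* lanes 2 and 3: move them down by 8 bytes, then convert */
    __m128d xhi = _mm_cvtepi32_pd(_mm_srli_si128(x, 8));
    __m128d qhi = _mm_cvtepi32_pd(_mm_srli_si128(q, 8));
    /* each conversion leaves two int32 results in the low 64 bits */
    __m128i lo = _mm_cvttpd_epi32(_mm_div_pd(xlo, qlo));
    __m128i hi = _mm_cvttpd_epi32(_mm_div_pd(xhi, qhi));
    return _mm_unpacklo_epi64(lo, hi);
}

/* y[i] = x[i] / q[i] over arrays; n is assumed to be a multiple of 4. */
static void div_i32(const int32_t *x, const int32_t *q, int32_t *y, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128i vx = _mm_loadu_si128((const __m128i *)(x + i));
        __m128i vq = _mm_loadu_si128((const __m128i *)(q + i));
        _mm_storeu_si128((__m128i *)(y + i), div_epi32_via_pd(vx, vq));
    }
}
```

This costs two double-precision divides per four results, so it trades speed for exactness compared with the single-precision version.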
0 Kudos
Gaiger_Chen
New Contributor I
3,334 Views
Hi,

I have a similar question:

I would like to divide 8-bit (char) values by a power of 2, but there is no bit shift or divide intrinsic for 8-bit data arrays in SSE1/SSE2.

How should I do this?

Thank you.
0 Kudos
Taronyu
Beginner
3,334 Views
Perhaps you'll have a look at MMX, chances are it might support that kind of operation.
0 Kudos
Nicolae_P_Intel
Employee
3,334 Views

If the power of 2 is constant, you could try the following in SSE (in pseudocode):

input = _mm_set_epi8(...);
input_4_shift_a = _mm_unpacklo_epi8(input, zero);
input_4_shift_b = _mm_unpackhi_epi8(input, zero);

_mm_srli_epi16(input_4_shift_a, count) // count is the exponent in 2^count
_mm_srli_epi16(input_4_shift_b, count) // count is the exponent in 2^count

Repeat the above for the next 16 chars.

0 Kudos
Smart_Lubobya
Beginner
3,334 Views
Hi,

Following your instructions, I wrote this code:

#include <malloc.h>    /* _aligned_malloc, _aligned_free (MSVC) */
#include <emmintrin.h> /* SSE2 */

#define MALLOC_ALIGN_16BYTE(_size) _aligned_malloc(_size, 16)
#define FREE_ALIGN_16BYTE(ptr) _aligned_free(ptr)

int main(void)
{
    int n = 32;
    char *input, *sseOut;

    input = (char*)MALLOC_ALIGN_16BYTE(n*sizeof(char));
    sseOut = (char*)MALLOC_ALIGN_16BYTE(n*sizeof(char));

    for(int i = 0; i < n; i++){
        input[i] = i;
    }/*for i*/

    __m128i *pInput, *pOutput;
    __m128i zero;

    pInput = (__m128i*)input;
    pOutput = (__m128i*)sseOut;

    zero = _mm_set_epi32(0, 0, 0, 0);

    int m = n/16; /* each __m128i holds 16 chars */

    __m128i temp1, temp2;
    __m128i out1, out2;
    for(int i = 0; i < m; i++){
        /* widen 16 chars into two vectors of eight 16-bit values */
        temp1 = _mm_unpacklo_epi8(pInput[i], zero);
        temp2 = _mm_unpackhi_epi8(pInput[i], zero);

        /* divide each 16-bit value by 2 */
        out1 = _mm_srli_epi16(temp1, 1);
        out2 = _mm_srli_epi16(temp2, 1);
    }/*for i*/

    FREE_ALIGN_16BYTE(input);
    FREE_ALIGN_16BYTE(sseOut);

    return 0;
}/*main*/



I found that I could not restore the original byte order.

That is, for the first sixteen values:

temp1 = 0, 0, 0, 0, 1, 0, 1, 0, 2, 0, 2, 0, 3, 0, 3, 0
temp2 = 4, 0, 4, 0, 5, 0, 5, 0, 6, 0, 6, 0, 7, 0, 7, 0

I would like the merged output to be:

0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7

What instruction should I use to achieve this?

Thanks a lot.

0 Kudos
Nicolae_P_Intel
Employee
3,334 Views

If you want temp1 to look like this:

0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7

the following should do it:

temp1 = _mm_unpacklo_epi8(pInput[0], pInput[0]);

Coming back to the original problem (applying the division to 8 of your input chars at once): to bring your result back into a char array, you will need 2 byte shuffles (_mm_shuffle_epi8) and 1 byte blend (_mm_blendv_epi8).

I did not check which SSE version is required for those ops.

You will need to double the size of sseOut to be able to play around with the data after unpacklo/unpackhi.

0 Kudos
Nicolae_P_Intel
Employee
3,334 Views

An update to my last post:

    • you do not need a blend but an or
    • you do not need to allocate twice as much data for sseOut

Here is an excerpt of non-optimized code that will do that:

temp1 = _mm_unpacklo_epi8(pInput[0], zero);
temp2 = _mm_unpackhi_epi8(pInput[0], zero);

out1 = _mm_srli_epi16(temp1, 1);
out2 = _mm_srli_epi16(temp2, 1);

__m128i shufMaskLo = _mm_set_epi8(0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,14,12,10,8,6,4,2,0);
__m128i shufMaskHi = _mm_set_epi8(14,12,10,8,6,4,2,0,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF);

temp1 = _mm_shuffle_epi8(out1, shufMaskLo);
temp2 = _mm_shuffle_epi8(out2, shufMaskHi);

pOutput[0] = _mm_or_si128(temp1, temp2);
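
Note that _mm_shuffle_epi8 is an SSSE3 intrinsic (tmmintrin.h), so the excerpt above does not build for a plain SSE2 target. A possible SSE2-only alternative, offered as a sketch rather than as the poster's method, is to merge the two shifted halves with _mm_packus_epi16, which narrows sixteen 16-bit values back to sixteen bytes in order. The function name div_pow2_u8 is hypothetical, and the length is assumed to be a multiple of 16.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* out[i] = in[i] / 2^count for unsigned bytes, 16 at a time. */
static void div_pow2_u8(const uint8_t *in, uint8_t *out, size_t n, int count)
{
    assert(n % 16 == 0 && count >= 0 && count < 8);
    const __m128i zero = _mm_setzero_si128();
    const __m128i cnt = _mm_cvtsi32_si128(count);
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(in + i));
        /* zero-extend bytes 0..7 and 8..15 to 16 bits, then shift */
        __m128i lo = _mm_srl_epi16(_mm_unpacklo_epi8(v, zero), cnt);
        __m128i hi = _mm_srl_epi16(_mm_unpackhi_epi8(v, zero), cnt);
        /* narrow back to bytes: lo fills bytes 0..7, hi fills bytes 8..15 */
        _mm_storeu_si128((__m128i *)(out + i), _mm_packus_epi16(lo, hi));
    }
}
```

With the input 0, 1, 2, ..., 15 and count 1, the stored bytes are 0, 0, 1, 1, 2, 2, ..., 7, 7, which is the merged order asked for earlier in the thread.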

0 Kudos
Nicolae_P_Intel
Employee
3,334 Views
I would avoid the divider in this case. A right shift by 2 (way faster) would suffice, right? Please see my previous post for an example of a right shift.
0 Kudos
Gaiger_Chen
New Contributor I
3,334 Views
Hi,

I have found the "standard" solution: conversion to int (4 bytes).

char *pSrc, *pDst are the input and output arrays.

__m128i treat;

/* load 4 chars into the low 32 bits */
treat = _mm_cvtsi32_si128(*(const int*)pSrc);

treat = _mm_unpacklo_epi8(treat, _mm_setzero_si128());
treat = _mm_unpacklo_epi16(treat, _mm_setzero_si128());

/* these are 4-byte integers now */

/* do what you want to do here */

treat = _mm_packs_epi32(treat, _mm_setzero_si128());
treat = _mm_packus_epi16(treat, _mm_setzero_si128()); /* unsigned saturation, matching the zero extension above */

/* store 4 chars */
*(int*)pDst = _mm_cvtsi128_si32(treat);

Please refer to The Software Optimization Cookbook.
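
As one possible use of the widened form (an illustrative assumption, not something taken from the book or the thread), the 32-bit lanes can be converted to float and divided by a divisor that is not a power of two. Byte values are far below 2^24, so the single-precision quotient truncates to the exact integer result. The function name div_u8 is hypothetical; the length is assumed to be a multiple of 4 and the divisor nonzero.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* dst[i] = src[i] / q for unsigned bytes, 4 at a time. */
static void div_u8(const uint8_t *src, uint8_t *dst, size_t n, uint8_t q)
{
    assert(n % 4 == 0 && q != 0);
    const __m128i zero = _mm_setzero_si128();
    const __m128 vq = _mm_set1_ps((float)q);
    for (size_t i = 0; i < n; i += 4) {
        int32_t bits;
        memcpy(&bits, src + i, 4);           /* alignment-safe 4-byte load */
        __m128i v = _mm_cvtsi32_si128(bits);
        v = _mm_unpacklo_epi8(v, zero);      /* 8 -> 16 bits */
        v = _mm_unpacklo_epi16(v, zero);     /* 16 -> 32 bits */
        v = _mm_cvttps_epi32(_mm_div_ps(_mm_cvtepi32_ps(v), vq));
        v = _mm_packs_epi32(v, zero);        /* 32 -> 16 bits */
        v = _mm_packus_epi16(v, zero);       /* 16 -> 8 bits, unsigned */
        bits = _mm_cvtsi128_si32(v);
        memcpy(dst + i, &bits, 4);
    }
}
```

Only 4 of the 16 byte lanes are used per iteration, so this is mainly worthwhile when the per-element operation has no cheaper 8-bit or 16-bit form.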
0 Kudos
Thomas_W_Intel
Employee
3,334 Views
May I suggest an alternative solution that should be faster?

Use a 4-byte shift instruction, but without converting to 4-byte integers.

When you apply a 4-byte shift, the 1-byte values are shifted the same way. The only problem is that zeros are shifted in at the 4-byte level rather than at the byte level, so the top bits of each byte receive bits from the neighbouring byte. This can be fixed with an "and" instruction using an appropriate mask:

__m128i a = _mm_set_epi8(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16);
__m128i tmp = _mm_srai_epi32(a, count);
__m128i res = _mm_and_si128(tmp, _mm_set1_epi8((1 << (8-count)) - 1));
(untested code)

This way, you use only 2-3 instructions.
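
A self-contained version of this shift-and-mask idea might look like the sketch below. The name srl_epi8 is hypothetical, the bytes are treated as unsigned, and a logical 32-bit shift is used in place of the arithmetic one; either works here because the mask clears every bit that was shifted in from outside the byte.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Logical right shift of sixteen unsigned bytes by count (0..7). */
static __m128i srl_epi8(__m128i x, int count)
{
    assert(count >= 0 && count < 8);
    /* shift whole 32-bit lanes; bits leak across byte boundaries */
    __m128i shifted = _mm_srl_epi32(x, _mm_cvtsi32_si128(count));
    /* keep only the low (8 - count) bits of every byte */
    __m128i mask = _mm_set1_epi8((char)(0xFF >> count));
    return _mm_and_si128(shifted, mask);
}
```

Note that 0xFF >> count equals (1 << (8 - count)) - 1; the parentheses around the shift matter, since in C the subtraction binds more tightly than <<.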
0 Kudos
Thomas_W_Intel
Employee
3,334 Views
Quoting mrphantuan

If the power of 2 is constant, you could try the following in SSE (in pseudocode):

input = _mm_set_epi8(...);
input_4_shift_a = _mm_unpacklo_epi8(input, zero);
input_4_shift_b = _mm_unpackhi_epi8(input, zero);

_mm_srli_epi16(input_4_shift_a, count) // count is the exponent in 2^count
_mm_srli_epi16(input_4_shift_b, count) // count is the exponent in 2^count

Repeat the above for the next 16 chars.


However, this requires 6 instructions (2 unpack, 2 shift, 2 pack) in order to process 16 values. If you use

__m128i tmp = _mm_srai_epi32(a, count);
__m128i res = _mm_and_si128(tmp, _mm_set1_epi8((1 << (8-count)) - 1));

you need only 3 instructions (shift, load, and), because "_mm_set1_epi8((1 << (8-count)) - 1)" can be evaluated at compile time. If this operation is used in a tight loop and the result of _mm_set1_epi8((1 << (8-count)) - 1) can be kept in a register, you effectively need only 2 instructions per loop trip.
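
The tight-loop case described above could be sketched as follows, with the mask and shift count built once before the loop so that each iteration only shifts and masks (plus the load and store). The function name shift_right_u8 is hypothetical, the bytes are treated as unsigned, and whether the compiler actually keeps the constants in registers is an assumption that depends on the surrounding code.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* In place: data[i] = data[i] / 2^count for unsigned bytes. */
static void shift_right_u8(uint8_t *data, size_t n, int count)
{
    assert(count >= 0 && count < 8);
    /* loop-invariant values, created once */
    const __m128i cnt = _mm_cvtsi32_si128(count);
    const __m128i mask = _mm_set1_epi8((char)(0xFF >> count));
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        v = _mm_and_si128(_mm_srl_epi32(v, cnt), mask);
        _mm_storeu_si128((__m128i *)(data + i), v);
    }
    /* scalar tail for the last n % 16 bytes */
    for (; i < n; i++)
        data[i] = (uint8_t)(data[i] >> count);
}
```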

0 Kudos