Intel® ISA Extensions
division sse2 intrinsic

Smart_Lubobya
Beginner
1,184 Views
How would I apply an SSE2 intrinsic to a division such as:
y=x/4
I tried _mm_div_epi16(x,4), but it did not work. In the manual I have seen _mm_div_pd(a,b), but this is for SSE and not SSE2. Please help.
0 Kudos
16 Replies
matthieu_darbois
New Contributor III
1,184 Views
Hi,

There is no instruction capable of integer division in SSEx hence no intrinsic. However, for a division by a power of two, you can use the shift intrinsic : _mm_srai_epi16 for signed integers and _mm_srli_epi16 for unsigned integers.
For other values, constants, you might be able to use multiplication followed by a shift in order to achieve an integer division.

Regards,
Matthieu
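A minimal sketch of both suggestions for 16-bit lanes, using only SSE2 intrinsics. The function names (div4_epu16, div4_epi16, div10_epu16, count_mismatches) are made up for illustration and are not from the post. One detail worth noting: an arithmetic shift on its own rounds negative values toward minus infinity, whereas C integer division truncates toward zero, so the signed version adds a small bias first. The constant 52429 for division by 10 is one known multiplier for unsigned 16-bit values; other divisors need their own constant and shift.

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Unsigned 16-bit lanes: x / 4 is a logical right shift by 2. */
static __m128i div4_epu16(__m128i x)
{
    return _mm_srli_epi16(x, 2);
}

/* Signed 16-bit lanes: a plain arithmetic shift gives -7 >> 2 == -2,
   while C gives -7 / 4 == -1. Adding 3 to negative lanes before the
   shift makes the result truncate toward zero like C. */
static __m128i div4_epi16(__m128i x)
{
    __m128i sign = _mm_srai_epi16(x, 15);    /* 0xFFFF in negative lanes, else 0 */
    __m128i bias = _mm_srli_epi16(sign, 14); /* 3 in negative lanes, else 0 */
    return _mm_srai_epi16(_mm_add_epi16(x, bias), 2);
}

/* Unsigned 16-bit lanes: x / 10 as a multiply followed by a shift.
   52429 == ceil(2^19 / 10). mulhi keeps the upper 16 bits of the 32-bit
   product, so a further shift by 3 completes (x * 52429) >> 19. */
static __m128i div10_epu16(__m128i x)
{
    __m128i hi = _mm_mulhi_epu16(x, _mm_set1_epi16((short)52429));
    return _mm_srli_epi16(hi, 3);
}

/* Counts how many of the 65536 possible lane values give a result that
   differs from scalar C division (hypothetical self-check helper). */
static int count_mismatches(void)
{
    int bad = 0;
    for (int v = 0; v < 65536; ++v) {
        __m128i  x = _mm_set1_epi16((short)v);
        uint16_t u = (uint16_t)v;
        int16_t  s = (int16_t)u;
        bad += (uint16_t)_mm_extract_epi16(div4_epu16(x), 0)  != u / 4;
        bad += (int16_t)_mm_extract_epi16(div4_epi16(x), 0)   != s / 4;
        bad += (uint16_t)_mm_extract_epi16(div10_epu16(x), 0) != u / 10;
    }
    return bad;
}
```

Calling count_mismatches() from a test program is a cheap way to confirm a multiplier and shift pair before relying on it.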
TimP
Black Belt
1,184 Views
All SSE2 implementations include the SSE intrinsics. However, _mm_div_pd is an SSE2 intrinsic, not SSE, for those compilers which still make such a distinction (Intel C++ no longer does). For special cases such as this, multiplication gives exactly the same answer with far better performance. Most compilers are able to auto-vectorize this with _mm_mul_pd; Intel C++ performs this optimization only when -prec-div is not set, so, unfortunately, it doesn't happen with standard-compliant options, and the multiplication should be written out explicitly:
y = x*.25
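For the floating-point case, a small sketch comparing the two forms with SSE2 double-precision intrinsics. The function names are hypothetical. The results are bit-identical here because 0.25 is a power of two and therefore exactly representable; for a divisor such as 3 the reciprocal is not exact and the two forms can differ in the last bit.

```c
#include <emmintrin.h> /* SSE2 */

/* x / 4 computed with the (slow) divider. */
static __m128d quarter_by_div(__m128d x)
{
    return _mm_div_pd(x, _mm_set1_pd(4.0));
}

/* x * 0.25 computed with the multiplier; same result for this divisor. */
static __m128d quarter_by_mul(__m128d x)
{
    return _mm_mul_pd(x, _mm_set1_pd(0.25));
}
```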
Smart_Lubobya
Beginner
1,184 Views
Thanks for the replies. Just one more question: suppose I want to divide two variables, such as y=x/q. How do I achieve this in SSE2?
Brijender_B_Intel
1,184 Views
You may definitely want to do 4 divisions in one go with xmm register.



y = _mm_cvttps_epi32(_mm_div_ps(_mm_cvtepi32_ps(x), _mm_cvtepi32_ps(q)));

neni
New Contributor II
1,184 Views
FP division and integer division don't always give the same result (especially with numbers > 2^24).
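One possible way around the 2^24 limit, sketched under the assumption that the inputs are signed 32-bit integers and no divisor is zero: do the division in double precision, which represents every 32-bit integer exactly, two lanes at a time. The function names are made up for illustration. The single-precision version is included only to show where it goes wrong.

```c
#include <emmintrin.h> /* SSE2 */

/* Four signed 32-bit divisions through single precision.
   Inputs above 2^24 may be rounded when converted to float. */
static __m128i div_epi32_via_ps(__m128i x, __m128i q)
{
    return _mm_cvttps_epi32(_mm_div_ps(_mm_cvtepi32_ps(x), _mm_cvtepi32_ps(q)));
}

/* Four signed 32-bit divisions through double precision, two lanes at a time. */
static __m128i div_epi32_via_pd(__m128i x, __m128i q)
{
    /* _mm_cvtepi32_pd converts the two low 32-bit lanes */
    __m128d lo = _mm_div_pd(_mm_cvtepi32_pd(x), _mm_cvtepi32_pd(q));

    /* move the two high lanes down, then convert and divide them too */
    __m128i xh = _mm_unpackhi_epi64(x, x);
    __m128i qh = _mm_unpackhi_epi64(q, q);
    __m128d hi = _mm_div_pd(_mm_cvtepi32_pd(xh), _mm_cvtepi32_pd(qh));

    /* cvtt truncates toward zero like C and leaves two results in the
       low 64 bits; unpacklo glues the two halves back together */
    return _mm_unpacklo_epi64(_mm_cvttpd_epi32(lo), _mm_cvttpd_epi32(hi));
}
```

With x = 16777217 (2^24 + 1) and q = 1, the single-precision version returns 16777216, while the double-precision version returns the input unchanged.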
Gaiger_Chen
New Contributor I
1,184 Views
Hi,

I have a similar question:

I would like to divide 8-bit (char) values by a power of 2,

but there is no bit shift or divide intrinsic for 8-bit data arrays in SSE/SSE2.

How should I do this?

Thank you.
Taronyu
Beginner
1,184 Views
Perhaps you'll have a look at MMX, chances are it might support that kind of operation.
Nicolae_P_Intel
Employee
1,184 Views

If the power of 2 is constant, you could try the following in SSE (in pseudocode):

input = _mm_set_epi8(...);
input_4_shift_a = _mm_unpacklo_epi8(input, zero);

input_4_shift_b = _mm_unpackhi_epi8(input, zero);

_mm_srli_epi16(input_4_shift_a, count) // count is the exponent: divisor = 2^count
_mm_srli_epi16(input_4_shift_b, count) // count is the exponent: divisor = 2^count

Repeat the above for the next 16 chars.

Smart_Lubobya
Beginner
1,184 Views
Hi,

Following your instructions, I wrote this code:

/*
*
*
*/

#include <malloc.h> /* _aligned_malloc, _aligned_free (MSVC) */
#include <emmintrin.h> /*SSE2*/


#define MALLOC_ALIGN_16BYTE(_size) _aligned_malloc( _size, 16)
#define FREE_ALIGN_16BYTE(ptr) _aligned_free(ptr)


int main(void)
{
int n = 32;
char *input, *sseOut;

input = (char*)MALLOC_ALIGN_16BYTE(n*sizeof(char));
sseOut = (char*)MALLOC_ALIGN_16BYTE(n*sizeof(char));


for(int i = 0; i< n;i++){
input[i] = i;
}/*for i*/

__m128i *pInput, *pOutput;
__m128i zero;

pInput = (__m128i*)input;
pOutput =(__m128i*)sseOut;

zero = _mm_set_epi32(0, 0, 0, 0);


int m = n/16; /* each __m128i holds 16 chars */

__m128i temp1, temp2 ;
__m128i out1, out2;
for(int i = 0; i< m;i++){

temp1 = _mm_unpacklo_epi8(pInput[i], zero);
temp2 = _mm_unpackhi_epi8(pInput[i], zero);

out1 = _mm_srli_epi16(temp1 , 1);
out2 = _mm_srli_epi16(temp2 , 1);


}/*for i*/



FREE_ALIGN_16BYTE(input);
FREE_ALIGN_16BYTE(sseOut);


return 0;

}/*main*/

/*
*
*/



I found that I could not restore the original order.

That is, temp1 = 0 , 0 , 0, 0, 1, 0 , 1, 0, 2, 0, 2, 0 ,3, 0, 3, 0
temp2 = 4 , 0 , 4, 0, 5, 0 , 5, 0, 6, 0, 6, 0 ,7, 0, 7, 0

for the first sixteen values.

I would like the merged output to be:

0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6,6, 7, 7

Which instruction should I use to achieve this?

Thanks a lot.

Nicolae_P_Intel
Employee
1,184 Views

if you want the temp1 to look like this

0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6,6, 7, 7

the following should do it

temp1 = _mm_unpacklo_epi8(pInput[i], pInput[i]);

Coming back to the original problem (applying the division to 8 of your input chars at once): to bring your result back into a char array you will need 2 byte shuffles (_mm_shuffle_epi8) and 1 byte blend (_mm_blendv_epi8).

I did not check which SSE version is required for those ops.

You will need to double the size of sseOut to be able to work with the data after unpacklo/unpackhi.

Nicolae_P_Intel
Employee
1,184 Views

an update to my last post

    • you do not need a blend but an or
    • you do not need to allocate twice as much data for sseOut

Here is an excerpt of non optimized code that will do that

temp1 = _mm_unpacklo_epi8(pInput[0], zero);

temp2 = _mm_unpackhi_epi8(pInput[0], zero);

out1 = _mm_srli_epi16(temp1 , 1);

out2 = _mm_srli_epi16(temp2 , 1);

__m128i shufMaskLo = _mm_set_epi8(0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,14,12,10,8,6,4,2,0);

__m128i shufMaskHi = _mm_set_epi8(14,12,10,8,6,4,2,0,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF);

temp1 = _mm_shuffle_epi8(out1, shufMaskLo);

temp2 = _mm_shuffle_epi8(out2, shufMaskHi);

pOutput[0] = _mm_or_si128(temp1,temp2);
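For reference, _mm_shuffle_epi8 requires SSSE3 and _mm_blendv_epi8 requires SSE4.1. If the code has to stay within SSE2, one alternative (a sketch, not taken from the post above) is to merge the two shifted halves with _mm_packus_epi16, which narrows 16-bit lanes back to bytes. The function name srl_epu8_unpack is made up for illustration; count is assumed to be in the range 0..7.

```c
#include <emmintrin.h> /* SSE2 only */

/* Logical right shift of 16 unsigned bytes by count (0..7),
   i.e. division of every byte by 2^count. */
static __m128i srl_epu8_unpack(__m128i v, int count)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i cnt  = _mm_cvtsi32_si128(count); /* shift count in a register */

    __m128i lo = _mm_unpacklo_epi8(v, zero); /* bytes 0..7  -> 16-bit lanes */
    __m128i hi = _mm_unpackhi_epi8(v, zero); /* bytes 8..15 -> 16-bit lanes */

    lo = _mm_srl_epi16(lo, cnt);
    hi = _mm_srl_epi16(hi, cnt);

    /* every lane is at most 255, so the unsigned saturation never
       changes a value; the bytes come back in their original order */
    return _mm_packus_epi16(lo, hi);
}
```

For the input bytes 0, 1, 2, ..., 15 and count = 1 this yields 0, 0, 1, 1, 2, 2, ..., 7, 7, which is the merged order asked for earlier in the thread.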

Nicolae_P_Intel
Employee
1,184 Views
I would avoid using the divider in this case. A right shift by 2 (much faster) would suffice, right? Please see my previous post for an example of a right shift.
Gaiger_Chen
New Contributor I
1,184 Views
Hi:

I have found the "standard" solution to it,

which is conversion to int (4 bytes).


char *pSrc, *pDst are the input and output arrays.

__m128i treat;

treat = _mm_cvtsi32_si128(*(const int*)pSrc); /* load 4 chars */

treat = _mm_unpacklo_epi8( treat, _mm_setzero_si128());
treat = _mm_unpacklo_epi16(treat, _mm_setzero_si128());

/* these are 4-byte integers now */

/* do what you want to do here */

treat = _mm_packs_epi32(treat, _mm_setzero_si128());
treat = _mm_packus_epi16(treat, _mm_setzero_si128()); /* unsigned saturation keeps values 128..255 */

*(int*)pDst = _mm_cvtsi128_si32(treat); /* store 4 chars */

Please refer to The Software Optimization Cookbook.
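As an illustration of what can go in the "do what you want" step, here is a sketch that divides four unsigned bytes by four variable divisors through single-precision floats. Values up to 255 are far below 2^24, so the float quotient truncates to the same result as integer division. The names div4_u8 and count_mismatches are hypothetical, memcpy is used for the 4-byte load and store to avoid alignment and aliasing concerns, and every divisor is assumed to be non-zero.

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>
#include <string.h>

/* dst[i] = src[i] / div[i] for i = 0..3; every div[i] must be non-zero. */
static void div4_u8(const uint8_t *src, const uint8_t *div, uint8_t *dst)
{
    const __m128i zero = _mm_setzero_si128();
    int32_t s, d, r;
    memcpy(&s, src, 4);
    memcpy(&d, div, 4);

    /* widen 8 -> 16 -> 32 bits with zero extension */
    __m128i x = _mm_unpacklo_epi16(_mm_unpacklo_epi8(_mm_cvtsi32_si128(s), zero), zero);
    __m128i q = _mm_unpacklo_epi16(_mm_unpacklo_epi8(_mm_cvtsi32_si128(d), zero), zero);

    /* divide as floats, then truncate toward zero */
    __m128i y = _mm_cvttps_epi32(_mm_div_ps(_mm_cvtepi32_ps(x), _mm_cvtepi32_ps(q)));

    /* narrow 32 -> 16 -> 8 bits; results are 0..255, so nothing saturates */
    y = _mm_packs_epi32(y, zero);
    y = _mm_packus_epi16(y, zero);

    r = _mm_cvtsi128_si32(y);
    memcpy(dst, &r, 4);
}

/* Compares against scalar division for every dividend and non-zero divisor. */
static int count_mismatches(void)
{
    int bad = 0;
    for (int x = 0; x < 256; ++x) {
        for (int q = 1; q < 256; ++q) {
            uint8_t src[4] = { (uint8_t)x, (uint8_t)(255 - x), (uint8_t)(x / 2), 255 };
            uint8_t div[4] = { (uint8_t)q, (uint8_t)q, (uint8_t)(256 - q), 1 };
            uint8_t dst[4];
            div4_u8(src, div, dst);
            for (int i = 0; i < 4; ++i)
                bad += dst[i] != src[i] / div[i];
        }
    }
    return bad;
}
```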
Thomas_W_Intel
Employee
1,184 Views
May I suggest an alternative solution that should be faster?

Use a 4-byte shift instruction, but without converting to 4-Byte integers.

When you apply a 4-byte shift, the 1-byte values are shifted the same way. The only problem is that zeros are shifted in at the 4-byte level, not at the byte level, so bits from the neighboring byte leak into the top of each byte. This can be fixed with an "and" instruction using an appropriate mask:

__m128i a = _mm_set_epi8(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16);
__m128i tmp = _mm_srli_epi32(a, count);
__m128i res = _mm_and_si128(tmp, _mm_set1_epi8((1 << (8-count)) - 1));
(untested code)

This way, you use only 2-3 instructions.
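A self-contained sketch of this shift-and-mask idea, assuming unsigned bytes and a count in the range 0..7. The function name srl_epu8_mask is made up for illustration. The mask 0xFF >> count keeps the low (8 - count) bits of every byte, which is the same value as (1 << (8 - count)) - 1.

```c
#include <emmintrin.h> /* SSE2 */

/* Logical right shift of 16 unsigned bytes by count (0..7). */
static __m128i srl_epu8_mask(__m128i v, int count)
{
    /* shift whole 32-bit lanes; the low bits of each byte spill into
       the top of the byte below it */
    __m128i shifted = _mm_srl_epi32(v, _mm_cvtsi32_si128(count));

    /* clear the spilled bits: keep only the low (8 - count) bits per byte */
    __m128i mask = _mm_set1_epi8((char)(0xFF >> count));

    return _mm_and_si128(shifted, mask);
}
```

When count is a compile-time constant, the mask is a constant as well and can stay in a register across a loop.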
Thomas_W_Intel
Employee
1,184 Views
Quoting mrphantuan

If the power of 2 is constant, you could try the following in SSE (in pseudocode):

input = _mm_set_epi8(...);
input_4_shift_a = _mm_unpacklo_epi8(input, zero);

input_4_shift_b = _mm_unpackhi_epi8(input, zero);

_mm_srli_epi16(input_4_shift_a, count) // count is the exponent: divisor = 2^count
_mm_srli_epi16(input_4_shift_b, count) // count is the exponent: divisor = 2^count

Repeat the above for the next 16 chars.


However, this requires 6 instructions (2 unpack, 2 shift, 2 pack) in order to process 16 values. If you use

__m128i tmp = _mm_srli_epi32(a, count);

__m128i res = _mm_and_si128(tmp, _mm_set1_epi8((1 << (8-count)) - 1));

you need only 3 instructions (shift, load, and), because "_mm_set1_epi8((1 << (8-count)) - 1)" can be evaluated at compile time. If this operation is used in a tight loop and the result of _mm_set1_epi8((1 << (8-count)) - 1) can be kept in a register, you effectively need only 2 instructions per loop trip.
