- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

y=x/4

tried _mm_div_epi16(x,4) , it could not work. in the manual i have seen _mm_div_pd(a,b) but this is for sse and not sse2. plaese help

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

There is no instruction capable of integer division in SSEx hence no intrinsic. However, for a division by a power of two, you can use the shift intrinsic : _mm_srai_epi16 for signed integers and _mm_srli_epi16 for unsigned integers.

For other values, constants, you might be able to use multiplication followed by a shift in order to achieve an integer division.

Regards,

Matthieu

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

y = x*.25

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

y= _mm_castps_si128(_mm_div_ps(_mm_castsi128_ps(x), _mm_castsi128_ps(q));

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

I have similiar question about that:

I would like to divide 8 bit (char) by power or 2.

but there is no bit shift or divide intrinsic for 8bit data array in SSE1/SSE2.

how should I do ?

thank you.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

if the power of 2 is constant,you could try the following in sse (in pseudocode)

input = mm_set_epi8(

input_4_shift_a=mm_unpacklo_epi8(input, 0);

input_4_shift_b=mm_unpackhi_epi8(input, 0);

mm_srli_epi16(input_4_shift_a, count) // count from 2^{count}mm_srli_epi16(input_4_shift_b, count) // count from 2^{count}

repeat the above for the next 16 chars

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

as your instruction, I do the code:

/*

*

*

*/

#include

#include

#define MALLOC_ALIGN_16BYTE(_size) _aligned_malloc( _size, 16)

#define FREE_ALIGN_16BYTE(ptr) _aligned_free(ptr)

int main(void)

{

int n = 32;

char *input, *sseOut;

input = (char*)MALLOC_ALIGN_16BYTE(n*sizeof(char));

sseOut = (char*)MALLOC_ALIGN_16BYTE(n*sizeof(char));

for(int i = 0; i< n;i++){

input

*= i;*

}/*for i*/

__m128i *pInput, *pOutput;

__m128i zero;

pInput = (__m128i*)input;

pOutput =(__m128i*)sseOut;

zero = _mm_set_epi32(0, 0, 0, 0);

int m = n/8;

__m128i temp1, temp2 ;

__m128i out1, out2;

for(int i = 0; i< m;i++){

temp1 = _mm_unpacklo_epi8(pInput

}/*for i*/

__m128i *pInput, *pOutput;

__m128i zero;

pInput = (__m128i*)input;

pOutput =(__m128i*)sseOut;

zero = _mm_set_epi32(0, 0, 0, 0);

int m = n/8;

__m128i temp1, temp2 ;

__m128i out1, out2;

for(int i = 0; i< m;i++){

temp1 = _mm_unpacklo_epi8(pInput

*, zero);*

temp2 = _mm_unpackhi_epi8(pInputtemp2 = _mm_unpackhi_epi8(pInput

*, zero);*

out1 = _mm_srli_epi16(temp1 , 1);

out2 = _mm_srli_epi16(temp2 , 1);

}/*for i*/

FREE_ALIGN_16BYTE(input);

FREE_ALIGN_16BYTE(sseOut);

return 0;

}/*main*/

/*

*

*/

I found that I could not restore to original order.....

that is , temp1 = 0 , 0 , 0, 0, 1, 0 , 1, 0, 2, 0, 2, 0 ,3, 0, 3, 0

temp2 = 4 , 0 , 4, 0, 5, 0 , 5, 0, 6, 0, 6, 0 ,7, 0, 7, 0

for first sixteen values.

I wish merge the output is :

0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6,6, 7, 7

what instruction I should use to the goal ?

thank you lots.

out1 = _mm_srli_epi16(temp1 , 1);

out2 = _mm_srli_epi16(temp2 , 1);

}/*for i*/

FREE_ALIGN_16BYTE(input);

FREE_ALIGN_16BYTE(sseOut);

return 0;

}/*main*/

/*

*

*/

I found that I could not restore to original order.....

that is , temp1 = 0 , 0 , 0, 0, 1, 0 , 1, 0, 2, 0, 2, 0 ,3, 0, 3, 0

temp2 = 4 , 0 , 4, 0, 5, 0 , 5, 0, 6, 0, 6, 0 ,7, 0, 7, 0

for first sixteen values.

I wish merge the output is :

0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6,6, 7, 7

what instruction I should use to the goal ?

thank you lots.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

if you want the temp1 to look like this

0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6,6, 7, 7

the following should do it

temp1 = _mm_unpacklo_epi8(pInput*,pInput );*

coming back to the original problem (to apply division to 8 of your input chars at once) to bring back your result to char array you will need 2 byte shuffles mm_shuffle_epi8 and 1bzte blend mm_blendv_epi8

I did not check which SSE version is required for those ops.

you will need to increase your sseOut by two to be able to play around with the data after unpacklo,hi

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

an update to my last post

- you do not need a blend but an or
- you do not need to allocate twice as much data for sseOut

Here is an excerpt of non optimized code that will do that

temp1 = _mm_unpacklo_epi8(pInput[0], zero);

temp2 = _mm_unpackhi_epi8(pInput[0], zero);

out1 = _mm_srli_epi16(temp1 , 1);

out2 = _mm_srli_epi16(temp2 , 1);

__m128i shufMaskLo = _mm_set_epi8(0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,14,12,10,8,6,4,2,0);

__m128i shufMaskHi = _mm_set_epi8(14,12,10,8,6,4,2,0,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF);

temp1 = _mm_shuffle_epi8(out1, shufMaskLo);

temp2 = _mm_shuffle_epi8(out2, shufMaskHi);

pOutput[0] = _mm_or_si128(temp1,temp2);

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

I have found the "standard" solution to it.

that is converion to int (4byte).

char *pSrc, *pDst is input and output array.

__m128i treat;

treat = _mm_cvtsi32_si128(pSrc

*);*

treat = _mm_unpacklo_epi8( treat, _mm_setzero_si128());

treat = _mm_unpacklo_epi16(treat, _mm_setzero_si128());

/*that is 4-byte integer now !!*/

:/*do what you want to do here*/

:

treat = _mm_packs_epi32(treat, _mm_setzero_si128());

treat = _mm_packs_epi16(treat, _mm_setzero_si128());

treat = _mm_unpacklo_epi8( treat, _mm_setzero_si128());

treat = _mm_unpacklo_epi16(treat, _mm_setzero_si128());

/*that is 4-byte integer now !!*/

:/*do what you want to do here*/

:

treat = _mm_packs_epi32(treat, _mm_setzero_si128());

treat = _mm_packs_epi16(treat, _mm_setzero_si128());

`pDst`

*= _mm_cvtsi128_si32**( treat);*

please ref The Software Optimization Cookbook

please ref The Software Optimization Cookbook

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Use a 4-byte shift instruction, but without converting to 4-Byte integers.

When you apply a 4-byte shift, the 1-byte values will be shifted the same way. The only problem is that you are not shifting in zeros on a byte level, but on a 4-byte level. This can be fixed with a "and" instruction using an appropriate mask:

__m128i a = _mm_set_epi8(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16);

__m128i tmp = _mm_srai_epi32 (a, count);

__m128i res = _mm_and_si128(tmp, _mm_set1_epi8(1 << (8-count) - 1);

(untested code)

This way, you use only 2-3 instructions.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

if the power of 2 is constant,you could try the following in sse (in pseudocode)

input = mm_set_epi8(

input_4_shift_a=mm_unpacklo_epi8(input, 0);

input_4_shift_b=mm_unpackhi_epi8(input, 0);

mm_srli_epi16(input_4_shift_a, count) // count from 2^{count}mm_srli_epi16(input_4_shift_b, count) // count from 2^{count}

repeat the above for the next 16 chars

______________________

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

if the power of 2 is constant,you could try the following in sse (in pseudocode)

input = mm_set_epi8(

input_4_shift_a=mm_unpacklo_epi8(input, 0);

input_4_shift_b=mm_unpackhi_epi8(input, 0);

mm_srli_epi16(input_4_shift_a, count) // count from 2^{count}mm_srli_epi16(input_4_shift_b, count) // count from 2^{count}

repeat the above for the next 16 chars

However, this requires 6 instructions (2 unpack, 2 shift, 2 pack) in order to process 16 values. If you use

__m128i tmp = _mm_srai_epi32 (a, count);

__m128i res = _mm_and_si128(tmp, _mm_set1_epi8(1 << (8-count) - 1);

you need only 3 instructions (shift, load, and), because "_mm_set1_epi8(1 << (8-count) - 1)" can be evaluated at compile time. Ifthisoperation is used in a tight loop and the result of _mm_set1_epi8(1 << (8-count) - 1) can be kept in a register,you effectively need only 2 instructions per loop trip.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page