__m128 vs __m128i

ivantsou · 08-11-2010

Hi,

I have the piece of code below and would like to rewrite it with SSE intrinsics, but I ran into a problem with the __m128 and __m128i types. As the code shows, "level" has to be computed as __m128i in order to do the shift operation (there is no shift instruction for __m128; please correct me if I am wrong). On the next line, however, the computation mixes the float "reQuantDownScaleFactor" with the int "level" and converts the result to an integer. Is there a way to do this kind of calculation with SSE intrinsics? Thank you for any answers or comments!

------------------------------------------------------------------

static void Quantize(VideoParameters *p_Vid, Slice *currEncSlice, Macroblock *currEncMB, Slice *currDecSlice, int blkChroma, int intra, int coff[4][4], short reQuantQp)
{
  int i, j;
  int scaled_coeff, q_bits, level;
  int qp_per;
  QuantParameters *p_Quant = p_Vid->p_Quant;
  LevelQuantParams **q_params_4x4;
  LevelQuantParams *q_params = NULL;
  float reQuantDownScaleFactor[4][4] = {
    {4.00f, 3.20f, 4.00f, 3.20f},
    {3.20f, 2.56f, 3.20f, 2.56f},
    {4.00f, 3.20f, 4.00f, 3.20f},
    {3.20f, 2.56f, 3.20f, 2.56f}
  };

  for (i = 0; i < 4; i++)
  {
    for (j = 0; j < 4; j++)
    {
      if (coff != 0)
      {
        q_params = &q_params_4x4;
        scaled_coeff = iabs (coff) * q_params->ScaleComp;
        if ((i == 0) && (j == 0))
        {
          level = (scaled_coeff + (q_params->OffsetComp << 1) ) >> q_bits;
        }
        else
        {
          level = (scaled_coeff + q_params->OffsetComp) >> q_bits;
        }
        level = (int) ((2.0*level + (reQuantDownScaleFactor)) / (2*reQuantDownScaleFactor));
        coff = isignab(level, coff);
      }

------------------------------------------------------------------

Best,

Ivan

Brijender_B_Intel · 08-11-2010

It looks like you can define level as __m128i. You can use a convert intrinsic to convert the result of the last calculation from float to int:
_mm_cvtps_epi32(__m128 _A);
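
To make the int/float round trip concrete, here is a minimal SSE2 sketch for four values at a time. The helper name requant_row and its parameters are assumptions for illustration, not part of the original code: sum4 stands for the four (scaled_coeff + offset) sums and factor4 for one row of reQuantDownScaleFactor. Note that _mm_cvtps_epi32 rounds according to the current MXCSR rounding mode (round to nearest by default), while the C (int) cast truncates, so the sketch uses the truncating variant _mm_cvttps_epi32. Also, the scalar expression is evaluated in double (because of the 2.0 literal) and this sketch works in float, so results may differ in rare cases that fall right on an integer boundary.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: shift four integer sums, then requantize them
   with four float factors, mirroring
   level = (int)((2.0*level + f) / (2*f)) in single precision. */
static __m128i requant_row(__m128i sum4, int q_bits, __m128 factor4)
{
    /* Integer part: arithmetic right shift, same count in all lanes. */
    __m128i level = _mm_sra_epi32(sum4, _mm_cvtsi32_si128(q_bits));

    /* int -> float so the values can be mixed with the float factors. */
    __m128 lf = _mm_cvtepi32_ps(level);

    /* (2*level + f) / (2*f) */
    __m128 num = _mm_add_ps(_mm_add_ps(lf, lf), factor4);
    __m128 den = _mm_add_ps(factor4, factor4);
    __m128 q   = _mm_div_ps(num, den);

    /* float -> int with truncation, like the C (int) cast. */
    return _mm_cvttps_epi32(q);
}
```

The result stays in a __m128i, so further integer steps such as restoring the sign can continue without leaving the SSE registers.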

Also, I suggest moving the i==0 and j==0 case out of the loop when you vectorize the code. Then the loop body is branch-free, which will be faster.
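
One possible way to hoist the i==0, j==0 case is sketched below. It assumes a simplified layout that is not in the original code: scaled[][] already holds iabs(coeff) * ScaleComp and offset[][] holds OffsetComp, both as plain 4x4 int arrays, and the function name quant_levels_4x4 is hypothetical. Row 0 is handled once before the loop, with the offset added a second time in lane 0 only (equivalent to OffsetComp << 1), so the remaining rows run without any branch.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: level = (scaled + offset) >> q_bits for a 4x4
   block, with the DC element (row 0, lane 0) using offset << 1. */
static void quant_levels_4x4(int scaled[4][4], int offset[4][4],
                             int q_bits, int level[4][4])
{
    const __m128i shift = _mm_cvtsi32_si128(q_bits);
    __m128i s, o;
    int i;

    /* Row 0, done outside the loop: lane 0 gets its offset twice. */
    s = _mm_loadu_si128((const __m128i *)scaled[0]);
    o = _mm_loadu_si128((const __m128i *)offset[0]);
    o = _mm_add_epi32(o, _mm_cvtsi32_si128(offset[0][0])); /* {off,0,0,0} */
    _mm_storeu_si128((__m128i *)level[0],
                     _mm_sra_epi32(_mm_add_epi32(s, o), shift));

    /* Rows 1..3: no special case, so no branch in the loop body. */
    for (i = 1; i < 4; i++) {
        s = _mm_loadu_si128((const __m128i *)scaled[i]);
        o = _mm_loadu_si128((const __m128i *)offset[i]);
        _mm_storeu_si128((__m128i *)level[i],
                         _mm_sra_epi32(_mm_add_epi32(s, o), shift));
    }
}
```

With the parameters stored as an array of LevelQuantParams structs, as in the original function, the ScaleComp and OffsetComp fields would first have to be gathered into contiguous arrays like these before they can be loaded four at a time.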