MMX intrinsics performed badly

Smart_Lubobya · ‎10-19-2010

I could not understand why the MMX code was slower than the C++ code. The C++ version took 0.000180 ms and the MMX intrinsics version took 0.000280 ms. Any explanation? I thought parallel addition was faster than serial addition!
#include "stdafx.h"
#include <windows.h>   // QueryPerformanceCounter, QueryPerformanceFrequency
#include <mmintrin.h>  // MMX intrinsics: __m64, _mm_add_pi16, _mm_empty, ...
#include <stdio.h>     // printf

int _tmain(int argc, _TCHAR* argv[])
{
    UINT64 startCount, endCount, diffCount, freq;
    short block[4][4] = {1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4};
    int j;

    // Start the timer immediately before the code under test.
    QueryPerformanceCounter((LARGE_INTEGER*)&startCount);

    // C++ code
    /*
    for (j = 0; j < 4; j++)
    {
        int s0 = block[0][j] + block[3][j];
        int s3 = block[0][j] - block[3][j];
        int s1 = block[1][j] + block[2][j];
        int s2 = block[1][j] - block[2][j];
        block[0][j] = s0 + s1;
        block[2][j] = s0 - s1;
        block[1][j] = s2 + (s3 << 1);
        block[3][j] = s3 - (s2 << 1);
    }
    */

    // MMX code: each __m64 holds one row of four 16-bit values
    __m64* block2 = (__m64*)block;
    __m64 s0, s1, s2, s3;
    j = 0;
    s0 = _mm_add_pi16(block2[j], block2[3+j]);
    s3 = _mm_sub_pi16(block2[j], block2[3+j]);
    s1 = _mm_add_pi16(block2[1+j], block2[2+j]);
    s2 = _mm_sub_pi16(block2[1+j], block2[2+j]);
    block2[j]   = _mm_add_pi16(s0, s1);
    block2[2+j] = _mm_sub_pi16(s0, s1);
    block2[1+j] = _mm_add_pi16(s2, _mm_slli_pi16(s3, 1));
    block2[3+j] = _mm_sub_pi16(s3, _mm_slli_pi16(s2, 1));
    _mm_empty();

    // Stop the timer only after the code under test has run.
    QueryPerformanceCounter((LARGE_INTEGER*)&endCount);

    diffCount = endCount - startCount;
    QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
    double exeTime_in_ms = (double)diffCount * 1000.0 / freq;
    printf("Executing time : %fms\n", exeTime_in_ms);
    return 0;
}

Thomas_W_Intel · ‎10-20-2010

It is really hard to measure such a short time precisely. I suggest that you put a loop around your code and execute it 1000 times.
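A minimal sketch of the loop suggested above, under some stated assumptions: it uses `std::chrono::steady_clock` instead of `QueryPerformanceCounter` so that it is portable, it times the scalar pass rather than the MMX one, and it assumes the scalar indices are `block[row][j]`. The names `transform_pass` and `average_pass_ms` are hypothetical helpers, not from the original post. The checksum exists only so the compiler cannot discard the work being timed.

```cpp
#include <chrono>

// One butterfly pass over the four columns of a 4x4 block, following the
// scalar loop in the question (indices assumed to be block[row][j]).
void transform_pass(short block[4][4]) {
    for (int j = 0; j < 4; ++j) {
        int s0 = block[0][j] + block[3][j];
        int s3 = block[0][j] - block[3][j];
        int s1 = block[1][j] + block[2][j];
        int s2 = block[1][j] - block[2][j];
        block[0][j] = static_cast<short>(s0 + s1);
        block[2][j] = static_cast<short>(s0 - s1);
        // 2 * x replaces x << 1, which is not portable for negative x
        // before C++20.
        block[1][j] = static_cast<short>(s2 + 2 * s3);
        block[3][j] = static_cast<short>(s3 - 2 * s2);
    }
}

// Hypothetical helper: runs `iterations` passes inside one timed region and
// returns the average time per pass in milliseconds. `checksum` accumulates
// one output value per pass so the optimizer has to keep the loop.
double average_pass_ms(int iterations, long long& checksum) {
    using clock = std::chrono::steady_clock;
    checksum = 0;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        short block[4][4] = {{1, 1, 1, 1}, {2, 2, 2, 2},
                             {3, 3, 3, 3}, {4, 4, 4, 4}};
        // Vary one input per iteration so the passes are not all identical.
        block[0][0] = static_cast<short>(i & 7);
        transform_pass(block);
        checksum += block[0][0];
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

Calling `average_pass_ms(1000, checksum)` and printing both the returned time and `checksum` gives an average over 1000 passes. The same structure should apply to the MMX version by swapping the body of `transform_pass`, with `_mm_empty()` called once after the loop.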

Furthermore, I suggest that you have a look at the generated assembly code to verify that the compiler generates the code that you expect.