I cannot understand why my MMX code is slower than the plain C++ version. The C++ version took 0.000180 ms and the MMX intrinsics version took 0.000280 ms. Is there any explanation? I thought parallel addition was faster than serial addition!
#include "stdafx.h"
#include <windows.h>
#include <stdio.h>
#include <mmintrin.h>
int _tmain(int argc, _TCHAR* argv[])
{
UINT64 startCount, endCount, diffCount, freq;
QueryPerformanceCounter((LARGE_INTEGER*)&startCount);
QueryPerformanceCounter((LARGE_INTEGER*)&endCount);
short block[4][4] ={1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4};
int j;
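// C++ code
/*
for(j =0;j<4;j++)
{
    int s0 =block[0][j] + block[3][j];
    int s3 =block[0][j] - block[3][j];
    int s1 =block[1][j] + block[2][j];
    int s2 =block[1][j] - block[2][j];
    block[0][j] = s0 + s1;
    block[2][j] = s0 - s1;
    block[1][j] = s2 + (s3<<1);
    block[3][j] = s3 - (s2<<1);
}
*/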
// MMX code
__m64*block2 =(__m64*)block;
__m64 s0,s1,s2,s3;
j=0;
s0 =_mm_add_pi16(block2[j],block2[3+j]);
s3 =_mm_sub_pi16(block2[j],block2[3+j]);
s1 =_mm_add_pi16(block2[1+j],block2[2+j]);
s2 =_mm_sub_pi16(block2[1+j],block2[2+j]);
block2[j]= _mm_add_pi16(s0,s1);
block2[2+j]= _mm_sub_pi16(s0,s1);
block2[1+j]= _mm_add_pi16(s2,(_mm_slli_pi16(s3,1)));
block2[3+j]= _mm_sub_pi16(s3,(_mm_slli_pi16(s2,1)));
_mm_empty();
diffCount = endCount - startCount;
QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
double exeTime_in_ms = (double)diffCount * 1000.0 / freq;
printf("Executing time : %fms\n", exeTime_in_ms);
return 0;
}
Furthermore, I suggest that you have a look at the generated assembly code to verify that the compiler generates the code that you expect.
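It may also help to check what is actually being timed. In the posted code both QueryPerformanceCounter calls come before the transform, so the measured difference does not include it, and a single pass of about eight operations is too short to time reliably in any case. Below is a minimal sketch of a repeated measurement. It swaps in std::chrono::steady_clock for QueryPerformanceCounter so it is not tied to Windows. The names transform_scalar, transform_mmx, check_same_result and time_ms are hypothetical, and the iteration count is arbitrary.

```cpp
#include <cassert>
#include <chrono>
#include <cstring>
#include <mmintrin.h>  // MMX intrinsics (x86 only)

// Scalar version of the 4x4 butterfly from the question, one column per pass.
void transform_scalar(short block[4][4]) {
    for (int j = 0; j < 4; ++j) {
        int s0 = block[0][j] + block[3][j];
        int s3 = block[0][j] - block[3][j];
        int s1 = block[1][j] + block[2][j];
        int s2 = block[1][j] - block[2][j];
        block[0][j] = (short)(s0 + s1);
        block[2][j] = (short)(s0 - s1);
        block[1][j] = (short)(s2 + 2 * s3);
        block[3][j] = (short)(s3 - 2 * s2);
    }
}

// MMX version: each row is 4 x 16-bit = 64 bits, so all four columns
// are processed at once.
void transform_mmx(short block[4][4]) {
    __m64 rows[4];
    static_assert(sizeof(rows) == sizeof(short[4][4]), "row layout mismatch");
    std::memcpy(rows, block, sizeof rows);
    __m64 s0 = _mm_add_pi16(rows[0], rows[3]);
    __m64 s3 = _mm_sub_pi16(rows[0], rows[3]);
    __m64 s1 = _mm_add_pi16(rows[1], rows[2]);
    __m64 s2 = _mm_sub_pi16(rows[1], rows[2]);
    rows[0] = _mm_add_pi16(s0, s1);
    rows[2] = _mm_sub_pi16(s0, s1);
    rows[1] = _mm_add_pi16(s2, _mm_slli_pi16(s3, 1));
    rows[3] = _mm_sub_pi16(s3, _mm_slli_pi16(s2, 1));
    std::memcpy(block, rows, sizeof rows);
    _mm_empty();  // leave MMX state before any floating-point code runs
}

// Sanity check: both versions must agree before their timings are compared.
void check_same_result() {
    short a[4][4] = {{1,1,1,1},{2,2,2,2},{3,3,3,3},{4,4,4,4}};
    short b[4][4] = {{1,1,1,1},{2,2,2,2},{3,3,3,3},{4,4,4,4}};
    transform_scalar(a);
    transform_mmx(b);
    assert(std::memcmp(a, b, sizeof a) == 0);
}

// Runs fn `iterations` times and returns the elapsed time in milliseconds.
// The clock is read immediately before and immediately after the work.
template <typename Fn>
double time_ms(Fn fn, int iterations) {
    const short init[4][4] = {{1,1,1,1},{2,2,2,2},{3,3,3,3},{4,4,4,4}};
    short block[4][4];
    volatile short sink = 0;  // keeps the optimizer from discarding the work
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::memcpy(block, init, sizeof block);  // reset so values do not overflow
        fn(block);
        sink = block[0][0];
    }
    auto end = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Calling check_same_result() once, then time_ms(transform_scalar, 1000000) and time_ms(transform_mmx, 1000000), gives two totals that can be divided by the iteration count. The loop also times the memcpy reset, which is the same cost for both versions, so the difference between the totals is the more meaningful number. An optimizing compiler may vectorize the scalar loop on its own, which is one more reason to inspect the generated assembly.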