Speed of intrinsics vs. normal C

thorsan1 · ‎05-11-2006

Hi, I am testing intrinsics against normal C, and the intrinsic code is slower. I am just trying to do a simple vector addition:

void Add_SSE(const __m128* a,const __m128* b,__m128* c)
{
    for(int i=0;i<256;i++)
    {
        c[i] = _mm_add_ps(a[i],b[i]);
    }
}

void Add(const float* a,const float* b,float* c)
{
    for(int i=0;i<LENGTH;i++)
    {
        c[i] = a[i] + b[i];
    }
}

#define MEM_ALGN __declspec(align(32))

void main()
{
    MEM_ALGN __m128 aSSE[LENGTH/4];
    MEM_ALGN __m128 bSSE[LENGTH/4];
    MEM_ALGN __m128 cSSE[LENGTH/4];

    MEM_ALGN float a[LENGTH];
    MEM_ALGN float b[LENGTH];
    MEM_ALGN float c[LENGTH];

    const int N = 1000000;
    DWORD s,e;
    s = timeGetTime();
    for(int i=0;i<N;i++)
        Add_SSE(aSSE,bSSE,cSSE);
    e = timeGetTime();

    std::cout << "SSE Took: " << e-s << " ms" << std::endl;

    s = timeGetTime();
    for(int i=0;i<N;i++)
        Add(a,b,c);
    e = timeGetTime();

    std::cout << "C Took: " << e-s << " ms" << std::endl;
}

The intrinsic function takes 20% more time. I also tried using IPP, but it runs as slowly as the intrinsics. What am I doing wrong? This is on a Pentium 4.

Thank you for any help

thorsan
Message Edited by thorsan on 05-11-2006 07:46 AM
Message Edited by thorsan on 05-11-2006 07:48 AM

Vladimir_Dudnik · ‎05-12-2006

Hi,

I see you use a 256-element array in the SSE case, but it is not clear what the length is for the C code.

Regards,
Vladimir
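
To illustrate the point about lengths: 256 __m128 values hold 1024 floats, so the hard-coded 256 matches the C loop only if LENGTH is 1024. A minimal sketch of one way to rule out a mismatch is to pass the same float count to both versions. The names kLength, AddSse and AddScalar and the value 1024 are assumptions for illustration, not taken from the original post.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps

// Hypothetical shared length in floats (the post does not show LENGTH).
const std::size_t kLength = 1024;

// SSE version: n counts floats, and each __m128 holds four of them,
// so the loop runs n / 4 times.
void AddSse(const __m128* a, const __m128* b, __m128* c, std::size_t n)
{
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n / 4; ++i)
    {
        c[i] = _mm_add_ps(a[i], b[i]);
    }
}

// Scalar version over the same n floats.
void AddScalar(const float* a, const float* b, float* c, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        c[i] = a[i] + b[i];
    }
}
```

Calling both functions with kLength then guarantees that the two timings cover the same amount of work.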

Vladimir_Dudnik · ‎05-17-2006

An additional comment on this:

You are using uninitialized data of type Ipp32f, which can significantly slow down execution. The second reason is that the compiler will unroll the loop in the plain C code, but it will not unroll the loop in the intrinsic version. If you unroll that loop by hand, you should see a performance improvement.

Vladimir
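
The two suggestions above could be sketched roughly as follows. This is only an illustration, not code from the thread: Fill and AddSseUnrolled are hypothetical names, the unroll factor of two is an arbitrary choice, and the pointers are assumed to be 16-byte aligned. Filling the arrays with ordinary finite values avoids timing arbitrary bit patterns (uninitialized memory may happen to contain denormals or NaNs, which is one plausible cause of the slowdown).

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_add_ps, _mm_store_ps

// Give every element a known, finite value before timing.
void Fill(float* p, std::size_t n, float value)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        p[i] = value;
    }
}

// Vector add unrolled by two: each iteration handles eight floats with two
// independent SSE additions. Assumes n is a multiple of 8 and that a, b and c
// are 16-byte aligned (required by _mm_load_ps and _mm_store_ps).
void AddSseUnrolled(const float* a, const float* b, float* c, std::size_t n)
{
    assert(n % 8 == 0);
    for (std::size_t i = 0; i < n; i += 8)
    {
        __m128 lo = _mm_add_ps(_mm_load_ps(a + i), _mm_load_ps(b + i));
        __m128 hi = _mm_add_ps(_mm_load_ps(a + i + 4), _mm_load_ps(b + i + 4));
        _mm_store_ps(c + i, lo);
        _mm_store_ps(c + i + 4, hi);
    }
}
```

Whether unrolling by two, four or more helps depends on the compiler and the processor, so it is worth timing a few variants on the target machine.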