Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

Speed of intrinsics vs normal C

thorsan1
Beginner
356 Views
Hi, I am testing intrinsics vs. normal C, and the intrinsic code is slower. I am just trying to do a simple vector addition:

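#include <windows.h>    // DWORD, timeGetTime (link with winmm.lib)
#include <xmmintrin.h>  // __m128, _mm_add_ps
#include <iostream>

// Adds 256 SSE vectors (4 floats each) element by element.
void Add_SSE(const __m128* a, const __m128* b, __m128* c)
{
    for (int i = 0; i < 256; i++)
    {
        c[i] = _mm_add_ps(a[i], b[i]);
    }
}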

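// Scalar version; LENGTH is the float count (its definition is not shown in the post).
void Add(const float* a, const float* b, float* c)
{
    for (int i = 0; i < LENGTH; i++)
    {
        c[i] = a[i] + b[i];
    }
}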

#define MEM_ALGN __declspec(align(32))

int main()
{
    MEM_ALGN __m128 aSSE[LENGTH/4];
    MEM_ALGN __m128 bSSE[LENGTH/4];
    MEM_ALGN __m128 cSSE[LENGTH/4];

    MEM_ALGN float a[LENGTH];
    MEM_ALGN float b[LENGTH];
    MEM_ALGN float c[LENGTH];

    const int N = 1000000;
    DWORD s, e;

    // Time N calls of the SSE version.
    s = timeGetTime();
    for (int i = 0; i < N; i++)
        Add_SSE(aSSE, bSSE, cSSE);
    e = timeGetTime();

    std::cout << "SSE Took: " << e - s << " ms" << std::endl;

    // Time N calls of the plain C version.
    s = timeGetTime();
    for (int i = 0; i < N; i++)
        Add(a, b, c);
    e = timeGetTime();

    std::cout << "C Took: " << e - s << " ms" << std::endl;
    return 0;
}


The intrinsic function takes 20% more time. I also tried using IPP, but it runs as slowly as the intrinsics. What am I doing wrong? This is on a Pentium 4.

Thank you for any help

thorsan

Message Edited by thorsan on 05-11-2006 07:46 AM

Message Edited by thorsan on 05-11-2006 07:48 AM


2 Replies
Vladimir_Dudnik
Employee

Hi,

I see you use a 256-element array for the SSE case, but it is not clear what the length is for the C code.

Regards,
Vladimir
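
To make the length question concrete, here is a minimal sketch in which both versions are driven by the same LENGTH, so they process the same number of floats. The value 1024 is an assumption (the post does not show LENGTH); it is chosen only because 256 __m128 vectors hold 1024 floats. ResultsMatch is a hypothetical helper added for illustration.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, _mm_loadu_ps
#include <cstddef>

// Assumption: LENGTH is not given in the post. 1024 is used here only so
// that LENGTH / 4 == 256, the loop bound hard-coded in the original Add_SSE.
const std::size_t LENGTH = 1024;
static_assert(LENGTH % 4 == 0, "LENGTH must be a multiple of 4 floats");

// SSE version: each __m128 holds 4 floats, so the loop runs LENGTH / 4 times.
void Add_SSE(const __m128* a, const __m128* b, __m128* c)
{
    for (std::size_t i = 0; i < LENGTH / 4; i++)
        c[i] = _mm_add_ps(a[i], b[i]);
}

// Scalar version over the same LENGTH floats.
void Add(const float* a, const float* b, float* c)
{
    for (std::size_t i = 0; i < LENGTH; i++)
        c[i] = a[i] + b[i];
}

// Hypothetical helper: fills the inputs with known values, runs both
// versions, and reports whether every output element agrees.
bool ResultsMatch()
{
    static float  a[LENGTH], b[LENGTH], c[LENGTH];
    static __m128 a4[LENGTH / 4], b4[LENGTH / 4], c4[LENGTH / 4];

    for (std::size_t i = 0; i < LENGTH; i++) {
        a[i] = static_cast<float>(i);
        b[i] = 2.0f * static_cast<float>(i);
    }
    // Pack the same float data into the vector arrays.
    for (std::size_t i = 0; i < LENGTH / 4; i++) {
        a4[i] = _mm_loadu_ps(&a[4 * i]);
        b4[i] = _mm_loadu_ps(&b[4 * i]);
    }

    Add(a, b, c);
    Add_SSE(a4, b4, c4);

    for (std::size_t i = 0; i < LENGTH / 4; i++) {
        float lanes[4];
        _mm_storeu_ps(lanes, c4[i]);
        for (int k = 0; k < 4; k++)
            if (lanes[k] != c[4 * i + k]) return false;
    }
    return true;
}
```

With the bound tied to LENGTH, a timing comparison measures the same amount of work in both functions, whatever LENGTH turns out to be.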

Vladimir_Dudnik
Employee

An additional comment on this:

You use uninitialized data for the Ipp32f type, which can significantly slow down execution. A second reason is that the loop in the raw C code will be unrolled by the compiler, but the loop in the intrinsic case will not be. If you unroll it by hand, you should get a performance improvement.

Vladimir
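
The two suggestions in this reply (initialized inputs and a hand-unrolled loop) could look roughly like the sketch below. It is an illustration under stated assumptions, not a tuned implementation: AddSseUnrolled and Fill are hypothetical names, the unroll factor of 4 is arbitrary, and the float count is assumed to be a multiple of 16 with 16-byte-aligned pointers. One likely reason uninitialized data hurts is that leftover memory can happen to contain denormal or NaN bit patterns, which are commonly much slower to process.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_add_ps, _mm_store_ps
#include <cstddef>

// Hypothetical helper: gives every element a well-defined, normal value so
// the timed loop never operates on leftover memory contents.
void Fill(float* p, std::size_t n, float value)
{
    for (std::size_t i = 0; i < n; i++)
        p[i] = value;
}

// Hand-unrolled SSE add: four independent 4-float additions per iteration.
// Assumptions: n is a multiple of 16 and a, b, c are 16-byte aligned
// (required by _mm_load_ps / _mm_store_ps).
void AddSseUnrolled(const float* a, const float* b, float* c, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 16) {
        __m128 r0 = _mm_add_ps(_mm_load_ps(a + i),      _mm_load_ps(b + i));
        __m128 r1 = _mm_add_ps(_mm_load_ps(a + i + 4),  _mm_load_ps(b + i + 4));
        __m128 r2 = _mm_add_ps(_mm_load_ps(a + i + 8),  _mm_load_ps(b + i + 8));
        __m128 r3 = _mm_add_ps(_mm_load_ps(a + i + 12), _mm_load_ps(b + i + 12));
        _mm_store_ps(c + i,      r0);
        _mm_store_ps(c + i + 4,  r1);
        _mm_store_ps(c + i + 8,  r2);
        _mm_store_ps(c + i + 12, r3);
    }
}
```

A caller would run Fill on both input arrays before starting the timer and then call AddSseUnrolled inside the timed loop. Whether the unrolling helps depends on the compiler and optimization settings, so it is worth measuring rather than assuming.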
