Hi, I am testing intrinsics against plain C, and the intrinsic code is slower. I am just trying to do a simple vector addition:
#include <windows.h>   // DWORD, timeGetTime (link with winmm.lib)
#include <xmmintrin.h> // __m128, _mm_add_ps
#include <iostream>

// LENGTH is defined elsewhere (its value is not shown in this post)
#define MEM_ALGN __declspec(align(32))

void Add_SSE(const __m128* a,const __m128* b,__m128* c)
{
    for(int i=0;i<256;i++)
    {
        c[i] = _mm_add_ps(a[i],b[i]);
    }
}

void Add(const float* a,const float* b,float* c)
{
    for(int i=0;i<LENGTH;i++)
    {
        c[i] = a[i] + b[i];
    }
}

int main()
{
    MEM_ALGN __m128 aSSE[LENGTH/4];
    MEM_ALGN __m128 bSSE[LENGTH/4];
    MEM_ALGN __m128 cSSE[LENGTH/4];
    MEM_ALGN float a[LENGTH];
    MEM_ALGN float b[LENGTH];
    MEM_ALGN float c[LENGTH];
    const int N = 1000000;
    DWORD s,e;

    // Time N calls of the intrinsic version
    s = timeGetTime();
    for(int i=0;i<N;i++)
        Add_SSE(aSSE,bSSE,cSSE);
    e = timeGetTime();
    std::cout << "SSE Took: " << e-s << " ms" << std::endl;

    // Time N calls of the plain C version
    s = timeGetTime();
    for(int i=0;i<N;i++)
        Add(a,b,c);
    e = timeGetTime();
    std::cout << "C Took: " << e-s << " ms" << std::endl;
}
The intrinsic function takes 20% more time. I also tried using IPP, but it runs as slowly as the intrinsics. What am I doing wrong? This is on a Pentium 4.
Thank you for any help
thorsan
Message Edited by thorsan on 05-11-2006 07:46 AM
Message Edited by thorsan on 05-11-2006 07:48 AM
2 Replies
Hi,
I see you use a 256-element array for the SSE case, but it is not clear what the length is for the C code.
Regards,
Vladimir
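To make the length question concrete, here is a minimal sketch in which both versions are driven by one shared element count, so the timing compares equal amounts of work. The value 1024 (that is, 256 SSE vectors of 4 floats) and the names kLength, kVectors, add_sse and add_scalar are assumptions for illustration; the original post does not show what LENGTH is.

```cpp
#include <xmmintrin.h>  // __m128, _mm_add_ps
#include <cstddef>

// Assumption: 1024 floats in total; the post does not give LENGTH.
constexpr std::size_t kLength  = 1024;
constexpr std::size_t kVectors = kLength / 4;  // 4 floats per __m128

// SSE version: loops over kVectors packed vectors (kLength floats).
void add_sse(const __m128* a, const __m128* b, __m128* c)
{
    for (std::size_t i = 0; i < kVectors; ++i)
        c[i] = _mm_add_ps(a[i], b[i]);
}

// Scalar version: loops over the same kLength floats one at a time.
void add_scalar(const float* a, const float* b, float* c)
{
    for (std::size_t i = 0; i < kLength; ++i)
        c[i] = a[i] + b[i];
}
```

With both loop bounds derived from the same constant, a difference in the measured times cannot come from one function simply touching more elements than the other.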
An additional comment on this:
You use uninitialized data for the Ipp32f type, which can significantly slow down execution. The second reason is that the loop in the raw C code will be unrolled by the compiler, but the loop in the intrinsic case will not be. If you unroll it by hand, you should get a performance improvement.
Vladimir
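Both suggestions can be sketched as follows. This is only an illustration under stated assumptions, not a tuned implementation: the helper names fill and add_sse_unrolled and the unroll factor of 4 are hypothetical choices. One plausible reason uninitialized inputs hurt is that leftover memory may happen to hold denormal or NaN bit patterns, which are slow to process, so the buffers are filled with ordinary values first.

```cpp
#include <xmmintrin.h>  // __m128, _mm_add_ps, _mm_set1_ps
#include <cstddef>

// Fill n vectors with an ordinary finite value before timing.
void fill(__m128* v, std::size_t n, float value)
{
    for (std::size_t i = 0; i < n; ++i)
        v[i] = _mm_set1_ps(value);
}

// Vector add with the loop body unrolled by hand, 4 vectors per iteration.
void add_sse_unrolled(const __m128* a, const __m128* b, __m128* c, std::size_t n)
{
    const std::size_t main_end = n - (n % 4);  // largest multiple of 4 <= n
    std::size_t i = 0;
    for (; i < main_end; i += 4) {
        c[i]     = _mm_add_ps(a[i],     b[i]);
        c[i + 1] = _mm_add_ps(a[i + 1], b[i + 1]);
        c[i + 2] = _mm_add_ps(a[i + 2], b[i + 2]);
        c[i + 3] = _mm_add_ps(a[i + 3], b[i + 3]);
    }
    for (; i < n; ++i)                         // leftover vectors, if any
        c[i] = _mm_add_ps(a[i], b[i]);
}
```

A caller would run fill on both input arrays once, outside the timed region, and then time add_sse_unrolled. Whether the manual unrolling helps depends on the compiler and its optimization settings, so it is worth measuring both variants.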