Hi, I am testing intrinsics against plain C, and the intrinsic code is slower. I am just trying to do a simple vector addition:
#include <xmmintrin.h>  // __m128, _mm_add_ps
#include <windows.h>    // DWORD, timeGetTime (link with winmm.lib)
#include <iostream>

// Adds 256 SSE vectors (4 floats each).
void Add_SSE(const __m128* a,const __m128* b,__m128* c)
{
    for(int i=0;i<256;i++)
    {
        c[i] = _mm_add_ps(a[i],b[i]);
    }
}
// Adds LENGTH floats one at a time.
void Add(const float* a,const float* b,float* c)
{
    for(int i=0;i<LENGTH;i++)
    {
        c[i] = a[i] + b[i];
    }
}
#define MEM_ALGN __declspec(align(32))
int main()
{
    MEM_ALGN __m128 aSSE[LENGTH/4];
    MEM_ALGN __m128 bSSE[LENGTH/4];
    MEM_ALGN __m128 cSSE[LENGTH/4];
    MEM_ALGN float a[LENGTH];
    MEM_ALGN float b[LENGTH];
    MEM_ALGN float c[LENGTH];
    const int N = 1000000;
    DWORD s,e;
    // Time N calls of the SSE version.
    s = timeGetTime();
    for(int i=0;i<N;i++)
        Add_SSE(aSSE,bSSE,cSSE);
    e = timeGetTime();
    std::cout << "SSE Took: " << e-s << " ms" << std::endl;
    // Time N calls of the plain C version.
    s = timeGetTime();
    for(int i=0;i<N;i++)
        Add(a,b,c);
    e = timeGetTime();
    std::cout << "C Took: " << e-s << " ms" << std::endl;
}
The intrinsic function takes 20% more time. I also tried using IPP, but it runs as slowly as the intrinsics. What am I doing wrong? This is on a Pentium 4.
Thank you for any help
thorsan
Message Edited by thorsan on 05-11-2006 07:46 AM
Message Edited by thorsan on 05-11-2006 07:48 AM
2 Replies
Hi,
I see you use a 256-element array for the SSE case, but it is not clear what the length is for the C code.
Regards,
Vladimir
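
To make the point about lengths concrete, here is a minimal sketch in which both versions are given the same element count, so a timing comparison would measure the same amount of work. The names add_scalar, add_sse, count_mismatches and the value kLength = 1024 are hypothetical; the original post does not show LENGTH. The sketch also loads from float pointers with _mm_load_ps instead of passing __m128 arrays, which is a swapped-in choice, not the poster's code.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps
#include <cassert>
#include <cstddef>

// Scalar reference: adds n floats one at a time.
void add_scalar(const float* a, const float* b, float* c, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// SSE version over the same n floats, 4 per iteration.
// Assumes n is a multiple of 4 and all pointers are 16-byte aligned.
void add_sse(const float* a, const float* b, float* c, std::size_t n)
{
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4)
        _mm_store_ps(c + i, _mm_add_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)));
}

// Hypothetical element count, shared by both versions.
constexpr std::size_t kLength = 1024;

// Runs both versions on identical, initialized inputs and returns
// the number of output elements that differ.
std::size_t count_mismatches()
{
    alignas(16) float a[kLength];
    alignas(16) float b[kLength];
    alignas(16) float c_scalar[kLength];
    alignas(16) float c_sse[kLength];
    for (std::size_t i = 0; i < kLength; ++i) {
        a[i] = static_cast<float>(i);
        b[i] = 2.0f * static_cast<float>(i);
    }
    add_scalar(a, b, c_scalar, kLength);
    add_sse(a, b, c_sse, kLength);  // same n: same amount of work
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < kLength; ++i)
        if (c_scalar[i] != c_sse[i])
            ++mismatches;
    return mismatches;
}
```

Passing the count as a parameter removes the hard-coded 256, so the two loops cannot silently cover different numbers of floats.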
An additional comment on this:
You use uninitialized data for the Ipp32f type, which can significantly slow down execution. The second reason is that the loop in the raw C code will be unrolled by the compiler, but the loop in the intrinsic version will not be. If you unroll it by hand, you should get a performance improvement.
Vladimir
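
The two suggestions in this reply could look roughly like the sketch below: inputs are filled with ordinary finite values before timing, and the SSE loop is unrolled by hand so each iteration handles four vectors. The names add_sse_unrolled4 and fill_ramp and the unroll factor of 4 are assumptions for illustration, not something the thread specifies. One plausible reason uninitialized data hurts is that leftover memory may happen to hold denormals or NaNs, which processors of that generation could handle much more slowly; the thread itself does not say which.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps
#include <cassert>
#include <cstddef>

// Adds n floats with SSE, manually unrolled so that each loop
// iteration handles four __m128 vectors (16 floats).
// Assumes n is a multiple of 16 and all pointers are 16-byte aligned.
void add_sse_unrolled4(const float* a, const float* b, float* c, std::size_t n)
{
    assert(n % 16 == 0);
    for (std::size_t i = 0; i < n; i += 16) {
        // Four independent additions; none depends on another's result.
        const __m128 r0 = _mm_add_ps(_mm_load_ps(a + i),      _mm_load_ps(b + i));
        const __m128 r1 = _mm_add_ps(_mm_load_ps(a + i + 4),  _mm_load_ps(b + i + 4));
        const __m128 r2 = _mm_add_ps(_mm_load_ps(a + i + 8),  _mm_load_ps(b + i + 8));
        const __m128 r3 = _mm_add_ps(_mm_load_ps(a + i + 12), _mm_load_ps(b + i + 12));
        _mm_store_ps(c + i,      r0);
        _mm_store_ps(c + i + 4,  r1);
        _mm_store_ps(c + i + 8,  r2);
        _mm_store_ps(c + i + 12, r3);
    }
}

// Fills p[0..n) with step * index, so the benchmark runs on known
// finite values rather than on whatever was left in memory.
void fill_ramp(float* p, std::size_t n, float step)
{
    for (std::size_t i = 0; i < n; ++i)
        p[i] = step * static_cast<float>(i);
}
```

Whether the hand-unrolled loop actually beats the compiler's output depends on the compiler and optimization flags, so it is worth timing both variants on initialized buffers of equal length.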
