Why is this so slow?

thorsan1 · 06-12-2006

Hi

I'm experimenting a little with the SSE2 instructions and IPP. To add two complex vectors I have written the following procedure:

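#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps

void AddSSE(const float* ar, const float* ai, const float* br, const float* bi, float* cr, float* ci)
{
    // View each float array as an array of 4-float SSE vectors (needs 16-byte alignment).
    const __m128* aiSSE = (const __m128*)ai;
    const __m128* biSSE = (const __m128*)bi;
    __m128* ciSSE = (__m128*)ci;
    const __m128* arSSE = (const __m128*)ar;
    const __m128* brSSE = (const __m128*)br;
    __m128* crSSE = (__m128*)cr;

    // 1024 floats per array = 256 vectors of 4 floats each.
    for (int k = 0; k < 256; k++)
    {
        ciSSE[k] = _mm_add_ps(aiSSE[k], biSSE[k]);
        crSSE[k] = _mm_add_ps(arSSE[k], brSSE[k]);
    }
}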

where the vectors ar, ai, ... have length 1024. Then I compute the same add using IPP functions:

void AddIPP(const Ipp32fc* a, const Ipp32fc* b, Ipp32fc* c)
{
    ippsAdd_32fc(a, b, c, LENGTH);
}

The IPP version runs approximately 10 times faster than the SSE2 version I wrote. What am I doing wrong here? What can I do to speed this up? When computing real-valued adds and muls, I am able to make the SSE code run as fast as IPP, but for the complex ones I am far off.

Thanks,

Thor Andreas

Vladimir_Dudnik · 06-13-2006

Hello,

Why not just use IPP if it is faster?

You may want to unroll the loop and take care that SIMD instruction process multiple data on each call. Also, memory access is faster if the data is aligned on a 16-byte boundary, or even on a cache-line boundary.

Regards,
Vladimir
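
The unrolling and alignment advice above can be sketched roughly as follows. This is only an illustration, not the IPP implementation: the function names add_f32_unrolled and add_complex_split are hypothetical, and the code assumes that n is a multiple of 8 and that every pointer is 16-byte aligned, which the aligned loads and stores (_mm_load_ps, _mm_store_ps) require.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Hypothetical helper: c[i] = a[i] + b[i] for n floats.
   Assumes n is a multiple of 8 and a, b, c are 16-byte aligned. */
static void add_f32_unrolled(const float *a, const float *b, float *c, size_t n)
{
    /* Unrolled by two: each iteration handles 8 floats (two 4-float vectors). */
    for (size_t i = 0; i < n; i += 8) {
        __m128 a0 = _mm_load_ps(a + i);
        __m128 b0 = _mm_load_ps(b + i);
        __m128 a1 = _mm_load_ps(a + i + 4);
        __m128 b1 = _mm_load_ps(b + i + 4);
        _mm_store_ps(c + i,     _mm_add_ps(a0, b0));
        _mm_store_ps(c + i + 4, _mm_add_ps(a1, b1));
    }
}

/* Hypothetical wrapper for complex data kept in separate real/imaginary arrays:
   a complex add is just two independent real adds. */
static void add_complex_split(const float *ar, const float *ai,
                              const float *br, const float *bi,
                              float *cr, float *ci, size_t n)
{
    add_f32_unrolled(ar, br, cr, n);
    add_f32_unrolled(ai, bi, ci, n);
}
```

For the 1024-element vectors in the question, the call would be add_complex_split(ar, ai, br, bi, cr, ci, 1024). Whether unrolling helps in practice depends on the compiler and CPU, so it is worth measuring rather than assuming.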

thorsan · 06-21-2006

Why not use IPP?

It might not be preferable to have the data stored in memory as Ipp32fc, so I would like to find out how to write such functions so that they run fast. The memory used for the testing was 16-byte aligned, so that shouldn't be the problem. Would prefetching make this run faster? Do you think loop unrolling will make it run 10 times faster? I'm not able to test these new approaches right now since I'm on holiday.

thorsan
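
For context on the layout question: Ipp32fc stores each complex value as an adjacent (re, im) pair of floats, while the AddSSE routine in the first post keeps real and imaginary parts in separate arrays. If split data ever has to be handed to an interleaved API, the conversion could be sketched as below. The name interleave_split is hypothetical, and the sketch assumes that n is a multiple of 4 and that all pointers are 16-byte aligned.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_unpacklo_ps, _mm_unpackhi_ps, _mm_store_ps */

/* Hypothetical helper: out[2k] = re[k], out[2k+1] = im[k] for k in [0, n).
   Assumes n is a multiple of 4 and re, im, out are 16-byte aligned.
   out must hold 2*n floats. */
static void interleave_split(const float *re, const float *im, float *out, size_t n)
{
    for (size_t k = 0; k < n; k += 4) {
        __m128 r = _mm_load_ps(re + k);   /* r0 r1 r2 r3 */
        __m128 i = _mm_load_ps(im + k);   /* i0 i1 i2 i3 */
        /* unpacklo gives r0 i0 r1 i1, unpackhi gives r2 i2 r3 i3 */
        _mm_store_ps(out + 2 * k,     _mm_unpacklo_ps(r, i));
        _mm_store_ps(out + 2 * k + 4, _mm_unpackhi_ps(r, i));
    }
}
```

Note that for addition alone the layout makes no arithmetic difference: an interleaved complex add over n values is the same as a plain float add over 2n values.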

thorsan1 · 06-21-2006

"take care that SIMD instruction process multiple data on each call"

What do you mean by this?

thorsan
(sorry, two different IDs)

Vladimir_Dudnik · 06-30-2006

Hello, I meant that you need to study the compiler documentation for details about using intrinsics. The compiler can sometimes generate code that processes one data item at a time, whereas it should process four items at a time (SIMD: single instruction, multiple data).

Regards,
Vladimir
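
The difference between processing one item and four items per instruction can be illustrated with the packed and scalar forms of the SSE add intrinsic. This is a small sketch with hypothetical function names (add_packed, add_scalar); it uses unaligned loads and stores so that it works on any float array.

```c
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_add_ps, _mm_add_ss, _mm_storeu_ps */

/* Packed add: all four lanes are summed (normally emitted as ADDPS). */
static void add_packed(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* Scalar add: only lane 0 is summed; lanes 1..3 are copied from a
   (normally emitted as ADDSS). */
static void add_scalar(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, add_packed produces {11, 22, 33, 44}, while add_scalar produces {11, 2, 3, 4}. To see which form a compiler actually emitted for a given loop, one option is to inspect its assembly listing (for example the output of gcc -S) and look for addps versus addss.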