Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

why is this so slow.....

thorsan1
Beginner
568 Views
Hi



I'm experimenting a little with SSE2 instructions and IPP. To add two complex vectors, I have written the following procedure:



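#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps

// Adds two complex vectors stored as separate real and imaginary arrays.
// All pointers must be 16-byte aligned; each array holds 1024 floats.
void AddSSE(const float* ar, const float* ai, const float* br, const float* bi, float* cr, float* ci)
{
    const __m128* aiSSE = (const __m128*)ai;
    const __m128* biSSE = (const __m128*)bi;
    __m128* ciSSE = (__m128*)ci;

    const __m128* arSSE = (const __m128*)ar;
    const __m128* brSSE = (const __m128*)br;
    __m128* crSSE = (__m128*)cr;

    for (int k = 0; k < 256; k++)  // 256 iterations x 4 floats = 1024 elements
    {
        ciSSE[k] = _mm_add_ps(aiSSE[k], biSSE[k]);
        crSSE[k] = _mm_add_ps(arSSE[k], brSSE[k]);
    }
}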



where the vectors ar, ai, ... each have length 1024. Then I compute the same addition using an IPP function:



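#include <ipps.h>  // IPP signal processing: Ipp32fc, ippsAdd_32fc

// Adds two complex vectors stored in interleaved (re, im) format.
// LENGTH is the number of complex elements, defined elsewhere.
void AddIPP(const Ipp32fc* a, const Ipp32fc* b, Ipp32fc* c)
{
    ippsAdd_32fc(a, b, c, LENGTH);
}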



The IPP version runs approximately 10 times faster than the SSE2 version I wrote. What am I doing wrong here? What can I do to speed this up? When computing real-valued adds and muls, I am able to make the SSE code run as fast as IPP, but for the complex ones I am far off.



thanks



thor andreas
4 Replies
Vladimir_Dudnik
Employee
568 Views

Hello,

Why not just use IPP if it is faster?

You may want to unroll the loop and take care that each SIMD instruction processes multiple data elements on each call. Memory access is also better when the data is aligned on a 16-byte boundary, or even on a cache-line boundary.
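
A minimal sketch of what an unrolled loop with explicit aligned loads and stores might look like for the split real/imaginary layout from the question. The helper names `add_ps_unrolled` and `add_complex_split` are hypothetical, and the code assumes 16-byte aligned pointers and a length that is a multiple of 8:

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Hypothetical helper: c[i] = a[i] + b[i] for n floats.
   Assumes n is a multiple of 8 and all pointers are 16-byte aligned. */
static void add_ps_unrolled(const float* a, const float* b, float* c, int n)
{
    assert(n % 8 == 0);
    assert((uintptr_t)a % 16 == 0 && (uintptr_t)b % 16 == 0 && (uintptr_t)c % 16 == 0);

    /* Unrolled by two: each iteration handles 8 floats (two 4-float vectors). */
    for (int k = 0; k < n; k += 8) {
        __m128 a0 = _mm_load_ps(a + k);      /* aligned load of 4 floats */
        __m128 a1 = _mm_load_ps(a + k + 4);
        __m128 b0 = _mm_load_ps(b + k);
        __m128 b1 = _mm_load_ps(b + k + 4);
        _mm_store_ps(c + k,     _mm_add_ps(a0, b0));  /* aligned store */
        _mm_store_ps(c + k + 4, _mm_add_ps(a1, b1));
    }
}

/* Hypothetical helper: complex add on separate real and imaginary arrays. */
static void add_complex_split(const float* ar, const float* ai,
                              const float* br, const float* bi,
                              float* cr, float* ci, int n)
{
    add_ps_unrolled(ar, br, cr, n);
    add_ps_unrolled(ai, bi, ci, n);
}
```

Unrolling alone seems unlikely to account for a tenfold gap in a loop this simple, so it may be worth checking the generated assembly and the timing method as well.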

Regards,
Vladimir

thorsan
Beginner
568 Views
Why not use IPP?

It might not be preferable to have the data stored in memory as Ipp32fc, so I would like to find out how to write such functions so that they run fast. The memory used for the testing was 16-byte aligned, so that shouldn't be a problem. Would prefetching make this run faster? Do you think loop unrolling will make it run 10 times faster? I'm not able to test these new approaches right now since I'm on holiday.
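
For addition specifically, the storage layout may matter less than it seems: an interleaved complex vector of n elements is just 2n consecutive floats, so the same packed add applies. A minimal sketch, assuming a hypothetical struct `cplx32` that mirrors the (re, im) layout of Ipp32fc and a hypothetical function `add_interleaved`; it uses unaligned loads so that it makes no alignment assumption:

```c
#include <assert.h>
#include <xmmintrin.h>

/* Hypothetical stand-in for an interleaved complex type: re followed by im. */
typedef struct { float re; float im; } cplx32;

/* Hypothetical helper: c[i] = a[i] + b[i] for n complex elements.
   Assumes n is even, so that 2*n floats split into whole 4-float vectors. */
static void add_interleaved(const cplx32* a, const cplx32* b, cplx32* c, int n)
{
    assert(n % 2 == 0);

    /* View the interleaved data as flat float arrays of length 2*n. */
    const float* fa = (const float*)a;
    const float* fb = (const float*)b;
    float* fc = (float*)c;

    for (int k = 0; k < 2 * n; k += 4) {
        __m128 va = _mm_loadu_ps(fa + k);   /* re0, im0, re1, im1 */
        __m128 vb = _mm_loadu_ps(fb + k);
        _mm_storeu_ps(fc + k, _mm_add_ps(va, vb));
    }
}
```

Complex multiplication is different, since it mixes real and imaginary parts, and there the split layout avoids shuffles.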

thorsan
thorsan1
Beginner
568 Views
"take care that SIMD instruction process multiple data on each call"

What do you mean by this?

thorsan
(sorry, two different IDs)
Vladimir_Dudnik
Employee
568 Views

Hello, I meant that you need to study the compiler documentation for details about using intrinsics. The compiler can sometimes generate code that processes one data item at a time, whereas it should process four items at a time (SIMD: single instruction, multiple data).
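
To illustrate the difference between one item and four items per instruction, here is a small sketch contrasting the packed intrinsic `_mm_add_ps` with the scalar intrinsic `_mm_add_ss`. The function name `packed_vs_scalar` is hypothetical, and this is only one way single-element processing can show up, not a diagnosis of the code in the question:

```c
#include <xmmintrin.h>

/* Hypothetical demo: adds a and b once with the packed instruction
   and once with the scalar instruction, storing both results. */
static void packed_vs_scalar(const float a[4], const float b[4],
                             float packed[4], float scalar[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    /* Packed add: all four lanes are summed in one instruction. */
    _mm_storeu_ps(packed, _mm_add_ps(va, vb));

    /* Scalar add: only lane 0 is summed; lanes 1..3 are copied from va. */
    _mm_storeu_ps(scalar, _mm_add_ss(va, vb));
}
```

Looking at the compiler's assembly output for packed instructions such as `addps` rather than scalar ones such as `addss` is one way to check which form was generated.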

Regards,
Vladimir
