Intel® Integrated Performance Primitives
Community support and discussions relating to developing high-performance vision, signal, security, and storage applications.

Significant slow down of Intel IPP

caosun
New Contributor I

Hi Experts:

    I think this may be a tough question. I list the code below, where chunks = 94 and fftLen = 8192.

for (int i = 0; i < chunks; i++)
{
    // ...
    ippsAdd_32f(data2, data3 + i * fftLen, data2, fftLen);
}

This piece of code exists in two projects but behaves quite differently. In the first project it takes only about 0.2 ms, but in the second project it takes about 1 ms.

I tried changing the code in the second project to:

for (int i = 0; i < chunks; i++)
{
    // ...
    ippsAdd_32f(data2, data3, data2, fftLen);
}

The elapsed time of the second project then dropped from 1 ms to 0.2 ms.

I can understand that moving the data takes time. But I am confused: why is everything fine in the first project?

I appreciate your expert view on that.

Best Regards,

Sun Cao
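
For reference, here is a minimal self-contained reproducer of the two timings, as a sketch: it assumes an installed IPP (link against ipps and ippcore), fills the buffers with placeholder values, and repeats each variant because clock() is too coarse for a single 0.2 ms pass.

#include <stdio.h>
#include <time.h>
#include <ipp.h>

int main(void)
{
    const int chunks = 94, fftLen = 8192, reps = 100;

    /* ippsMalloc_32f returns 64-byte-aligned buffers. */
    Ipp32f *data2 = ippsMalloc_32f(fftLen);
    Ipp32f *data3 = ippsMalloc_32f(chunks * fftLen);
    ippsSet_32f(0.0f, data2, fftLen);
    ippsSet_32f(1.0f, data3, chunks * fftLen);

    /* Variant 1: stride through all of data3 (~3 MB working set). */
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        for (int i = 0; i < chunks; i++)
            ippsAdd_32f(data2, data3 + i * fftLen, data2, fftLen);
    clock_t t1 = clock();

    /* Variant 2: reuse the same 32 KB chunk of data3 every time. */
    for (int r = 0; r < reps; r++)
        for (int i = 0; i < chunks; i++)
            ippsAdd_32f(data2, data3, data2, fftLen);
    clock_t t2 = clock();

    printf("strided: %.3f ms/pass, fixed: %.3f ms/pass\n",
           1e3 * (double)(t1 - t0) / CLOCKS_PER_SEC / reps,
           1e3 * (double)(t2 - t1) / CLOCKS_PER_SEC / reps);

    ippsFree(data3);
    ippsFree(data2);
    return 0;
}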

Igor_A_Intel
Employee

My guess is that the order of calculations and the data flow differ between these two projects. The performance of Add (or of any other function) depends heavily on data locality: L0, MLC, or LLC. So I think that in the first case data3 is closer to L0 than in the second. When you remove the traversal of data3, then from the 2nd iteration on all the data is in L0, and you therefore get ideal performance.

Use the following numbers for a rough estimate: load latency for data in L0 (32 KB) is 4-5 clocks; in the MLC (256 KB), 10-12 clocks; in the LLC (2 MB per core), 25-36 clocks; and the LLC-miss penalty is about 200 clocks.

regards, Igor
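
A quick sanity check of these numbers against the sizes in the question: one chunk is 8192 × 4 bytes = 32 KB, so the whole data3 array is 94 × 32 KB ≈ 2.94 MB, which does not fit in a 2 MB per-core LLC slice. The strided variant therefore has to pull data3 from the LLC or memory on every pass, while the fixed variant touches only data2 plus one chunk of data3 (64 KB in total), which stays in the near caches after the first iteration; that is roughly consistent with the 5x gap reported above.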

caosun
New Contributor I

Hi Igor:

    I can understand that data locality will affect performance. But that cannot explain why the first project is fine.

Best Regards,

Sun Cao

SergeyKostrov
Valued Contributor II
>>...I can understand that data locality will affect performance. But that cannot explain why the first project is fine.

Here are a couple of pieces of advice:

1. Check the alignment of your data.
2. Check your project settings.
3. You don't take into account the time spent calculating the offset:

[ 1 ms ]   ippsAdd_32f( data2, data3 + i*fftLen, data2, fftLen );
[ 0.2 ms ] ippsAdd_32f( data2, data3, data2, fftLen );

Also, try to declare a local fftLen variable as close as possible to the call to ippsAdd_32f, or use the constant 8192 instead (if fftLen is not changing).
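
A minimal sketch of point 3, assuming the same data2, data3, chunks, and fftLen as in the question: advance a source pointer instead of recomputing data3 + i*fftLen on every iteration.

/* Walk a pointer through data3 instead of recomputing the
   offset data3 + i*fftLen on every iteration. */
const Ipp32f *src = data3;
for (int i = 0; i < chunks; i++, src += fftLen)
    ippsAdd_32f(data2, src, data2, fftLen);

The per-iteration multiply is cheap compared with the cache effects described above, so this is a micro-optimization rather than an explanation of the 5x gap.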
Igor_A_Intel
Employee

Hi Sun Cao,

As I've already said above, I guess that these two projects are different, with a different order of calculations and different data flows, so some other data evicts the data3 vector from the cache. There is not enough information to give you a different answer. The only way to analyze this more deeply is for you to provide a reproducer that shows the two timings with the 5x difference. Which IPP library do you use: threaded or not?

regards, Igor

SergeyKostrov
Valued Contributor II
>>... The only way to analyze this more deeply is for you to provide a reproducer that shows the two timings with the 5x difference...

Sun Cao, do you have VTune Amplifier XE? If yes, run a Hotspots analysis, or another analysis type, to understand what could be wrong.