Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

Significant slow down of Intel IPP

caosun
New Contributor I
1,599 Views

Hi Experts:

    I think this is a very tough question. The code is listed below, where chunks=94 and fftLen=8192.

for(Int i=0; i<chunks; i++)
{ ......
ippsAdd_32f(data2, data3+i*fftLen, data2, fftLen);
}

This piece of code exists in two projects, but it behaves quite differently in each. In the first project it costs only about 0.2ms, but in the second project it costs about 1ms.

I tried changing the code in the second project to:

for(Int i=0; i<chunks; i++)
{ .......
ippsAdd_32f(data2, data3, data2, fftLen);
}

Then the elapsed time in the second project dropped from 1ms to 0.2ms.

I can understand that moving the data takes time. But I am confused about why everything is fine in the first project.

I appreciate your expert view on that.

Best Regards,

Sun Cao

0 Kudos
6 Replies
Igor_A_Intel
Employee
1,599 Views

I guess the order of calculations and the data flow are different in these 2 projects. The performance of Add (or any other function) depends heavily on data locality: whether the data sits in L0 (the first-level data cache), MLC (mid-level cache) or LLC (last-level cache). So I think in the first case data3 is closer to L0 than in the second. When you remove the traversal through data3, then starting from the 2nd iteration you have all the data in L0 and therefore ideal performance.

Use the following numbers for a rough estimate: load latency for data in L0 (32K) is 4-5 clocks; MLC (256K), 10-12 clocks; LLC (2M per core), 25-36 clocks; and an LLC miss costs about 200 clocks.
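
As a rough illustration of those sizes, here is a minimal sketch that computes how many bytes the loop in the question touches and which of the quoted cache sizes could hold them. The helper names (float_bytes, smallest_fit) and the enum are hypothetical, the cache sizes are simply the ones quoted above, and a 4-byte float is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cache sizes as quoted in the post; real sizes depend on the CPU. */
#define L0_BYTES  ((size_t)32 * 1024)
#define MLC_BYTES ((size_t)256 * 1024)
#define LLC_BYTES ((size_t)2 * 1024 * 1024)

enum level { IN_L0, IN_MLC, IN_LLC, IN_MEMORY };

/* Bytes occupied by `count` single-precision floats. */
static size_t float_bytes(size_t count) {
    assert(count <= SIZE_MAX / sizeof(float)); /* guard against overflow */
    return count * sizeof(float);
}

/* Smallest level (by the sizes above) that is large enough to hold `bytes`. */
static enum level smallest_fit(size_t bytes) {
    if (bytes <= L0_BYTES)  return IN_L0;
    if (bytes <= MLC_BYTES) return IN_MLC;
    if (bytes <= LLC_BYTES) return IN_LLC;
    return IN_MEMORY;
}
```

With chunks=94 and fftLen=8192, float_bytes(8192) is 32 KiB for data2 alone, data2 plus one chunk of data3 is 64 KiB, and the whole of data3 is 94 * 32 KiB, about 2.9 MiB, which is more than the 2M of LLC per core quoted above. This only compares sizes; it says nothing about what else each project keeps in cache.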

regards, Igor

0 Kudos
caosun
New Contributor I
1,599 Views

Hi Igor:

    I can understand the data locality will affect the performance. But it can not explain why the first project is fine.

Best Regards,

Sun Cao

0 Kudos
SergeyKostrov
Valued Contributor II
1,599 Views
>>...I can understand the data locality will affect the performance. But it can not explain why the first project is fine.

Here are a few pieces of advice:

1. Check the alignment of your data.
2. Check the project settings.
3. You don't take into account the time needed to calculate the offset:

...
[ 1ms ] ippsAdd_32f( data2, data3+i*fftLen, data2, fftLen );
...
[ 0.2ms ] ippsAdd_32f( data2, data3, data2, fftLen );
...

Also try to declare a local fftLen variable as close as possible to the call to ippsAdd_32f, or use the constant 8192 instead ( if fftLen is not changing ).
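
For the alignment check in point 1, a minimal sketch could look like the following. The helper names (is_aligned, chunk_offset_bytes) are hypothetical, and the 64-byte boundary in the usage note is an assumption about a typical cache line, not something stated in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p is aligned to `alignment` bytes.
   `alignment` must be a nonzero power of two. */
static int is_aligned(const void *p, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Byte offset of chunk i from the start of data3, for chunks of fftLen floats. */
static size_t chunk_offset_bytes(size_t i, size_t fftLen) {
    return i * fftLen * sizeof(float);
}
```

Calling is_aligned(data2, 64) and is_aligned(data3, 64) in both projects would show whether the buffers differ. With fftLen=8192 and 4-byte floats, each chunk starts 32768 bytes after the previous one, which is a multiple of 64, so every data3+i*fftLen has the same 64-byte alignment as data3 itself.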
0 Kudos
Igor_A_Intel
Employee
1,599 Views

Hi Sun Cao,

As I've already said above, I guess that these 2 projects are different and have a different order of calculations and different data flows, so some other data evicts the data3 vector from the cache. There is not enough information to give you another answer. The only possible way to do a deeper analysis is for you to provide a reproducer that shows the 2 timings with the 5x difference. Which IPP library do you use, threaded or not?
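
A reproducer of that kind might start from a sketch like this one. It is only an outline under stated assumptions: add_32f is a hypothetical plain-C stand-in for ippsAdd_32f (so that the sketch builds without IPP), time_accumulate is a hypothetical helper, and clock() measures processor time with limited resolution.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for ippsAdd_32f: dst[k] = src1[k] + src2[k]. */
static void add_32f(const float *src1, const float *src2, float *dst, size_t len) {
    for (size_t k = 0; k < len; k++)
        dst[k] = src1[k] + src2[k];
}

/* Accumulates `chunks` blocks of data3 into data2 and returns the
   processor time spent, in seconds.
   stride == fftLen walks through data3 (the 1ms case in the question);
   stride == 0 re-reads the first block each time (the 0.2ms case). */
static double time_accumulate(float *data2, const float *data3,
                              size_t chunks, size_t fftLen, size_t stride) {
    assert(data2 != NULL && data3 != NULL);
    clock_t start = clock();
    for (size_t i = 0; i < chunks; i++)
        add_32f(data2, data3 + i * stride, data2, fftLen);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Since the timings in question are well under a millisecond, a single call is likely too short to measure reliably with clock(); repeating the call many times and averaging, and running whatever code precedes the loop in each project beforehand, would bring the comparison closer to the two real cases.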

regards, Igor

0 Kudos
SergeyKostrov
Valued Contributor II
1,599 Views
>>... The only possible way for more deep analysis is to provide a reproducible that shows 2 performances with 5x difference...

Sun Cao, do you have VTune Amplifier XE? If yes, run a Hotspots analysis, or another one, to understand what could be wrong.
0 Kudos