Hi Experts:
I think this is a very tough question. The code is listed below, where chunks=94 and fftLen=8192.
for(Int i=0; i<chunks; i++)
{
    ......
    ippsAdd_32f(data2, data3+i*fftLen, data2, fftLen);
}
This piece of code exists in two projects, but it behaves quite differently in each. In the first project it takes only about 0.2ms, but in the second project it takes about 1ms.
I tried changing the code in the second project to:
for(Int i=0; i<chunks; i++)
{
    .......
    ippsAdd_32f(data2, data3, data2, fftLen);
}
Then the elapsed time in the second project dropped from 1ms to 0.2ms.
I can understand that moving the data takes time, but I am confused about why everything is fine in the first project.
I appreciate your expert view on that.
Best Regards,
Sun Cao
My guess is that the order of calculations and the data flow differ between these 2 projects. The performance of Add (or any other function) depends heavily on data locality: L0, MLC or LLC. So I think that in the first project data3 is closer to L0 than in the second. When you stop stepping through data3, all the data is in L0 from the 2nd iteration onward, and you therefore get ideal performance.
Use the following numbers for a rough estimate: load latency for data in L0 (32K) is 4-5 clocks; MLC (256K), 10-12 clocks; LLC (2M per core), 25-36 clocks; and 200 clocks for the LLC miss penalty.
regards, Igor
Hi Igor:
I can understand that data locality affects performance. But that cannot explain why the first project is fine.
Best Regards,
Sun Cao
Hi Sun Cao,
As I already said above, my guess is that these 2 projects differ and have a different order of calculations and different data flows, so some other data evicts the data3 vector from the cache. There is not enough information to give you another answer. The only way to analyze this more deeply is for you to provide a reproducer that shows the 2 timings with the 5x difference. Which IPP library do you use, threaded or not?
regards, Igor