Hi Experts:
I think this may be a very tough question. I list the code below, where chunks = 94 and fftLen = 8192.
for (int i = 0; i < chunks; i++)
{ ......
    ippsAdd_32f(data2, data3 + i * fftLen, data2, fftLen);
}
This piece of code exists in two projects but shows quite different behavior. In the first project it only costs about 0.2 ms, but in the second project it costs about 1 ms.
I tried changing the code in the second project to:
for (int i = 0; i < chunks; i++)
{ .......
    ippsAdd_32f(data2, data3, data2, fftLen);
}
Then the elapsed time in the second project dropped from 1 ms to 0.2 ms.
I can understand that moving through the data takes time, but I am confused about why everything is fine in the first project.
I appreciate your expert view on that.
Best Regards,
Sun Cao
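For reference, a self-contained version of the two timed variants can be sketched as below. The buffer allocation and initialization with ippsMalloc_32f/ippsSet_32f are assumptions filled in for the elided project code, and the timing assumes a single-threaded run with std::chrono.

#include <ipps.h>
#include <chrono>
#include <cstdio>

int main()
{
    const int chunks = 94;
    const int fftLen = 8192;

    // Assumed allocation/initialization; the real project fills these buffers elsewhere.
    Ipp32f* data2 = ippsMalloc_32f(fftLen);
    Ipp32f* data3 = ippsMalloc_32f(chunks * fftLen);
    ippsSet_32f(0.0f, data2, fftLen);
    ippsSet_32f(1.0f, data3, chunks * fftLen);

    // Variant A: accumulate while walking through all of data3, as in the original loop.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < chunks; i++)
        ippsAdd_32f(data2, data3 + i * fftLen, data2, fftLen);
    auto t1 = std::chrono::steady_clock::now();

    // Variant B: reuse only the first fftLen-sized block of data3, as in the modified loop.
    for (int i = 0; i < chunks; i++)
        ippsAdd_32f(data2, data3, data2, fftLen);
    auto t2 = std::chrono::steady_clock::now();

    std::printf("walk data3 : %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    std::printf("reuse block: %.3f ms\n",
                std::chrono::duration<double, std::milli>(t2 - t1).count());

    ippsFree(data2);
    ippsFree(data3);
    return 0;
}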
I guess the order of calculations and the data flow are different in these two projects. The performance of Add (or any other function) depends heavily on data locality, that is, whether the data sits in L0, MLC or LLC. So I think that in the first case data3 is closer to L0 than in the second. When you remove the traversal through data3, then starting from the 2nd iteration all the data is in L0, and therefore you get ideal performance.
Use the following numbers for a rough estimate: load latency for data in L0 (32 KB) - 4-5 clocks; MLC (256 KB) - 10-12 clocks; LLC (2 MB per core) - 25-36 clocks; about 200 clocks for an LLC miss penalty.
regards, Igor
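For a rough sanity check of that explanation, the working-set sizes can be computed directly (a sketch assuming 4-byte floats and the cache sizes quoted above; the real hierarchy depends on the CPU): one fftLen chunk of data3 is 32 KB, while the whole data3 vector is about 2.9 MB, so the accumulating loop must stream it from beyond a 2 MB LLC slice on every pass, whereas the modified loop keeps its roughly 64 KB working set (one chunk plus data2) in the near caches.

#include <cstdio>

int main()
{
    const int chunks = 94, fftLen = 8192;

    // Per-iteration footprint when the same block of data3 is reused (data2 adds the same amount again).
    double oneChunkKB = fftLen * sizeof(float) / 1024.0;
    // Total footprint when the loop walks through all of data3.
    double allChunksMB = (double)chunks * fftLen * sizeof(float) / (1024.0 * 1024.0);

    std::printf("one chunk of data3 : %.0f KB\n", oneChunkKB);   // 32 KB
    std::printf("whole data3 vector : %.2f MB\n", allChunksMB);  // ~2.94 MB, larger than a 2 MB LLC slice
    return 0;
}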
Hi Igor:
I can understand that data locality affects performance, but that does not explain why the first project is fine.
Best Regards,
Sun Cao
Hi Sun Cao,
As I've already said above, I guess that these two projects are different and have a different order of calculations and different data flows, so some other data evicts the data3 vector from the cache. There is not enough information to give you a better answer. The only way to do a deeper analysis is for you to provide a reproducer that shows the two runs with the 5x performance difference. Which IPP library do you use - threaded or not?
regards, Igor
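One way to check which IPP flavor a project actually links is sketched below (the exact name and version strings vary between IPP releases, and ippGetNumThreads is only meaningful for the internally threaded libraries and was deprecated in later versions):

#include <ipps.h>
#include <ippcore.h>
#include <cstdio>

int main()
{
    // Reports which signal-processing library (and CPU-specific code path) is in use.
    const IppLibraryVersion* v = ippsGetLibVersion();
    std::printf("IPP signal library: %s %s\n", v->Name, v->Version);

    // A thread count greater than 1 indicates the internally threaded IPP libraries.
    int threads = 0;
    if (ippGetNumThreads(&threads) == ippStsNoErr)
        std::printf("IPP internal threading: %d thread(s)\n", threads);
    return 0;
}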