The purpose of the article was to highlight the shared nature of the cache hierarchy in processors that support Hyper-Threading Technology. With two threads sharing the same cache hierarchy, the effective cache available to each logical processor is reduced. An application that uses cache blocking should detect processors supporting Hyper-Threading Technology and reduce the block size appropriately.
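On IA-32, one way to detect Hyper-Threading support is the HTT flag returned by CPUID (leaf 1, EDX bit 28). A minimal sketch using GCC/Clang's `<cpuid.h>` follows; note the caveat in the comment that the flag only indicates the capability, not whether Hyper-Threading is actually enabled:

```c
#include <cpuid.h>

/* Returns 1 if the processor reports the HTT capability flag
   (CPUID leaf 1, EDX bit 28), 0 otherwise. This flag only says
   the package *can* expose multiple logical processors; whether
   Hyper-Threading is enabled (and how many logical processors
   share a cache) must be determined separately, e.g. from the
   logical-processor count or from the operating system. */
int hyperthreading_capable(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;  /* leaf 1 not supported */
    return (edx >> 28) & 1;
}
```

In practice you would combine this with an OS query (or further CPUID leaves) to learn the enabled logical-processor topology before adjusting block sizes.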
As a general guideline to start from, I've recommended that cache blocking techniques target ~50% of the cache size for processors without Hyper-Threading Technology enabled. If 50% was a reasonable block size without Hyper-Threading, then running the same application with two threads on a Hyper-Threading enabled processor should target ~25-35% of the cache size. The optimal cache block size is highly application dependent and is also significantly influenced by other processes that may be running.
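As a rough illustration of those numbers, a block-size helper might look like the following. The function name, the choice of 30% as the midpoint of the 25-35% range, and the assumption that exactly two threads share the cache are mine, for illustration only:

```c
#include <stddef.h>

/* Hypothetical helper: pick a cache-blocking target from the
   total cache size. Targets ~50% of the cache when the cache is
   not shared, and ~30% (the middle of the 25-35% range) when two
   Hyper-Threading logical processors share it. */
static size_t choose_block_bytes(size_t cache_bytes, int ht_enabled)
{
    return ht_enabled ? cache_bytes * 30 / 100   /* ~30%, shared cache */
                      : cache_bytes * 50 / 100;  /* ~50%, dedicated cache */
}
```

For a 512 KB L2, this gives a ~256 KB block without Hyper-Threading and a ~157 KB block with it; treat either as a starting point for measurement, not a final answer.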
Certainly, set-associativity plays a part in both L2 and L3 cache behavior and performance. There are cases where you can effectively increase (or inadvertently decrease) cache performance by using knowledge of the set-associativity and fine-tuning the application's access patterns. Unfortunately, I don't have any specific data on this.
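To see why access patterns interact with set-associativity, consider how an address maps to a cache set. The geometry below (512 KiB, 8-way, 64-byte lines) is an assumed example, not a statement about any particular processor; the point is that accesses strided by cache_size / ways all land in the same set, so such a pattern can thrash a handful of lines while leaving the rest of the cache idle:

```c
/* Set index for an assumed 512 KiB, 8-way cache with 64-byte
   lines: 512 KiB / (64 B * 8 ways) = 1024 sets. Addresses that
   differ by exactly 64 KiB (= cache_size / ways) map to the same
   set, so a 64 KiB-strided access pattern touches only the 8
   lines of one set and evicts itself repeatedly. */
unsigned int set_index(unsigned long addr)
{
    const unsigned long line_bytes = 64;
    const unsigned long num_sets   = (512 * 1024) / (64 * 8);
    return (unsigned int)((addr / line_bytes) % num_sets);
}
```

Padding arrays so that hot data structures don't sit at power-of-two strides is the usual way to avoid this kind of self-conflict.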
By extension, the cache blocking technique can be applied to the L3 cache instead of the L2, but this is again application dependent. Beware that applying cache blocking to the L3 cache can run into other performance bottlenecks. For example, the number of entries in the DTLB may also limit the effective size of the block by causing DTLB misses if the block size is too large. While this isn't as likely with a 512 KB L2 cache, it can be an issue with the larger cache sizes found in L3 caches.
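The DTLB constraint can be expressed as a second upper bound on the useful block size: a block larger than entries × page size cannot be fully mapped by the DTLB, so walking it guarantees TLB misses. The 64-entry / 4 KiB numbers below are assumptions for illustration, not the parameters of any specific processor:

```c
#include <stddef.h>

/* Upper bound on the data a DTLB can map at once. With an
   assumed 64-entry DTLB and 4 KiB pages this is 256 KiB: equal
   to the ~50% target for a 512 KB L2, but far below ~50% of a
   multi-megabyte L3, which is why L3 blocking can become
   DTLB-limited. The effective block size should be the smaller
   of the cache-based target and this coverage. */
static size_t dtlb_coverage_bytes(size_t dtlb_entries, size_t page_bytes)
{
    return dtlb_entries * page_bytes;
}
```

Large pages, where available, raise this bound substantially and are worth considering when blocking for an L3.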