I've been very busy reading books from Intel Press and deciding on the components for my new TBB development system.
Something that I feel is missing from TBB is an equivalent to limits.h: something that would contain information about the sizes of the L1 and L2 caches, the number of processors, and other information for compile-time tuning. These values could be used in template programming or pre-processor directives to automagically optimize data structures. I can easily envision template libraries that fine-tune themselves at compile time based on processor capabilities.
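To make the idea concrete, here is a minimal sketch of what such a "limits.h for TBB" might look like. Everything here is invented for illustration: the `hw_limits` namespace, its names, and its values are assumptions, not an actual TBB API, and the constants would in reality be generated per platform.

```cpp
#include <cstddef>

// Hypothetical compile-time hardware limits (names and values assumed).
namespace hw_limits {
    constexpr std::size_t l1_cache_size   = 32 * 1024;   // assumed 32 KiB L1
    constexpr std::size_t l2_cache_size   = 256 * 1024;  // assumed 256 KiB L2
    constexpr std::size_t cache_line_size = 64;          // assumed 64-byte lines
}

// A template library could then pick a block size at compile time,
// e.g. how many elements of T fit into one L1-sized block:
template <typename T>
constexpr std::size_t elems_per_l1_block() {
    return hw_limits::l1_cache_size / sizeof(T);
}

// 32 KiB / 8 bytes per double = 4096 elements per block.
static_assert(elems_per_l1_block<double>() == 4096,
              "block size derived from the assumed L1 size");
```

A data structure parameterized this way would retune itself on recompilation for a different target, which is exactly the compile-time adaptation described above.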
Equivalently, something that is available at runtime could be very useful. I've read about information obtainable via CPUID in the Software Optimization Cookbook. This could be a useful structure for similar optimizations, but performed at runtime instead. This way a single binary could be moved around different systems, but still be capable of best utilizing resources available to it.
The runtime approach would probably be more useful than a fixed limits.h-style approach; otherwise you'd need to compile your program on every machine you deploy it on in order to tune for that specific architecture. But adaptability is certainly a concern of ours as the panoply of available architectures and configurations grows ever broader.
And to a certain degree, TBB already addresses that. Want to know the number of processors? Use tbb::task_scheduler_init::default_num_threads(). Want to tune your code to fit within some particular cache size? You don't really need to know the cache size itself as long as you can use blocked_ranges that are splittable and whose minimum size/granularity is small enough to fit within your innermost cache: the natural recursive subdivision of the parallel_for, combined with the task-stealing behavior of the TBB scheduler, will enforce the motto "Work depth first; steal breadth first." If your algorithm cannot fit into the strictures of a blocked_range, keep to the principle of splittable work that localizes to smaller address ranges, and fitting into the available cache, regardless of the specific architecture, should be a natural outcome. If all this fails, you may need to resort to something like CPUID. There are snippets of code out there that handle the task for specific processors and operating systems, but they lie outside the scope of this forum.
Currently I am experimenting with the design of new data structures for my simulation framework, and it seems to me that knowing the cache size at compile time or runtime will allow me to ensure my data structures are allocated properly to best use the cache.
We want to keep the TBB design cache-oblivious: we care about cache locality but do not worry about particular cache sizes. I can only think of one TBB algorithm where knowing the cache size could reap some benefit: in pipeline, the number of simultaneously processed tokens can be limited so that they all fit into the available cache. TBB cannot do this itself, because the average token size must be known, but a user can. On the other hand, providing a public API call for cache size detection would hamper TBB portability to some extent; the TBB team at Intel could certainly do it right for Intel processors, but who would do that for other platforms?
How do you think you would use the cache size info if you had it at runtime?
So far as a public API, I think this kinda thing would be best placed into tbb::internal. It's more the kinda thing that extensions to TBB could use for implementations. I see this information being used by other data structures that are provided to the user, so for instance a tbb::concurrent_set might need that information for efficient implementation.
For a motivating use case, I'm thinking about the design of data structures which could use the cache line size for performance and not just to prevent false sharing. I'm still thinking on it. I thought this would already be in TBB somewhere, since the cache_aligned_allocator must know the size of the cache lines to avoid false sharing. I'm not entirely sure how I would use this information, still thinking on it. I kinda envision a data structure where I use cache lines directly... still thinking on how this would work, and still learning :-)
In "The Software Optimization Cookbook" the cache line size is used to improve the layout of a struct. This is definitely a non-portable solution, and somewhere that I thought having a limits.h style file could help to layout the struct with C++ templates or pre-processor directives.
Well, for cache line size it could be simpler, because it can be rather safely approximated by a constant.
In fact, for the moment TBB uses a constant set to 128 (bytes) for cache_aligned_allocator. The setting is good enough because it is not less than the actual cache line size on our commercially supported platforms, and thus the padding is sufficient, though excessive.
This could be changed to CPUID-based detection for Intel processors, and left as is for other hardware. However, even such a small change would have some impact at runtime, whereas for the moment it's fully compile-time. So it's still a trade-off, and there should be some evidence that the change brings an improvement in some important places (and where it is less important or does not improve things, we could still use the constant).
static size_t NFS_LineSize = 128;
(Added after next post) Also see "Assert in initMemoryManager under PPC64" of a month and a half ago.
Realistically, this is a thought that I had but I don't have something that requires this in my hands right now. In my mind, I see this as an opportunity to do template programming with cache information to really customize data structures to use the processor in the best possible manner at runtime.
I thought that I could do something like create a data structure that is composed of an array of structures like this:
where X+Y = cache line size. I might have some information that the algorithm uses in the header_data field, and actual data in the data field. For instance, the header_data might describe the data field in some way for a traversal. Ideally, I could use template programming to generate these cache line sized chunks at compile time to best use the cache. I'm thinking that this kinda system could be used to automatically construct the low-level details of a data structure efficiently without having to worry about cache size changes. This way a simple re-compile on the target architecture gets you the best performance.
I'm thinking along the lines of building algorithms that use smaller structures fitting into a cache line, rather than larger structures distributed around in memory. Traversals using this method could also benefit from hardware memory pre-fetching.
I don't have a solid use case right now, I'm still in the very hard process of figuring out the algorithms that would actually use this. I think I could use this piece of information to build a parallel tree without pointers. Still thinking.
Alternatively, when you add management information that must be handled dynamically, you add administrative overhead, which you may or may not be able to amortize against any performance gained by fitting a particular machine's cache line.
What seems to me to be less overhead, with the same attention to locality, is to try to arrange your data and write your algorithms so that they fit within the smallest cache line you're likely to encounter (probably 32 bytes), and then rely on decimation schemes like blocked_ranges in parallel_for to gang blocks of cache lines together, mostly to reduce locality management overhead. Once you're beyond the size of a cache line or two, there's no general advantage to having adjacent cache lines processed by the same HW thread. There may be algorithmic cases where data relationships mimic memory locality (like octree processing) that might further gain from such localities, but these are not particularly amenable to a general solution.
And that is the bottom line: cache-line fitting has more to do with the data relationships of a particular algorithm than it does with anything else. Careful blocking is a requirement in DGEMM, the BLAS double precision matrix multiply implemented in Intel Math Kernel Library and other places. And I know of at least one physics code where each element of the principal array takes over two cache lines to hold, even though various kernels processing it may only use a few fields in each element. It would be a wonderful performance boost to refactor it, but the cost of rewriting millions of lines of legacy code is prohibitive.