The real-time approach would probably be more useful than a fixed limits.h-style approach; otherwise you'd need to compile your program on every machine you deploy it to in order to tune for that specific architecture. But adaptability is certainly a concern of ours as the panoply of available architectures and configurations grows ever broader.
And to a certain degree, TBB already addresses that. Want to know the number of processors? Use tbb::task_scheduler_init::default_num_threads(). Want to tune your code to fit within some particular cache size? You don't really need to know the cache size itself as long as you can use blocked_ranges that are splittable and whose minimum size/granularity is small enough to fit within your innermost cache: the natural recursive subdivision of parallel_for, combined with the task-stealing behavior of the TBB scheduler, will enforce the motto "Work depth first; steal breadth first." If your algorithm cannot fit into the strictures of a blocked_range, keep to the principle of splittable work that localizes to smaller address ranges, and fitting into the available cache regardless of the specific architecture should be a natural outcome. If all this fails, you may need to resort to something like CPUID. There are snippets of code out there that handle the task for specific processors and operating systems, but those lie outside the scope of this forum.
We want to keep the TBB design cache-oblivious: we care about cache locality but do not worry about particular cache sizes. I can think of only one TBB algorithm where knowing the cache size could reap some benefit: in pipeline, the number of simultaneously processed tokens can be limited so that they all fit into the available cache. It's not something TBB can do, because the average token size would have to be known, but a user can. On the other hand, providing a public API call for cache size detection would hamper TBB portability to some extent; the TBB team at Intel can certainly do it right for Intel processors, but who would do that for other platforms?
How do you think you would use the cache size info if you had it at runtime?
Well, for cache line size it could be simpler, because it can be rather safely approximated by a constant.
In fact, for the moment TBB uses a constant set to 128 bytes for cache_aligned_allocator. The setting is good enough because it is not less than the actual cache line size on any of our commercially supported platforms, and thus the padding is sufficient, though excessive.
This could be changed to CPUID-based detection for Intel processors, and left as is for other hardware. However, even such a small change would have some impact at runtime, whereas for the moment it is fully compile-time - so it's still a trade-off, and there should be some evidence that the change would yield an improvement in some important places (and where it matters less or does not improve things, we could still use the constant).
Alternatively, consider that managing this information dynamically adds administrative overhead, which you may or may not be able to amortize against whatever performance is gained by fitting a particular machine's cache line.
What seems to me to involve less overhead, with the same attention to locality, is to arrange your data and write your algorithms so that they fit within the smallest cache line you're likely to encounter (probably 32 bytes), and then rely on decimation schemes like blocked ranges in parallel_for to gang blocks of cache lines together, mostly to reduce locality-management overhead. Once you're beyond the size of a cache line or two, there's no general advantage to having adjacent cache lines processed by the same HW thread. There may be algorithmic cases where data relationships mimic memory locality (like octree processing) that could gain further from such localities, but these are not particularly amenable to a general solution.
And that is the bottom line: cache-line fitting has more to do with the data relationships of a particular algorithm than it does with anything else. Careful blocking is a requirement in DGEMM, the BLAS double precision matrix multiply implemented in Intel Math Kernel Library and other places. And I know of at least one physics code where each element of the principal array takes over two cache lines to hold, even though various kernels processing it may only use a few fields in each element. It would be a wonderful performance boost to refactor it, but the cost of rewriting millions of lines of legacy code is prohibitive.