I've been very busy reading books from Intel Press and deciding on the components for my new TBB development system.
Something that I feel is missing from TBB is an equivalent to limits.h: something that would contain information about the sizes of the L1 and L2 caches, the number of processors, and other information for compile-time tuning. These values could be used in template programming or pre-processor directives to automagically optimize data structures. I can easily envision template libraries that fine-tune themselves at compile time based on processor capabilities.
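To make the idea concrete, here is a minimal sketch of what such a "limits.h for TBB" might look like. Everything here is invented for illustration: the `hw_limits` namespace, its names, and its values are assumptions, not an actual TBB API, and the constants would in reality be generated per platform.

```cpp
#include <cstddef>

// Hypothetical compile-time hardware limits (names and values assumed).
namespace hw_limits {
    constexpr std::size_t l1_cache_size   = 32 * 1024;   // assumed 32 KiB L1
    constexpr std::size_t l2_cache_size   = 256 * 1024;  // assumed 256 KiB L2
    constexpr std::size_t cache_line_size = 64;          // assumed 64-byte lines
}

// A template library could then pick a block size at compile time,
// e.g. how many elements of T fit into one L1-sized block:
template <typename T>
constexpr std::size_t elems_per_l1_block() {
    return hw_limits::l1_cache_size / sizeof(T);
}

// 32 KiB / 8 bytes per double = 4096 elements per block.
static_assert(elems_per_l1_block<double>() == 4096,
              "block size derived from the assumed L1 size");
```

A data structure parameterized this way would retune itself on recompilation for a different target, which is exactly the compile-time adaptation described above.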
Equivalently, something that is available at runtime could be very useful. I've read about information obtainable via CPUID in the Software Optimization Cookbook. This could be a useful structure for similar optimizations, but performed at runtime instead. This way a single binary could be moved around different systems, but still be capable of best utilizing resources available to it.
The runtime approach would probably be more useful than a fixed limits.h-style approach; otherwise you'd need to compile your program on every machine you deploy it on in order to tune for that specific architecture. But adaptability is certainly a concern of ours as the panoply of available architectures and configurations grows ever broader.
And to a certain degree, TBB already addresses that. Want to know the number of processors? Use tbb::task_scheduler_init::default_num_threads(). Want to tune your code to fit within some particular cache size? You don't really need to know the cache size itself as long as you can use blocked_ranges that are splittable and whose minimum size/granularity is small enough to fit within your innermost cache: the natural recursive subdivision of the parallel_for, combined with the task-stealing behavior of the TBB scheduler, will enforce the motto "Work depth first; steal breadth first." If your algorithm cannot fit into the strictures of a blocked_range, keep to the principle of splittable work that localizes to smaller address ranges, and fitting into the available cache, regardless of the specific architecture, should be a natural outcome. If all this fails, you may need to resort to something like CPUID. There are snippets of code out there that handle the task for specific processors and operating systems, but they lie outside the scope of this forum.
Currently I am experimenting with the design of new data structures for my simulation framework, and it seems to me that knowing the cache size at compile time or runtime will allow me to ensure my data structures are allocated properly to best use the cache.
We want to keep the TBB design cache-oblivious: we care about cache locality but do not worry about particular cache sizes. I can only think of one TBB algorithm where knowing the cache size could reap some benefit: in pipeline, the number of simultaneously processed tokens can be limited so that they all fit into the available cache. TBB cannot do this itself, because the average token size must be known, but a user can. On the other hand, providing a public API call for cache size detection would hamper TBB portability to some extent; the TBB team at Intel could certainly do it right for Intel processors, but who would do that for other platforms?
How do you think you would use the cache size info if you had it at runtime?
So far as a public API, I think this kinda thing would be best placed into tbb::internal. It's more the kinda thing that extensions to TBB could use for implementations. I see this information being used by other data structures that are provided to the user, so for instance a tbb::concurrent_set might need that information for efficient implementation.
For a motivating use case, I'm thinking about the design of data structures which could use the cache line size for performance and not just to prevent false sharing. I'm still thinking on it. I thought this would already be in TBB somewhere, since the cache_aligned_allocator must know the size of the cache lines to avoid false sharing. I'm not entirely sure how I would use this information, still thinking on it. I kinda envision a data structure where I use cache lines directly... still thinking on how this would work, and still learning :-)
In "The Software Optimization Cookbook" the cache line size is used to improve the layout of a struct. This is definitely a non-portable solution, and somewhere that I thought having a limits.h style file could help to layout the struct with C++ templates or pre-processor directives.
Well, for cache line size it could be simpler, because it can be rather safely approximated by a constant.
In fact, for the moment TBB uses a constant set to 128 (bytes) for cache_aligned_allocator. The setting is good enough because it is not less than the actual cache line size on our commercially supported platforms, and thus the padding is sufficient, though excessive.
This could be changed to CPUID-based detection for Intel processors, and left as is for other hardware. However, even such a small change would have some impact at runtime, whereas for the moment it's fully compile-time. So it's still a trade-off, and there should be some evidence that the change brings an improvement in some important places (and where it is less important or does not improve things, we could still use the constant).
static size_t NFS_LineSize = 128;
(Added after next post) Also see "Assert in initMemoryManager under PPC64" of a month and a half ago.
Realistically, this is a thought that I had but I don't have something that requires this in my hands right now. In my mind, I see this as an opportunity to do template programming with cache information to really customize data structures to use the processor in the best possible manner at runtime.
I thought that I could do something like create a data structure that is composed of an array of structures like this:
where X+Y = cache line size. I might have some information that the algorithm uses in the header_data field, and actual data in the data field. For instance, the header_data might describe the data field in some way for a traversal. Ideally, I could use template programming to generate these cache line sized chunks at compile time to best use the cache. I'm thinking that this kinda system could be used to automatically construct the low-level details of a data structure efficiently without having to worry about cache size changes. This way a simple re-compile on the target architecture gets you the best performance.
I'm thinking along the lines of building algorithms that use smaller structures fitting into a cache line, rather than larger structures distributed around in memory. Traversals using this method could also benefit from hardware memory pre-fetching.
I don't have a solid use case right now, I'm still in the very hard process of figuring out the algorithms that would actually use this. I think I could use this piece of information to build a parallel tree without pointers. Still thinking.
Alternatively, when you add management information that must be handled dynamically, you add administrative overhead, which you may or may not be able to amortize against any performance gained by fitting a particular machine's cache line.
What seems to me to be less overhead, with the same attention to locality, is to try to arrange your data and write your algorithms so that they fit within the smallest cache line you're likely to encounter (probably 32 bytes), and then rely on decimation schemes like blocked_ranges in parallel_for to gang blocks of cache lines together, mostly to reduce locality management overhead. Once you're beyond the size of a cache line or two, there's no general advantage to having adjacent cache lines processed by the same HW thread. There may be algorithmic cases where data relationships mimic memory locality (like octree processing) that might further gain from such localities, but these are not particularly amenable to a general solution.
And that is the bottom line: cache-line fitting has more to do with the data relationships of a particular algorithm than it does with anything else. Careful blocking is a requirement in DGEMM, the BLAS double precision matrix multiply implemented in Intel Math Kernel Library and other places. And I know of at least one physics code where each element of the principal array takes over two cache lines to hold, even though various kernels processing it may only use a few fields in each element. It would be a wonderful performance boost to refactor it, but the cost of rewriting millions of lines of legacy code is prohibitive.