- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I'm starting to work on the finer details of YetiSim, and one of the details is memory management. I probably need to revise my design, however for now I am looking at what data structures I can use to increase performance.
I'm looking at the difference between say std::list and std::vector. It seems to be that an implementation of std::list that used a linked list would kill the cache. However, if I pass the cache_aligned_allocator is this sufficient to make the container cache friendly? Are all STL containers cache friendly if they use the cache_aligned_allocator? Which ones should I prefer or avoid ?
Are there negative performance implications if I were to simply override new/delete with the scalable_allocator, and use the cache_aligned_allocator for my containers ?
Thanks,
AJ
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
While it is certainly true that traversing a std::list walks through memory and might cause the displacement of cache lines that have to be recovered later, thus forcing them out prematurely, expanding a std::vector will eventually require it to be relocated in memory, invalidating all cached copies at its old location, and so poses its own problems for optimal cache line reuse. Any design that accumulates data in some structure for later processing is a possible site for cache cooling. Perhaps some of these cases are unavoidable, such as the structures you might consider to represent the simulation network.
Does using cache_aligned_allocator make containers cache friendly? No, it makes them cache line aligned. This can be important if you're doing things like applying vector operations to the contents, where the penalties for unaligned accesses may be severe. It doesn't do much for "cache friendly" concerns, if by that you mean either keeping needed data from going cold through cache line eviction or avoiding cache line thrashing via such mechanisms as false sharing.
If you are doing object creation across a set of threads, using the scalable allocator will improve performance by trying to keep much of the memory block management local to the thread that is using it. Further restricting that to the cache_aligned_allocator gains you the benefits of the scalable allocator (which underlies the cache_aligned_allocator) but only makes sense if there's some specific advantage to aligning the data.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The first problem, is that the divide and conqueror algorithm is hitting its worst performance case when simulation time is incremented evenly... I am copying everything out once and copying it back once. Another problem is that I am using std::vector, so I'm going to investigate migrating to concurrent_vector in critical sections... or perhaps globally to measure performance. I've learned from this that one scheduler will not fit all simulations, I will be adjusting my design to factor out the scheduler and allow the user to pop in/out their own schedulers. I'm going to develop a library of schedulers for different simulation types, and let the user select which one they want.
I've heard a lot about VTune, so I'm going to try it today to investigate performance. I think I may need to revise my design to consider caching issues. I migrated to the scalable_allocator, and moved my STL containers to cache_aligned_allocators, but the performance didn't improve much. If you are interested, you can check the code out of SVN and try the sample simulation, run it with make check. Check out 'svn co http://svn.yetisim.org/svn/yetisim/trunk/yetisim'. The file src/test.cpp has a number of entities which are created, you can adjust that number higher to see performance.
For now, I will try concurrent_vector and investigate how I can merge simulation results back in parallel, since that is a bottleneck too... it's proceeding serially right now.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Linked structures are generally not friendly to modern processors, because for large data structures that do not fit in cache, following each link usually requires loading a new cache line. Therefore, I always try to get by with vectors instead of lists if I can still get the same asymptotic running time. (I.e., if a linked list delivers O(n) time, and the array is O(n^2), that's a bad deal.) The attraction of linked lists is often that you can append an item in O(1) time. Most implementations of std::vector can append an item in O(1) amortized time, so it usually does just as well asymptotically, and is much friendlier to cache.
A related example that Scott Meyers pointed out to me is searching. He noted that in many cases, binary search over an array beats hashing with linked buckets, because of the cost of following links.
The TBB reference manual has ParallelMerge as an example. If results can be collected separated on each thread, then I would use std::vector instead of tbb::concurrent_vector. tbb::concurrent_vector is of interest if you really do need to have multiple threads accessing and growing the vector at the same time. But to do that, it definitely does more gymnastics than std::vector.
The developer source release for today contains some significant space/speed improvements to tbb::concurrent_vector.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Okay, each node in my state machine is a "linked" structure. I'm going to present an oversimplified version of what my node looks like:
class Node
{
std::vector
}
In reality, I'm using boost::shared_ptr not Node*. However, each node is linked. The best performance would be obtained if the entire state machine could make it into the cache. I suspect that because each node links to others via pointers, that the problem is that these nodes may be located randomly in memory. Could it be that if I controlled where the Nodes were allocated, that this would improve cache performance?
I'm plan on reading articles on caching today, and examining execution with VTune. Perhaps this is not the problem, but it seems like it might be.
AJ
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I'd first profile the code carefully before pushing cache optimizations. If you do determine that cache is an issue, here's some things to try:
- Declare the fields in the node in order of decreasing size. That sometimes reduces padding. Using increasing order works too.
- If on a 64-bit machine, consider replacing pointers with 32-bit indices into a global array.
- Walk it in roughtly the order it will be walked when the state machine is run, and copy the nodes into a contiguousarray in that order. I.e., act like a copying garbage collector. An occasional break in the array does not hurt; i.e., if you run out of room, just allocate another array for more space.
- Arch
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I used gprof to gather this information, I couldn't for the life of me get VTune to work properly on my project. I've posted on the VTune forum about that.
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page