Summary: reducing largeBlockCacheStep and numLargeBlockBins to 16/16 (from 8k/1k) reduced my application's peak virtual size by ~20% with no impact on runtime.
I've been experimenting with various alternative memory allocators in a large EDA application with the objective of reducing runtime. So far, TBBmalloc (4.0 update 2) has given the best results by a wide margin; however, I found that the peak memory use of the application increased quite significantly (~25%) over the existing memory allocator. For better or worse, we keep a pretty close eye on the peak virtual size reported by the OS, and a 25% increase was perceived as a major strike against TBB.
In searching around on this forum, I came across a post suggesting that changing some internal parameters in the allocator could help resolve an out-of-memory condition on Win32. This parameter-tuning approach worked very well for me, so I thought I'd share my results.
The post suggested tuning largeBlockCacheStep and cacheCleanupFreq. After reading through the code, I settled on dialing back the largeBlockCacheStep to reduce the likelihood of caching large blocks of memory. I also reduced numLargeBlockBins to further decrease the likelihood of caching large blocks. I took a small design and measured the peak memory for various values of largeBlockCacheStep; once I'd found the best runtime/peak memory tradeoff for that parameter, I selected that value and performed the same sweep on numLargeBlockBins.
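As a rough mental model of why these two parameters matter (this reflects my reading of the code, so the threshold value and the bin arithmetic below are assumptions, not the exact TBB source):

```cpp
#include <cassert>
#include <cstddef>

// Illustrative model of how the two parameters bound the range of sizes
// the large-object cache will hold onto after a free.
const size_t largeObjectThreshold = 8064; // assumed lower bound for "large" objects
const size_t largeBlockCacheStep  = 16;   // tuned-down value from this post
const int    numLargeBlockBins    = 16;   // tuned-down value from this post

// Sizes at or above this ceiling are never cached; they go straight back to the OS.
size_t maxCachedSize() {
    return largeObjectThreshold + numLargeBlockBins * largeBlockCacheStep;
}

// Which bin would hold a freed block of this size (-1 means "not cached").
int binOf(size_t size) {
    if (size < largeObjectThreshold || size >= maxCachedSize())
        return -1;
    return (int)((size - largeObjectThreshold) / largeBlockCacheStep);
}
```

Under this model, 16/16 means the cache only ever retains blocks within 256 bytes of the threshold, whereas the default 8k/1k covers a window of roughly 8 MB of cacheable sizes, which would be consistent with the large drop in peak virtual size I observed.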
The values I settled on were 16 and 16. I ran TBBmalloc compiled in this configuration across a suite of designs and saw the same runtime improvement as before but with the OS-reported peak memory increase reduced to 3% (from 25%). The runtime of the application on these designs ranges from minutes up to about a day, with memory use from one to twenty gigabytes. The results are consistent across the range of designs (i.e., they favour neither large nor small designs).
I have an unconfirmed theory that the large block caching adds no value for my application and that I'd see all the runtime benefit of TBBmalloc if it were disabled completely.
I would propose that it might be valuable if these parameters were tunable without having to recompile TBB, or if they could be tuned from the make command-line instead of requiring a hack of the header files.
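For example, if the constants were wrapped in #ifndef guards (they are not in the 4.0 sources, and the TBB_* macro names below are my invention, not part of TBB), they could be overridden with -D flags from the make command line without editing headers:

```cpp
// Hypothetical guard pattern; defaults below match the shipped 8k/1k values.
#ifndef TBB_LARGE_BLOCK_CACHE_STEP
#define TBB_LARGE_BLOCK_CACHE_STEP 8192
#endif
#ifndef TBB_NUM_LARGE_BLOCK_BINS
#define TBB_NUM_LARGE_BLOCK_BINS 1024
#endif

static const unsigned largeBlockCacheStep = TBB_LARGE_BLOCK_CACHE_STEP;
static const unsigned numLargeBlockBins   = TBB_NUM_LARGE_BLOCK_BINS;
```

A build invoked with something like `-DTBB_LARGE_BLOCK_CACHE_STEP=16 -DTBB_NUM_LARGE_BLOCK_BINS=16` in the compile flags would then reproduce my 16/16 configuration.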
Also, I believe it would be valuable to have the legal ranges of these values documented somewhere (e.g., comments in the header, static asserts, something). Since choosing 16/16, I've read the code in more detail, and although no code behaves incorrectly for these values, there's at least one comment that is violated by reducing largeBlockCacheStep below 8K: it says something to the effect of "all allocations are at least 16K at this point", but with largeBlockCacheStep below 8K, allocations between 8K and 16K can reach that line.
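Even a simple compile-time assertion would catch out-of-range choices at build time. A sketch (the bounds here are illustrative guesses, not the real constraints, and the macro is a generic C++03-style idiom rather than anything taken from TBB, which predates C++11 static_assert):

```cpp
// Generic C++03-era static-assert idiom: an array of negative size
// is a compile error, so the build fails if the condition is false.
#define MALLOC_STATIC_ASSERT(cond, msg) typedef char msg[(cond) ? 1 : -1]

static const unsigned largeBlockCacheStep = 16;
static const unsigned numLargeBlockBins   = 16;

// Illustrative bounds only; the real code should enforce whatever it actually requires.
MALLOC_STATIC_ASSERT(largeBlockCacheStep >= 1, cache_step_must_be_positive);
MALLOC_STATIC_ASSERT(numLargeBlockBins >= 1, need_at_least_one_bin);
```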
The performance of your allocator is nothing short of amazing (at least for our application, compared to the other allocators I evaluated). Thanks for a great product!
Thank you for the kind words!
Yes, there are a lot of applications that are not interested in the performance of large-object allocation, but only in conservative memory consumption. We must support them better. As a first step, we should better document the ways to disable large-object caching, to move the boundary between large and small objects, etc.
Personally, I don't like an approach where we give a user 1024 knobs and she is free to tune them all, because the result is a combinatorial explosion and a lack of portability to other hardware and workloads. So I'm not in favor of adding run-time parameter changing. A compile-time option looks OK in "expert mode", at least as long as there is no auto-tuning.