Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

What is the expected effect of scalable_allocation_command's buffer cleaning?

Tim_Day
Beginner

I have obtained good speedups using the scalable allocator on various occasions in the past. However, in producer/consumer systems where allocations and deallocations are performed by separate task invocations (and so, presumably, often by different threads), it can be frustrating to use because of the allocator's apparent propensity for retaining freed memory internally. It doesn't technically leak it, since subsequent TBB-thread allocations can draw on it; but in complex systems where TBB is used in just one part of a processing pipeline, I often want to reclaim that store for other purposes.

Anyway, I got quite excited when I saw that TBB 4.2 has scalable_allocation_command(TBBMALLOC_CLEAN_ALL_BUFFERS, 0), which looked like it might do something about this.

Attached (tbbmem.cpp) is some minimal test code.

Compiled with g++ 4.8.2 on an amd64 Debian sid using

g++ -std=c++11 -o tbbmem -march=native -O3 -g tbbmem.cpp -ltbb -ltbbmalloc

It outputs:

Pre-allocation: 0.0264 GByte process size
After parallel allocation: 1.87 GByte process size
After parallel deallocation: 1.87 GByte process size
After tbbmalloc clean: 1.87 GByte process size

Now I was hoping this newfangled clean command would somehow magically shrink the process size back down to where it started, but clearly not.

So: what does it actually do? And, more importantly, is there anything I can do so that the situation the attached code models releases the deallocated memory back to where it is available for other purposes?

1 Solution
Alexei_K_Intel
Employee

Hi,

  1. scalable_allocation_command(TBBMALLOC_CLEAN_ALL_BUFFERS, 0) is supposed to clean internal buffers, and it may reduce memory consumption.

    Thank you for the small issue reproducer. It helped us observe the issue inside the TBB malloc library. We will try to fix it in a future release. Sorry for the inconvenience.

    (Some technical details: in your reproducer, when many threads are used, many deallocations happen on threads other than the ones that performed the corresponding allocations. For now, memory freed this way is invisible to the TBBMALLOC_CLEAN_ALL_BUFFERS command and prevents bigger blocks of memory from being returned to the OS. It can still be used for new allocations, so it is not leaked.)

  2. If I understand your needs correctly, a workaround is possible. I suggest using scalable memory pools: you may change

    [cpp]tbb::scalable_allocator<payload> tbb_allocator;[/cpp]

    with

    [cpp]
    #define TBB_PREVIEW_MEMORY_POOL 1
    #include "tbb/memory_pool.h"

    tbb::memory_pool< std::allocator<char> > mem_pool;
    tbb::memory_pool_allocator<payload> tbb_allocator( mem_pool );
    [/cpp]

    and destroy it after the deallocation loop, e.g. by placing everything in a scope:

    [cpp]
    {
        tbb::memory_pool< std::allocator<char> > mem_pool;
        tbb::memory_pool_allocator<payload> tbb_allocator( mem_pool );

        // allocation loop
        parallel_for...

        // deallocation loop
        parallel_for...
    } // here the memory pool is destroyed and its memory is returned to the OS
    [/cpp]

    It should help you to reduce memory consumption.

  3. You mentioned that “TBB is just used in part of a processing pipeline”. Could you explain your application model? Is it some kind of plugin model, where a module with TBB is loaded when necessary? Or is “a processing pipeline” called in some outer loop, with TBB invoked regularly like other calculations? We would like to understand the usage models of scalable_allocator; it would help us with its future development.

Regards, Alex

jimdempseyatthecove
Honored Contributor III

I think what you are observing is that on first allocation the virtual-memory footprint grows, and that the clean-all-buffers call does not shrink it.

The returned memory may now sit within one of the process's heaps (Linux may have only one per process), so it does not appear as a reduction of the process footprint: the unused memory is now free nodes within a heap, which is still part of the process's virtual memory.

What you need to do is see what happens on subsequent iterations. Note that fragmentation can occur, depending on allocation order.

Last note: once a process starts, heaps tend to grow, not shrink. Try increasing your page-file size (unused heap may get paged out).

Jim Dempsey

Tim_Day
Beginner

Alexei Katranov (Intel) wrote:

  1. Thank you for the small issue reproducer. It helped us observe the issue inside the TBB malloc library. We will try to fix it in a future release. Sorry for the inconvenience.

  2. If I understand your needs correctly, a workaround is possible. I suggest using scalable memory pools: you may change

  3. You have mentioned that “TBB is just used in part of a processing pipeline” Could you explain your application model?

Thanks for the response, and for confirming there is something amiss/unintended in the current behaviour.  I'd certainly look forward to an eventual fix, but in the meanwhile the suggestion to try a scalable pool is a good one; I will report back once I've tried it.  (It's actually been a few years since I last attempted to use scalable_allocator, and I've only recently started using TBB 4.2; I'm pretty sure pools weren't available the last time I looked, given that they need #define TBB_PREVIEW_MEMORY_POOL in 4.2.)

The application area of interest to me is rendering of procedurally generated simulated/virtual environments.  A commonly recurring pattern is parallelisation over some (possibly non-uniform) spatial subdivision scheme which allocates a large number of objects in a world (and benefits significantly from scalable_allocator), followed by sorting/binning of those objects into the rendered view space, transformation or decoration of them with additional cached rendering information (more allocs/frees), and parallelisation over view tiles.  This easily gives rise to the kind of allocation/free pattern illustrated in my sample code.  I've only recently started revisiting this code, and I remember running into and being confused by this scalable_allocator behaviour before.  I have a better idea of what's going on now, though, so I may be able to work around it better, whether by using a pool (as suggested) or, ironically, by making more use of scalable_allocator in the "consumer" rendering tasks to reuse the memory it retains.  (Basically my algorithms can use all the RAM they can get in both the scene-creation and rendering passes, so having a significant amount locked away in scalable_allocator reduces the complexity of the world I can create/render.)

Tim_Day
Beginner

jimdempseyatthecove wrote:

I think what you are observing is on first allocation the Virtual Memory footprint enlarges. On clean all buffers the Virtual Memory footprint does not reduce. The returned memory may now exist within one of the process's heaps (Linux may only have one per process), thus not appear as a reduction of process footprint (as the unused memory is now free nodes within a heap as well as part of the virtual memory of the process).

No, I've done enough testing of this problem to know that the memory is very much tied up in TBB somewhere, and not available for other (non-scalable_allocator) allocations.  Alexei's post confirms the issue is real.

Tim
