redcat76
Beginner

how to free memory allocated by scalable_allocator

Hi!
I have 1 worker thread that processes requests from other threads. Requests are relatively small objects, ~2K each, allocated by scalable_allocator<>::allocate. The worker thread calls scalable_allocator<>::deallocate once it is done with a request. When I have peak load from client threads and the pending request queue is growing (since the worker thread cannot process requests as fast as they are supplied), scalable_allocator allocates memory for the peak number of requests and will not return it to the OS. AFAIK this is by design, as it assumes the threads will be re-using the memory. Still, in my case this may result in too large memory consumption. Since peak load is very rare and I know when it can be generated, I'd like to simply make scalable_allocator release its pooled memory.

Is there any way to make scalable_allocator decrease or completely release its pooled memory?

28 Replies
jimdempseyatthecove
Black Belt

One of the TBB support people may give you a better answer to this...

The TBB allocator allocates within a process (an application). This allocation is a virtual memory association between the application and the system page file, not physical memory. Therefore there is no such concept as returning the memory to the OS (or taking memory from the OS). The only "negative" effect of the TBB allocator not "returning memory" is that your system may require a larger page file. When the TBB memory is returned from the application to the TBB pool, that memory may (eventually) get swapped out to the page file and sit there unused (until the application closes). In the era of 1TB disks for < $100, a few extra megabytes of disk space should not be of too much concern.

Note, if you are running on Windows, and depending on version of Windows, you may have to up the page file size (or limit). For some lame reason MS thought Page File Size == Physical Memory Size was what everyone needs. I could never understand that.

Jim Dempsey
Chris_M__Thomasson
New Contributor I
Quoting - redcat76

Is there any way to make scalable_allocator decrease or completely release its pooled memory?

In addition to what Mr. Dempsey has written, I feel that I should point out the following thread that shows a potential problem with per-thread allocation schemes, such as TBB allocator, in general:

http://software.intel.com/en-us/forums/showthread.php?t=61716

If Thread `A allocates a large amount of memory `M which is subsequently freed by Thread `B, and Thread `A does not allocate any more memory, then all those allocations which make up `M are leaked for the duration of `A's lifetime. A possible solution is to allow Thread `A to periodically or episodically call a flush function that will reclaim memory on its remote-free list.
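The remote-free-list scheme above can be sketched in plain C++. This is only an illustration of the idea, not TBB's actual internals; `ThreadPool2K` and its method names are invented here, and a real version would have the owner/remote distinction enforced per thread:

```cpp
#include <cstddef>
#include <mutex>
#include <new>
#include <vector>

// Sketch: the owning thread allocates and frees lock-free from its local
// list; other threads park frees on a lock-protected remote list; the
// owner reclaims them only when it calls flush().
class ThreadPool2K {
public:
    static const std::size_t kBlockSize = 2048;

    // Owner thread only.
    void* allocate() {
        if (local_free_.empty())
            return ::operator new(kBlockSize);  // grow from the system heap
        void* p = local_free_.back();
        local_free_.pop_back();
        return p;
    }

    // Owner thread only: no lock needed.
    void deallocate_local(void* p) { local_free_.push_back(p); }

    // Any other thread: park the block for the owner to reclaim later.
    void deallocate_remote(void* p) {
        std::lock_guard<std::mutex> lock(remote_mutex_);
        remote_free_.push_back(p);
    }

    // The "flush function" the thread wishes for: reclaim remote frees,
    // and optionally hand all pooled blocks back to the system.
    void flush(bool release_to_system) {
        {
            std::lock_guard<std::mutex> lock(remote_mutex_);
            local_free_.insert(local_free_.end(),
                               remote_free_.begin(), remote_free_.end());
            remote_free_.clear();
        }
        if (release_to_system) {
            for (std::size_t i = 0; i < local_free_.size(); ++i)
                ::operator delete(local_free_[i]);
            local_free_.clear();
        }
    }

    std::size_t pooled() const { return local_free_.size(); }

private:
    std::vector<void*> local_free_;   // touched by the owner only
    std::vector<void*> remote_free_;  // filled by other threads
    std::mutex remote_mutex_;
};
```

Without the periodic flush, blocks freed remotely would accumulate on `remote_free_` forever, which is exactly the leak described above.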



redcat76
Beginner
Hi, Jim! Thank you very much for your reply.

One of the TBB support people may give you a better answer to this...

The TBB allocator allocates within a process (an application). This allocation is a virtual memory association between the application and the system page file, not physical memory. Therefore there is no such concept as returning the memory to the OS (or taking memory from the OS). The only "negative" effect of the TBB allocator not "returning memory" is that your system may require a larger page file.

Jim Dempsey

Unfortunately the problem is not that imaginary for me, as VM size is also limited: when the thread generating requests is too fast for the worker thread to process them, I see a rapid growth of my process's VM size and soon run out of virtual space. I run my program under 32-bit Windows XP, so this happens at a program VM size of ~2 GB. Then I get an access violation elsewhere in the program as regular malloc returns 0.

Watching "Mem Usage" and "VM size" deltas in Task Manager, I calculated that 1 request costs me 5K of VM space. So if I have ~400,000 pending requests... crash-boom-bang.

Note that with an STL allocator based on the MSVC RT malloc/free I don't have this problem: a call to deallocate returns VM space, so the program survives even peak loads, although it works much slower.

I introduced an artificial delay for the generating thread if the number of pending requests exceeds a pre-defined limit, but in this case performance drops as well, since the generating threads are waiting. I'd have a chance of fast performance even with peak loads if I could make scalable_allocator release reserved virtual memory (I believe in a Windows implementation that would be a call to VirtualFree with MEM_RELEASE or something like that).
redcat76
Beginner
A possible solution is to allow Thread `A to periodically or episodically call a flush function that will reclaim memory on its remote-free list.



Thank you, Chris!

Actually I have seen this thread before, but you made me read it more carefully. Still, as far as I could tell, it ended with no decision: such a "flush" function does not yet exist, at least in the "official" scalable_allocator interface. Or am I mistaken?
I could try to compile the source code (that was also described in some thread here) and call some implementation-detail function, but IMHO this is not a good solution.
jimdempseyatthecove
Black Belt

What you might want to consider is restricting the use of the TBB allocator to objects with a high frequency of allocation/deallocation. Then use new or an STL allocator for the lower-frequency allocation/deallocation. This may get you through the problem.

A second technique would be to create and use persistent objects and an object pool. Then, in place of deleting an object, you retire it to a pool (first to a non-interlocked thread-private pool, then next to an interlocked global pool). This can all be hidden away in your own allocator, and then you have the capability to fine-tune the allocation/deallocation as a trade-off of footprint vs. overhead. And you can slip this into your existing code without too much effort.
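A minimal sketch of the retire-to-pool idea (`RecordPool` and `LogRecord` are made-up names; a real version would add the interlocked global pool mentioned above, this sketch shows only the thread-private part):

```cpp
#include <cstddef>
#include <vector>

// Illustrative request object: fixed 2K payload, as in the original post.
struct LogRecord { char buffer[2048]; };

// Instead of delete, retire objects to a capped pool for re-use. The cap
// is the footprint-vs-overhead knob: a small cap returns memory sooner,
// a large cap makes re-use (and thus allocation) cheaper.
class RecordPool {
public:
    explicit RecordPool(std::size_t max_pooled) : max_pooled_(max_pooled) {}
    ~RecordPool() {
        for (std::size_t i = 0; i < pool_.size(); ++i) delete pool_[i];
    }

    LogRecord* acquire() {
        if (pool_.empty()) return new LogRecord;
        LogRecord* r = pool_.back();
        pool_.pop_back();
        return r;
    }

    void retire(LogRecord* r) {
        if (pool_.size() < max_pooled_)
            pool_.push_back(r);   // keep for re-use
        else
            delete r;             // cap reached: give the memory back
    }

    std::size_t pooled() const { return pool_.size(); }

private:
    std::size_t max_pooled_;
    std::vector<LogRecord*> pool_;
};
```

Tuning `max_pooled` per deployment is exactly the footprint/overhead trade-off described above.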

In my QuickThread allocator I use a similar technique (but which is extended for NUMA considerations). The QT allocator is not so sensitive to allocations performed in thread A, node passed to thread B, node passed to other than thread A and deleted. A larger footprint will occur, but excessive allocations/deallocations in this manner are accounted for and handled appropriately.

As to whether this would work for your application, it is hard to say. I think a similar technique that you roll for yourself should work just fine.

Jim Dempsey
jimdempseyatthecove
Black Belt

Another technique I forgot to mention: when your ~400,000 pending requests result in allocations of differing sizes, try creating a polymorphic object that can be any of the differing-sized objects (which cause the allocation problem). The polymorphic object would always allocate to the size of the largest encapsulated object. There will be some memory waste when used on smaller objects, but you may make this up in having more re-usability of the objects once deallocated. (A simple way of doing this is with a union and a typeID.)
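The union-plus-typeID trick might look like this (the member request types here are invented for illustration):

```cpp
#include <cstddef>

// Two hypothetical request flavours of very different sizes.
struct TextRequest  { char text[2048]; };
struct FlushRequest { int  target_fd; };

// One fixed-size object that can stand in for either request type.
// Every allocation is the size of the largest member, so any freed
// AnyRequest can be re-used for any future request, whatever its type.
struct AnyRequest {
    enum TypeID { kText, kFlush };
    TypeID type;
    union {
        TextRequest  text;    // largest member sets the object size
        FlushRequest flush;
    } payload;
};
```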

Jim Dempsey
redcat76
Beginner

What you might want to consider is restrict the use of the TBB allocator to objects with high frequency of allocation/deallocation. [...] A second technique would be to create and use persistent objects and an object pool. [...]

Jim Dempsey

Thank you for proposed solutions!
I'll give it a try with the object pool model. I just had a feeling from the documentation that tbb::scalable_allocator behaves in a way similar to such an object pool. The only problem is there's no way to shrink its allocated VM space after peak loads...
redcat76
Beginner

Another technique I forgot to mention, when your ~400,000 pending requests result in allocations of differing sizes, try creating a polymorphic object that can be any of the differing sized objects. [...]

Jim Dempsey

For me this is not the problem, as all requests have a fixed size of ~2K. In a way they are already polymorphic, just as you suggested.

I described the problem in general, but to be more exact I use this model for logging: many threads are posting log records, and 1 thread persists them to a log file. To avoid dynamic allocations for stream buffers of different sizes I use a fixed-size buffer of 2K inside each log record object. The problem is that most log records use on average ~5% of the 2K buffer space. So the immediate solution would be to optimize this and place more data in one log record (e.g. several lines of log messages, up to the buffer's maximum space). Thus I hope to reduce the number of log record objects allocated.
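The batching idea in the last paragraph could be sketched like this (`BatchingLogger` is a made-up name, and the std::string vector merely stands in for allocating a real 2K LogRecord):

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Pack many short log lines into one 2K buffer and emit a "record" only
// when the buffer fills, instead of one record per line.
class BatchingLogger {
public:
    static const std::size_t kBufSize = 2048;

    void append(const std::string& line) {
        // Flush first if this line would not fit (assumes a line fits in 2K).
        if (used_ + line.size() + 1 > kBufSize) flush();
        std::memcpy(buf_ + used_, line.data(), line.size());
        used_ += line.size();
        buf_[used_++] = '\n';
    }

    void flush() {
        if (used_ == 0) return;
        // Stand-in for "allocate one LogRecord via the scalable allocator".
        emitted_.push_back(std::string(buf_, used_));
        used_ = 0;
    }

    std::size_t records_emitted() const { return emitted_.size(); }

private:
    char buf_[kBufSize];
    std::size_t used_ = 0;
    std::vector<std::string> emitted_;
};
```

With ~20-byte messages this turns roughly a hundred record allocations into one, which is the ~5% utilization problem solved at the source.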
jimdempseyatthecove
Black Belt

Possibly the easiest way to "hack in" a fix is to modify the tail-end TBB task to pass the data into a filter handled by a non-TBB thread. This filter could compact the data (write when necessary) and return the 2K buffer (or release the 2K buffer). It is not pure TBB, but so what. The compaction could be done on either the TBB side or the ancillary thread. The benefit of using the ancillary thread is the TBB thread won't block for I/O (since I/O is now done in the ancillary thread).

Jim Dempsey


244 Views



In my QuickThread allocator I use a similar technique (but which is extended for NUMA considerations). [...]

Jim Dempsey


I read the QuickThread PDF. It sounds interesting; where can I find more details?


redcat76
Beginner

Possibly the easiest way to "hack in"a fix is to modify the tail-end TBB task to pass the data into a filter handled by a non-TBB thread. This filter could compact the data (write when necessary)and return the 2K buffer (or release the 2K buffer). It is not pure TBB but so what. The compaction could be done on either the TBB side or the ancilliary thread. The benefit of using the ancilliary thread is the TBB thread won't block for I/O (since I/O is now done in the ancilliary thread).

Jim Dempsey


Jim Dempsey

Thank you, Jim! I'm really impressed by the number of options you gave me to solve this!

For now I used the following simple solution: I know the places in the source code that can generate lots of logging. So for such cases I implemented a stream-based logger that writes all incoming log messages of the same type to an internal fixed-size buffer and creates LogRecord objects using tbb::scalable_allocator only in 3 cases:
- the buffer overflows (then I effectively use all 2K of memory space and do not create extra costly LogRecords);
- the logging thread explicitly calls flush (not used for now);
- the stream object is destroyed (flushes all pending data).
This allowed me to greatly reduce the number of LogRecord allocations, and the resulting VM footprint is now quite low.

Still, IMHO all these are just workarounds. I have a feeling that a generic scalable allocator should definitely help to scale up at the cost of more memory resources, but it should also have an option to scale down in case of peak stress loads. Or have an explicit note in the documentation on possible unbounded VM waste under certain conditions, and a recommendation to use it mostly under 64-bit systems. As I have seen in my case, it can quite quickly and not very expectedly consume all 2 GB of virtual memory space available to a process on 32-bit Windows systems. Again, I may be wrong; just IMHO.

Anyway, Jim, thanks again! I hope to get your valuable assistance if I have other posts in this forum.
jimdempseyatthecove
Black Belt


I read the QuickThread PDF. It sounds interesting; where can I find more details?



Send me your email and I will email a current .doc file and other supporting documents. If after reading you find you are interested in exploring further, I can send you a beta test kit. Note, QuickThread runs on Windows platforms now; when I have time, and if someone can help me figure out Linux (Ubuntu installed on my system on a separate disk), then I can get to work on a Linux version.

The beta license will restrict you to "evaluation purposes only". Things are in a little bit of flux now, and a few more revisions are expected before you ship applications built with it. The more testers I have, the better the shake-down.

Jim Dempsey
jim (dot) (zero) (zero) (dot) dempsey (at) gmail (dot) com
or
jim (underbar) dempsey (at) ameritech (dot) net
jimdempseyatthecove
Black Belt

You may find it a non-option to dynamically tune any scalable allocator. Your only option may be to tune your application or make concessions, one or more of:

- throttle the application down when you get too much of a backlog
- discard some log messages, if acceptable
- convert "Warning Will Rogers..." into codes: byte, short, word, long, ...
- devise a compression technique that is fast and can reconstitute the original log
- use a pipe to push log data to a separate process (on an x32 system)
- use a memory mapped file in lieu of a pipe to push log data to a separate process (on an x32 system)
- ... there may be a few other tricks you can use too.

Jim Dempsey


Alexey_K_Intel3
Employee
Quoting - redcat76
Thank you for proposed solutions!
I'll give it a try with the object pool model. I just had a feeling from the documentation that tbb::scalable_allocator behaves in a way similar to such an object pool. The only problem is there's no way to shrink its allocated VM space after peak loads...

Indeed the TBB allocator does have the pool of objects (more exactly, a variety of per-thread pools for objects of different sizes).

I think the real issue is that due to faster allocations the peak itself is higher than what the system can tolerate. I.e. the application just does not survive the peak. If VM was returned to the OS, this would slow down both allocation and deallocation, thus lowering the rate of allocations and decreasing the peak load. To me, that sounds more like an artificial workaround, while what you have implemented is the right solution: first, improve memory utilization in the app, and second, check if there is so much data not yet processed that lack of memory might be a problem.

We will consider adding a function to flush unused VM in future versions of the allocator. This is hard to do in the current design, where VM is mapped in rather big pieces (1 MB) which are then distributed across many threads - it is hardly ever possible to have every part of a piece freed so it can be returned back to the OS. Allocation speed in a multi-threaded environment comes with a price [added] - though possibly the price component in question can be reduced.
jimdempseyatthecove
Black Belt

I agree here with Alexey. It would be questionable whether a slab-based allocator (TBB and QuickThread) could ever return slabs of VM. Alexey might be able to answer this for TBB. It could be possible that, when memory is low and you find an overabundance of returned allocations of some size, and if these former allocations were large enough, these allocations could be split in order to satisfy the current allocation problem. If you are tight on memory, you may have to go with malloc/free. Or some mixture.

Have you experimented with using a pipe to a separate process? Note, each process on a 32-bit system is a separate virtual address space. Until you migrate to 64-bit, you can shove some data (or processing) into a separate VM.

Or you could even use OpenMPI on the same system to accomplish the same thing. Place major functions with low communication overhead into separate processes. Then use OpenMPI in place of the pipe/memory mapped file. Your work effort would apply for use on larger MPI-based systems later.

Jim Dempsey
redcat76
Beginner

Indeed the TBB allocator does have the pool of objects (more exactly, a variety of per-thread pools for objects of different sizes). [...] We will consider adding a function to flush unused VM in future versions of the allocator. [...]

Thank you, Alexey! Your arguments sound quite convincing. So I agree it is more of an app-level responsibility to fine-tune such things.

One of the reasons I started this thread, though, is that with the MSVC run-time malloc/free my app survives even the peak load, at the cost of performance. And the memory allocation after the peak is the same as before, as all the memory is returned back to the system. But with the TBB scalable_allocator it crashes. So another option for me is to temporarily switch to a malloc/free-based allocator to process peak loads.

Not sure how it fits into the current design, as I still have not studied the sources, but I think this feature could be added to tbb::scalable_allocator: if the memory allocated by TBB is higher than a threshold set by a call to some interface function (set_max_scalable_memory or something), then TBB passes all allocations to malloc/free. By default there is no threshold. I suspect TBB may not be aware of the total allocations made by all threads, so it may be a thread-based feature. Then the app programmer can configure this threshold per box, based on VM size (32-bit or 64-bit), amount of physical memory and results of stress testing.
redcat76
Beginner

I agree here with Alexey. It would be questionable if a slab based allocator (TBB and QuickThread) could ever return slabs of VM. [...]

Jim Dempsey

So far I have resolved the issue by making a simple stream buffer that makes more effective use of memory. So now lack of VM space is not a problem.

Still, using a separate process for logging may be a good idea. But something tells me that pipes may not be fast enough to use them directly from logging threads. So I may leave the current design unchanged (logging threads just post requests to the processing thread) and make the processing thread send log requests through a pipe to a separate logging process, instead of directly writing to a file on disk. That should be faster, so memory consumption in the main app should be lower. The logging process may have 1 thread for accepting incoming requests and 1 for actual disk writes. Otherwise this chain will again work at the speed of disk access.
jimdempseyatthecove
Black Belt
Quoting - redcat76

So far I resolved the issue by making a simple stream buffer that makes more effective use of memory. [...]

Pipes are fairly fast. However, if you use a memory mapped file, not for the I/O buffer, but as a ring buffer visible to both processes, then what you have is shared memory block(s).

Buffer[(fillPointer++)%bufferSize] = data;

The data will be visible to the other thread in the other process as fast as it takes the cache coherency system to percolate the write. Caution: this buffer may reside at different addresses in each process. Don't pass information using pointers; instead use offsets or values.

Jim Dempsey
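Jim's one-liner can be fleshed out into a single-producer/single-consumer ring indexed by offsets rather than pointers. This sketch shows only the offset arithmetic; a real cross-process version would live in a region created with CreateFileMapping/MapViewOfFile and use interlocked/atomic updates of the positions:

```cpp
#include <cstddef>

// SPSC ring buffer addressed by monotonically growing offsets, so both
// processes can refer to positions without sharing raw pointers.
template <std::size_t N>
struct Ring {
    unsigned char data[N];
    std::size_t fill;   // advanced by the producer only
    std::size_t read;   // advanced by the consumer only

    Ring() : fill(0), read(0) {}

    bool put(unsigned char b) {
        if (fill - read == N) return false;   // full
        data[fill % N] = b;                   // Jim's Buffer[(fillPointer++)%bufferSize]
        ++fill;
        return true;
    }

    bool get(unsigned char* out) {
        if (read == fill) return false;       // empty
        *out = data[read % N];
        ++read;
        return true;
    }
};
```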
jimdempseyatthecove
Black Belt
Quoting - redcat76

One of the reasons I started this thread though is that with MSVC run-time malloc/free my app survives even the peak load at the cost of performance. [...] If the memory allocated by TBB is higher than a threshold set by a call to some interface function (set_max_scalable_memory or something) then TBB passes all allocations to malloc/free. [...]

Your suggestion (threshold) wouldn't work out, because the free memory inside the TBB pool would not be available to the subsequent malloc/free allocations. This would compound the tight memory situation. The best solution, under the tight memory situation, is to sparingly use the TBB pool system for rapid allocations that do not experience a ballooning effect. For allocations that do (e.g. your 2K transactions under peak load) you would write your own allocator that can and should use a threshold system. Whether it incorporates TBB into the scheme would be up to you.

Jim Dempsey
redcat76
Beginner

Pipes are fairly fast. However, if you use a memory mapped file, not for the I/O buffer, but as a ring buffer visible to both processes, then what you have is shared memory block(s). [...]

Jim Dempsey

True, shared memory is quite fast, but again, if any logging thread is to access and hence write to it, it requires a low-latency lock for simultaneous writers, and besides that a lock for the reader (the log-processing app). Since the writers are threads of one process, they can use a critical section object. The reader can be implemented without a lock if the writers maintain the current write position in the shared memory and the reader checks the write position twice, before and after reading, to avoid dirty reads... Just some fast comments, but definitely worth thinking over.

Speaking of memory-mapped files... I have another interesting task. One process is writing a stream of data to shared memory that is read by other processes. I implemented non-blocking access to it by a technique similar to what I described above: the writer updates the write position in the shared memory using an interlocked function, and readers check the write position before and after reads to detect dirty data (the writer has overwritten memory while the reader accessed it). Everything is fine, but I have 1 problem: effective notification of readers that there's data available in the shared memory. I don't want the reading processes to constantly poll the memory to see if there's anything to read. Currently I use a separate auto-reset event object for each reader: once the writer appends a new chunk to the shared memory, it sets all registered events. Readers read up to the write position and then wait on their individual events. But calling SetEvent for each registered reader introduces performance spikes in the writer process, which I'd like to avoid. I squeezed my brains trying to figure out a scheme that would use 1 synchronization object... But so far no luck. Since this topic is not directly related to TBB, in case you have some ideas to discuss, here's my e-mail: RRedCat@Yandex.Ru.
Again thank you for your suggestions!