
how to free memory allocated by scalable_allocator

redcat76
Beginner
Hi!
I have 1 worker thread that processes requests from other threads. Requests are relatively small objects, ~2K each, allocated by scalable_allocator<>::allocate. The worker thread calls scalable_allocator<>::deallocate once it is done with a request. When I have peak load from client threads and the pending request queue is growing (since the worker thread cannot process requests as fast as they are supplied), scalable_allocator allocates memory for the peak number of requests and will not return it to the OS. AFAIK this is by design, as it assumes the threads will be re-using the memory. Still, in my case this may result in too large a memory consumption. Since peak load is very rare and I know when it can be generated, I'd like to simply make scalable_allocator release its pooled memory.

Is there any way to make scalable_allocator decrease or completely release its pooled memory?
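
To make my setup concrete, here is a minimal sketch of the pattern (all names are illustrative, and it assumes a blocking concurrent_bounded_queue to hand requests over):

```cpp
// Sketch of the pattern described above: client threads allocate
// requests with scalable_allocator, the single worker thread
// deallocates them after processing. Names are illustrative.
#include "tbb/scalable_allocator.h"
#include "tbb/concurrent_queue.h"

struct Request { char payload[2048]; };           // ~2K each, POD

tbb::concurrent_bounded_queue<Request*> pending;  // clients -> worker

void client_submit() {                            // runs on client threads
    tbb::scalable_allocator<Request> a;
    Request* r = a.allocate(1);                   // scalable_allocator<>::allocate
    // ... fill in *r ...
    pending.push(r);
}

void worker_loop() {                              // the single worker thread
    tbb::scalable_allocator<Request> a;
    Request* r;
    for (;;) {
        pending.pop(r);                           // blocks until a request arrives
        // ... process *r ...
        a.deallocate(r, 1);                       // scalable_allocator<>::deallocate
    }
}
```

Under peak load the queue grows, and the memory stays cached inside the allocator after the worker frees it.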
1 Solution
Chris_M__Thomasson
New Contributor I
Quoting - redcat76
Is there any way to make scalable_allocator decrease or completely release its pooled memory?

In addition to what Mr. Dempsey has written, I feel that I should point out the following thread, which shows a potential problem with per-thread allocation schemes, such as the TBB allocator, in general:

http://software.intel.com/en-us/forums/showthread.php?t=61716

If Thread `A allocates a large amount of memory `M which is subsequently freed by Thread `B, and Thread `A does not allocate any more memory, well, all those allocations which make up `M are leaked for the duration of `A's lifetime. A possible solution is to allow Thread `A to periodically or episodically call a flush function that will reclaim memory on its remote-free list.
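
For later readers: newer versions of the TBB allocator expose exactly this kind of flush hook. A minimal sketch, assuming a tbbmalloc recent enough to declare scalable_allocation_command in tbb/scalable_allocator.h (it did not exist at the time of this thread):

```cpp
// Ask tbbmalloc to give cached memory back, assuming a TBB version
// that provides scalable_allocation_command (added after this thread).
#include "tbb/scalable_allocator.h"

void flush_thread_caches() {
    // Returns the calling thread's per-thread caches to internal pools.
    scalable_allocation_command(TBBMALLOC_CLEAN_THREAD_BUFFERS, 0);
}

void flush_all_caches() {
    // Cleans all internal buffers; blocks that become entirely free
    // may then be returned to the OS.
    scalable_allocation_command(TBBMALLOC_CLEAN_ALL_BUFFERS, 0);
}
```

Calling such a flush after a known peak has drained is a direct fit for the scenario in the question.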


28 Replies
Vivek_Rajagopalan
Quoting - redcat76
Hi!
I have 1 worker thread that processes requests from other threads. Requests are relatively small objects, ~2K each, allocated by scalable_allocator<>::allocate. The worker thread calls scalable_allocator<>::deallocate once it is done with a request.

Is there any way to make scalable_allocator decrease or completely release its pooled memory?

Okay, so it's my turn to run into this :-)

If Thread-1 scalable_mallocs work tokens for Thread-2, should used tokens be passed back to Thread-1 to be scalable_freed? Is this a best practice?

In my case, both these threads host tbb::pipelines. I observed that memory usage climbed to almost 90% under load but did not grow significantly after that. It may well be harmless, but I'd like to have some control over it.

Thanks,

RafSchietekat
Valued Contributor III
"If Thread-1 scalable_mallocs work tokens for Thread-2, should used tokens be passed back to Thread-1 to be scalable_freed ? Is this a best practice ?"
Refurbishing memory, or returningit to the thread that allocated it,may be useful optimisations, but you know what they say about premature optimisation... If thread 1 continues to allocate more memory, it will soon get around to reusing the memory deallocated from inside another thread, and unless I'm mistaken it is gradually made available for other threads as well (if all its neighbours in a block are also free). Still, the assumption seems to be that there aren't that many threads and that they don't specialise in a role, which makes sense for a task-based system. If thread 1 stops doing anything with the scalable memory allocator, the memory deallocated from another thread may stay in limbo for an extended amount of time, again not very likely with tasks, but if you encounter this problem with a user thread you might just want to close it down to have its memory taken over by another thread (to be confirmed).

"In my case, both these Threads host tbb::pipelines. I observed that memory usage climbed up to almost 90% under load, but did not grow significantly after that. It may well be harmless, but I 'd like to have some control over it."
Difficult to say anything definite without more information...
Vivek_Rajagopalan
Quoting - Raf Schietekat
"If Thread-1 scalable_mallocs work tokens for Thread-2, should used tokens be passed back to Thread-1 to be scalable_freed ? Is this a best practice ?"
Still, the assumption seems to be that there aren't that many threads and that they don't specialise in a role, which makes sense for a task-based system. If thread 1 stops doing anything with the scalable memory allocator, the memory deallocated from another thread may stay in limbo for an extended amount of time, again not very likely with tasks, but if you encounter this problem with a user thread you might just want to close it down to have its memory taken over by another thread (to be confirmed).


Thanks once again Raf,

I am restructuring the code and will update this thread on what I learnt from this.
RafSchietekat
Valued Contributor III
Quoting - Vivek_Rajagopalan
Thanks once again Raf,

I am restructuring the code and will update this thread on what I learnt from this.
I'm curious: exactly what did you change, why, and was it beneficial?

A clarification of what I wrote above: directly returning memory to the thread that allocated it would require being very careful to piggyback on a synchronisation cost that you are paying anyway, I think; otherwise you would have gained nothing, or worse. I don't know if anybody has successfully applied it yet (?), but it remains a theoretical possibility. You would be far more likely to benefit from refurbishing, though, if the opportunity presents itself.
Vivek_Rajagopalan
Quoting - Raf Schietekat
I'm curious: exactly what did you change, why, and was it beneficial?

A clarification of what I wrote above: directly returning memory to the thread that allocated it would require being very careful to piggyback on a synchronisation cost that you are paying anyway, I think; otherwise you would have gained nothing, or worse. I don't know if anybody has successfully applied it yet (?), but it remains a theoretical possibility. You would be far more likely to benefit from refurbishing, though, if the opportunity presents itself.


I gave up the refurbishing idea, because it appears to be very difficult to know which thread scalable_malloced a given chunk. I guess this is because the memory was allocated by a parallel tbb::filter, which could be mapped to any available thread by the scheduler.

I restructured the code to allocate in a serial filter and sure enough the memory usage is now just 12% under load. Maybe the allocator is getting around to reusing the memory more frequently. I must admit I have not tried very hard to isolate the problem I reported earlier.

I don't yet know if this has given me any benefits. I am wary of allocating in a serial filter because of my incomplete understanding of how the cache works. If you allocate in a serial filter, does it not mean that the memory is pulled into the same cache every time? This appears to be a waste, because the actual work is done by a series of parallel filters, which will pull it into a different cache (of another CPU core) in short order. On the other hand, if I allocate in a parallel filter, the memory will be pulled directly into the cache where the work happens.

Thanks,
RafSchietekat
Valued Contributor III
"I gave up the refurbishing idea, because it appears to be very difficult to know which thread scalable_malloced a given chunk. I guess this is because the memory was allocated by a parallel tbb::filter, which could be mapped to any available thread by the scheduler."
I meant "refurbish" for another purpose, to entirely avoid a free/malloc detour, as opposed to "returning" to the original thread. Well, maybe only an object deserves use of this word, and memory would be just "reused".

"I restructured the code to allocate in a serial filter and sure enough the memory usage is now just 12% under load. Maybe the allocator is getting around to reusing the memory more frequently. I must admit I have not tried very hard to isolate the problem I reported earlier."
So you weren't using a separate user thread before? Hmm... if the other filters are parallel, the data item flows through the pipeline uninterruptedly (last time I looked at the code, anyway, as this is not guaranteed), and would be freed in the same task execution, which means in the same thread. Maybe that's another thing that changed in the restructuring?

"I dont yet know if this has given me any benefits. I am wary of allocating in a serial filter because of my incomplete understanding of how the cache works. If you allocate in a serial filter, does it not mean that the memory is pulled into the same cache every time ? This appears to be a waste because the actual work is done by a series of parallel filters which will pull it into a different cache (of another CPU core) in short time. On the other hand, if I allocate in parallel filter, the memory will be directly pulled into the cache in which the work happens."
It may seem paradoxical, but looking across time each item in a serial filter is likely to be processed in a different thread, and if you follow the item through the pipeline it stays in the same thread if it only moves to parallel filters. So your reasoning is correct, but the assumption was wrong.
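
If you want to observe this yourself, a small experiment along these lines (classic tbb::pipeline API; the token count and filters are illustrative, and the unsynchronised output may interleave) prints which thread runs the serial input filter and which runs the downstream parallel filter:

```cpp
// Prints the thread that handles each token in a serial input filter
// vs. a downstream parallel filter (illustrative experiment).
#include <iostream>
#include "tbb/pipeline.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/tbb_thread.h"

class InputFilter : public tbb::filter {
    int remaining;
public:
    InputFilter() : tbb::filter(serial_in_order), remaining(16) {}
    void* operator()(void*) {
        if (remaining-- <= 0) return 0;  // NULL ends the stream
        std::cout << "serial input on thread "
                  << tbb::this_tbb_thread::get_id() << "\n";
        return (void*)1;                 // dummy token
    }
};

class WorkFilter : public tbb::filter {
public:
    WorkFilter() : tbb::filter(parallel) {}
    void* operator()(void* item) {
        std::cout << "parallel work on thread "
                  << tbb::this_tbb_thread::get_id() << "\n";
        return item;
    }
};

int main() {
    tbb::task_scheduler_init init;
    InputFilter input;
    WorkFilter work;
    tbb::pipeline p;
    p.add_filter(input);
    p.add_filter(work);
    p.run(4);                            // up to 4 tokens in flight
    return 0;
}
```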

(Added 2009-09-12) See text.
Vivek_Rajagopalan
Quoting - Raf Schietekat
"I restructured the code to allocate in a serial filter and sure enough the memory usage is now just 12% under load. Maybe the allocator is getting around to reusing the memory more frequently. I must admit I have not tried very hard to isolate the problem I reported earlier."
So you weren't using a separate user thread before? Hmm... if the other filters are parallel, the data item flows through the pipeline uninterruptedly (last time I looked), and would be freed in the same task, which means in the same thread. Maybe that's another thing that changed in the restructuring?


I was using a separate user thread. I had hosted a pipeline in this thread (a tbb_thread, if that makes a difference), and the memory was being allocated by various parallel filters. The memory was being freed in another tbb_thread by the pipeline stages in that thread. I must apologize for not spending enough effort on tracking down the issue after reporting it in this forum. I moved some allocations from the parallel stage to the serial stage and the memory issues seemed to go away.


Here is my setup in all its ugliness :-)

Background:
Actually, this is based on an open source project I had launched earlier (not task based) called Trisul Network Metering and Forensics (http://code.google.com/p/trisul/). I completely knocked down the thread-based approach of that project and am rewriting it as task/event based, so the project is dead in its current form. I just could not get over the fact that 3 of my 4 cores were just relaxing while the other core was at 100%. I tried some flow-pinning techniques using pthreads, but did not like the results or the intricate synchronization stuff.


TBB Thread 1: PIPELINE 1

1. Filter 1 (Serial): Input packets are read off the wire (using special hardware if available, or libpcap) and batched into groups of 100-200 packets, up to about 1MB of total payload.

2. Filter 2 (Parallel): Uses an instance of a framework to decode the packets' protocols.

3. Filter 3 (Parallel): Uses an instance of a metering framework to count. The net result of this is a set of messages that update counters for various things like IP addresses, TCP flows, etc. This could be intensive. scalable_malloc happens here.

4. Filter xx (Parallel): Several filters that perform various deep inspections of the payload.

5. Filter 4 (Parallel): Figures out if the packet needs to be saved (e.g. for forensic purposes). If yes, applies block encryption to the payload and passes it along.

6. Filter 5 (Parallel): Some more magic that takes the set of messages and compresses them. Outputs the messages to a concurrent_queue.

7. Filter 6 (Serial): All packets that need stateful handling (marked as such by a parallel stage) get processed here. Examples: IP fragment reassembly, TCP flow construction, VOIP, etc. * This is a known bottleneck, because essentially this stage is single-threaded.

8. Filter 7 (Serial): Saves the packets marked as such for forensic purposes. (Could be another bottleneck.)


(After this step, further stages work on the messages, not on the packet data, so I decided to use another pipeline that can handle the very different work profile.) The only way I could figure out how to do this was to use another tbb_thread and put a pipeline inside it. So there it was: thread 2 / pipe 2. A condensed sketch of the whole layout follows the second list.

TBB Thread 2: PIPELINE 2

1. Filter 1 (Serial): Reads messages from the concurrent queue. Generates work tokens based on the target data structures to be updated. We can figure this out by looking at the messages coming out of the queue.

2. Filter 2 (Parallel): Carries out the various operations contained in the command messages. Most of these actions update data structures. Some of them generate additional messages. **scalable_free** of the messages happens here.

3. Filter 3 (Parallel): Carries out some calculations and summarization.

4. Filter 4 (Serial): Does not do much, actually. Just a way to keep track of tokens exiting. Occasionally there needs to be a pruning of data structures. When it is time to do that, the input stage (1) will check with this stage to confirm that the pipeline is empty and it is safe to prune. See this thread http://software.intel.com/en-us/forums/showthread.php?t=68114 for more.
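
To make the shape of this clearer, here is a heavily condensed skeleton of the two-thread / two-pipeline layout, with all the real work elided (names are illustrative):

```cpp
// Condensed skeleton of the layout above: two tbb_threads, each
// hosting a pipeline, joined by a concurrent queue of messages.
// All filters and real work are elided; names are illustrative.
#include "tbb/pipeline.h"
#include "tbb/tbb_thread.h"
#include "tbb/concurrent_queue.h"
#include "tbb/task_scheduler_init.h"

struct Message;                                     // produced by pipe 1, consumed by pipe 2
tbb::concurrent_bounded_queue<Message*> message_q;  // joins the two pipelines

void run_packet_pipeline() {                        // hosted in TBB thread 1
    tbb::task_scheduler_init init;
    tbb::pipeline p;
    // p.add_filter(...): serial capture; parallel decode, meter,
    // inspect, encrypt, compress (pushes Message* into message_q);
    // serial stateful handling; serial save.
    // p.run(ntokens);
}

void run_message_pipeline() {                       // hosted in TBB thread 2
    tbb::task_scheduler_init init;
    tbb::pipeline p;
    // p.add_filter(...): serial pop from message_q; parallel update
    // and summarize (scalable_free of messages happens here); serial
    // token bookkeeping.
    // p.run(ntokens);
}

int main() {
    tbb::tbb_thread t1(run_packet_pipeline);
    tbb::tbb_thread t2(run_message_pipeline);
    t1.join();
    t2.join();
    return 0;
}
```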


Quoting - Raf Schietekat
It may seem paradoxical, but looking across time each item in a serial filter is likely to be processed in a different thread, and if you follow the item through the pipeline it stays in the same thread if it only moves to parallel filters. So your reasoning is correct, but the assumption was wrong.

Wow!! That is quite a relief to know. Now that you have told me this, it seems logical that a serial filter switches threads too. My new understanding is: "The guarantee TBB provides is that only one instance of a serial filter will execute at a time. TBB does not guarantee that it will execute on any specific thread." Is this correct? I had the wrong mental picture of a factory with lines of conveyor belts, with the serial filter sitting at one of them and distributing tokens to its own and other belts.


I really appreciate your help, Raf. There are very few experts out there to ask for help.


RafSchietekat
Valued Contributor III
My new understanding is: "The guarantee TBB provides is that only one instance of a serial filter will execute at a time. TBB does not guarantee that it will execute on any specific thread." Is this correct?
You provide the only filter instances that ever exist (they may be noncopyable). Only one item at a time will be processed by a serial filter, and not necessarily on the same thread as its predecessor.

(Added) I myself would try to integrate the pipelines, perhaps with scalable_malloc memory only moving to parallel filters.