jind
Beginner
44 Views

Huge memory usage in recently-converted-to-TBB code

Hi,

I am using a pipeline, and one of the filters does a bunch of STL map finds, erases and inserts. It doesn't explicitly allocate any new heap memory.

Here is the weird thing: for some reason, this filter leads to huge memory usage, enough so that my code cannot run to completion. When I comment out the function that does all the map stuff, everything is fine.

So the question is, is there any kind of funkiness (different memory model or allocator) that happens under the hood with TBB that I should be thinking about?

Thanks,
John
15 Replies
Alexey_K_Intel3
Employee

If you do not explicitly specify that the map should use one of the allocators provided by TBB, and if you do not globally substitute the standard memory allocation routines (malloc/free or new/delete) with their TBB analogues, then std::map objects use the regular memory allocation mechanisms.
Dmitry_Vyukov
Valued Contributor I

Quoting - jind
I am using a pipeline, and one of the filters does a bunch of STL map finds, erases and inserts. It doesn't explicitly allocate any new heap memory.

Here is the weird thing: for some reason, this filter leads to huge memory usage, enough so that my code cannot run to completion. When I comment out the function that does all the map stuff, everything is fine.

So the question is, is there any kind of funkiness (different memory model or allocator) that happens under the hood with TBB that I should be thinking about?


What do you mean by huge memory consumption?
If memory consumption of the single-threaded application is X, and you have N threads, then memory consumption of X*N for the multi-threaded application is OK and expected. If memory consumption is far above X*N, then it's odd.

Your run-time/OS memory allocator may do some per-thread caching of memory which may explain increased memory consumption.

jind
Beginner

I've narrowed it down a bit:

My filter class maintains a couple of std::maps, and each call to operator() may result in an insert, erase, or change of some elements in these maps.

These maps are relatively small (say, 5 MB altogether), yet increasing the number of threads in tbb::task_scheduler_init from 1 to 2 increases memory usage by 50 MB.

Can you please explain in a little more detail (or point me somewhere that can) what per-thread memory behavior I should expect? Surely each thread doesn't maintain a full local copy of the filter object?

What is more worrisome is that the size of the maps stays roughly the same (i.e., the inserts are roughly balanced out by the erases) as tokens are processed, yet memory usage grows with the number of tokens processed.



Quoting - Dmitriy Vyukov

What do you mean by huge memory consumption?
If memory consumption of the single-threaded application is X, and you have N threads, then memory consumption of X*N for the multi-threaded application is OK and expected. If memory consumption is far above X*N, then it's odd.

Your run-time/OS memory allocator may do some per-thread caching of memory which may explain increased memory consumption.


jind
Beginner

Okay, now it is *really* narrowed down. I can reproduce what I'm seeing using the following minimum processing for each item:

const std::map<Id, Item>::iterator p = m.find(item.id());
if (p != m.end()) {
    m.erase(p);
}
m.insert(std::make_pair(item.id(), item));



If I comment out everything but the last line (i.e., do not erase before inserting), memory usage goes way, way down.

Any ideas? I'm completely stumped.
robert-reed
Valued Contributor II

Quoting - jind
My filter class maintains a couple of std::maps, and each call to operator() may result in an insert, erase, or change of some elements in these maps.

These maps are relatively small (say, 5 MB altogether), yet increasing the number of threads in tbb::task_scheduler_init from 1 to 2 increases memory usage by 50 MB.

Tell us more about the filter class. Could there be multiple copies of the filter (one per pool thread) playing havoc with non-thread-safe access to the std::maps?
jind
Beginner

I've also confirmed that the same behavior results when the value in the map is not an Item, but (say) a double or other simple data type. The only difference is that the amount of memory in the "erase-first" case goes down, but it is still significantly higher than the "no-erase-first" case.

Quoting - jind
Okay, now it is *really* narrowed down. I can reproduce what I'm seeing using the following minimum processing for each item:

const std::map<Id, Item>::iterator p = m.find(item.id());
if (p != m.end()) {
    m.erase(p);
}
m.insert(std::make_pair(item.id(), item));



If I comment out everything but the last line (i.e., do not erase before inserting), memory usage goes way, way down.

Any ideas? I'm completely stumped.

jind
Beginner

The filter class is declared as a serial_in_order filter, and has a private std::map. My (admittedly limited) understanding of filters is that a serial_in_order filter need not worry about thread safety in this case.

However, as a test, I added a mutex to the class and locked it before doing the map erase and update, with the same results...


Tell us more about the filter class. Could there be multiple copies of the filter (one per pool thread) playing havoc with non-thread-safe access to the std::maps?

jind
Beginner

Another tidbit: if I replace the find/erase construct above with the following, memory usage goes up a *lot* more. It seems that the call to erase(), whether or not the item exists, is the source of the problem...


Quoting - jind
m.erase(m.find(item.id()));
m.insert(std::make_pair(item.id(), item));


jind
Beginner

I also just tried using concurrent_hash_map in place of std::map, and got similar results. To give an idea, without the erase, the program grew to about 100MB, whereas with it, it was over 250MB. I am literally commenting out just one line to get that difference!

RafSchietekat
Black Belt

How about a small but self-contained program that reproduces the problem...
jind
Beginner

As I was working on cutting things down to the simplest reproducing case, I tried replacing the second (map-add) filter with a dummy serial_in_order filter that simply passes the token along. Here is a table showing memory usage for the input filter alone, and the effect of adding the dummy filter for 1, 2, and 8 threads:


                 N_THREADS
FILTERS           1      2      8
input             98     105    105
input + dummy     110    445    489

(memory usage in MB)

Not sure if this is relevant, but even though both filters are serial_in_order and the pipeline is running with a max of one token, CPU usage reflects the number of threads when I add the dummy, but is pegged at one when I don't.

Does this suggest anything I could be missing?
Alexey_K_Intel3
Employee

Quoting - jind
Not sure if this is relevant, but even though both filters are serial_in_order and the pipeline is running with a max of one token, CPU usage reflects the number of threads when I add the dummy, but is pegged at one when I don't.

Does this suggest anything I could be missing?

A single serial filter means no parallelism, so the pipeline just drains the input in a serial loop.
Your second setup also allows no parallelism, but this case is currently not recognized as such. So the threads stay alive and actively look for work. And it seems the pipeline spawns new tasks regularly, preventing the worker threads from falling asleep.
jind
Beginner

Why would the second scenario (two serial_in_order filters, only one token in the pipeline at a time) not be recognized as serial? Is it something I am doing?


A single serial filter means no parallelism, so the pipeline just drains the input in a serial loop.
Your second setup also allows no parallelism, but this case is currently not recognized as such. So the threads stay alive and actively look for work. And it seems the pipeline spawns new tasks regularly, preventing the worker threads from falling asleep.

jind
Beginner

Apologies for not having narrowed down the issue as tightly as possible before starting this thread. It appears that the memory usage is coming from a concurrent_queue that is getting backed up, and never releasing the memory back to the OS. This was masked by a number of factors, some of which stemmed from my own efforts to narrow down the problem (I was actually making it worse).

That said, I'm not sure I understand how the concurrent_queue deals with memory, but I started a new topic on that since it is different enough from this thread's topic...
RafSchietekat
Black Belt

#13 "Why would the second scenario (two serial_in_order filters, only one token in the pipeline at a time) not be recognized as serial? Is it something I am doing?"
No particular reason. Currently one trivial situation is trivially optimised. It seems easy enough to go a little bit further just to avoid this very question... although you should realise that such a pipeline would obviously feel very unappreciated.

#14 "Apologies for not having narrowed down the issue as tightly as possible before starting this thread. It appears that the memory usage is coming from a concurrent_queue that is getting backed up, and never releasing the memory back to the OS."
Then I guess it's the high-water mark behaviour of the scalable memory allocator rather than any problem with the queue. Don't worry, be happy: the memory will most likely be reused. Please consult previous discussions for further information (maybe a FAQ entry could be dedicated to this?).