Intel® oneAPI Threading Building Blocks

TBB Malloc memory consumption always rising in our application

danlavoie
Beginner

Hi,

First of all, I must say that I am really enjoying the TBB library. I modified a few of our single-threaded algorithms to take advantage of the parallel_for and parallel_sort constructs, and I am really impressed by the results, especially on 8- and 16-core servers.
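
For illustration, here is the kind of change I made (a simplified, self-contained sketch, not our actual code):

    #include <vector>
    #include "tbb/parallel_for.h"
    #include "tbb/parallel_sort.h"
    #include "tbb/blocked_range.h"

    // Body object applying the per-element work over a sub-range.
    struct Square {
        std::vector<double>& data;
        explicit Square(std::vector<double>& d) : data(d) {}
        void operator()(const tbb::blocked_range<size_t>& r) const {
            for (size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= data[i];
        }
    };

    void process(std::vector<double>& data) {
        // The former serial loop, now split across cores by TBB.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, data.size()), Square(data));
        // The former std::sort call, replaced by the parallel version.
        tbb::parallel_sort(data.begin(), data.end());
    }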

I also decided to use TBB Malloc as our default allocator. I read the technical article, as well as the source code, to better understand how it works and what we should expect in terms of speedup on many cores, etc.

I know that memory is never returned to the OS, as that would require locking and thus would remove most if not all of the scalability benefits. However, in our server application, I see a constant rise in memory consumption, and unfortunately, even on 64-bit systems, it can lead to an out-of-memory condition (without TBB, our application in my test setup uses 1 to 1.2 GB of memory; with TBB, it goes over 2 GB and keeps rising).

I looked at the code to try to understand what we were doing that could possibly be causing TBB Malloc to enter a pattern where memory was not reused effectively.

As I understand it, TBB Malloc tries to allocate memory from the TLS structures first, then from the publicly freed objects, and lastly from other threads' structures. Our application has a lot of threads running at the same time, over 100 in a typical setup. Also, memory is often allocated in one thread and freed in another. Finally, threads are created and destroyed from time to time. Is it possible that TBB Malloc favors allocating locally, but, to avoid locking when freeing from another thread, favors releasing memory to the public list? Is it possible that this list is seldom reused with our way of allocating and destroying objects, and that it becomes a huge list of allocations? Finally, is there a way to know at runtime, besides recompiling with statistics?

I will try to analyze our pattern of allocations better, but any help or tips about TBB Malloc's inner workings in this situation would be greatly appreciated.

Thank you for your time,

Daniel Lavoie

Alexey-Kukanov
Employee
Hi Daniel, I am glad to answer your questions.

> As I understand it, TBB Malloc tries to allocate memory from the TLS structures first, then from the publicly freed objects, and lastly from other threads' structures.
Not fully correct. Every thread allocates from its thread-local structures, and only from there. A piece of memory freed in a different thread than the one it was allocated in is returned back to the owning thread.
If a thread runs out of the 16K slabs that serve its allocation requests, it takes another one: first looking through slabs abandoned by threads that have exited, then looking at the pool of unused slabs, and only then requesting that more memory be mapped.
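
To illustrate the cross-thread free path, a minimal self-contained sketch using the scalable allocator's C interface (the reuse comment is the expected behavior, not a guarantee):

    #include <thread>
    #include "tbb/scalable_allocator.h"

    // Thread A allocates, thread B frees: the block is returned to A's
    // heap, and A may reuse that memory on its next same-size allocation.
    int main() {
        void* p = scalable_malloc(128);            // owned by the main thread
        std::thread t([p] { scalable_free(p); });  // remote free from another thread
        t.join();
        void* q = scalable_malloc(128);            // may reuse the returned block
        scalable_free(q);
        return 0;
    }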

> Finally, threads are created and destroyed from time to time.
Until recently, there was a leak of ~400-800 bytes of service memory per thread. So if a lot (tens of thousands) of threads are created and destroyed during the lifetime of the process, the total leak grows rather big. It was recently fixed, though I do not remember exactly which package the fix is in; I think it should be in the latest commercial-aligned release, or at least in the 1-2 most recent development updates.

> Is it possible that TBB Malloc favors allocating locally, but, to avoid locking when freeing from another thread, favors releasing memory to the public list?
All of that is true (and it is even stronger than that, as I described above), but I do not see why it would be a problem.

> Is it possible that this list is seldom reused with our way of allocating and destroying objects, and that it becomes a huge list of allocations?
Since the public lists are searched before more memory is requested from the OS, that should not be the case.

> Finally, is there a way to know at runtime, besides recompiling with statistics?
No. I am not sure there should be, but possibly we should provide a pre-compiled statistics version.

My best guess is that the leak of service memory might be your problem. Since it is already fixed, that one would be rather easy to resolve :)

danlavoie
Beginner
Hi Alexey,
Thank you for your answer. Unfortunately, we already run the latest build, which has the fix for the memory leak.
I will continue to investigate the issue on our side and will come back if I have other questions.
Have a nice day,
Daniel

Alexey-Kukanov
Employee
Do you know the typical (i.e. most frequent) allocation sizes for your application? In particular, a lot of 8K+ allocations might cause the problem, due to a known issue of excessive padding added by the TBB allocator for these sizes.
Dmitry_Vyukov
Valued Contributor I
Quoting - danlavoie

I know that memory is never returned to the OS, as that would require locking and thus would remove most if not all of the scalability benefits.

There is no problem with returning excess memory, unused for a long time, back to the OS. This must be done infrequently and on a page basis; it is impossible to do on a per-object basis anyway (so the locks cannot be frequent).
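
For example, a minimal POSIX sketch of page-granularity release (illustrative only, not tbbmalloc code):

    #include <cstddef>
    #include <sys/mman.h>

    // Hand a range of whole, currently unused pages back to the OS while
    // keeping the address range reserved; the next touch faults in fresh
    // zeroed pages. Called rarely, e.g. from a periodic trimming pass, so
    // it adds no lock on the hot allocation/free paths.
    void release_idle_pages(void* start, std::size_t bytes) {
        madvise(start, bytes, MADV_DONTNEED);
    }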

Dmitry_Vyukov
Valued Contributor I
Quoting - danlavoie

However, in our server application, I see a constant rise in memory consumption, and unfortunately, even on 64-bit systems, it can lead to an out-of-memory condition (without TBB, our application in my test setup uses 1 to 1.2 GB of memory; with TBB, it goes over 2 GB and keeps rising).

It's perfectly OK that a scalable solution requires 2x or 4x or 8x memory. On a 16-core machine, a scalable solution HAS TO consume 16x the memory!

How many programmers do you have in the company? Let's say 16. How many workplaces (computer, desk, chair, network link, etc.) do you need for them? 1 or 16? Think about it! It's not enough to create 16 threads to achieve scalability; you need 16 of everything.

Alexey-Kukanov
Employee
Quoting - Dmitriy V'jukov
It's perfectly OK that a scalable solution requires 2x or 4x or 8x memory. On a 16-core machine, a scalable solution HAS TO consume 16x the memory!

For frequently modified objects, yes; also for temporary storage, etc. Read-only or infrequently modified objects can be shared without affecting scalability, though of course enough diligence should be applied to do the sharing right (from both correctness and performance perspectives) and not to overuse it.

Dmitry_Vyukov
Valued Contributor I
Quoting - danlavoie
I looked at the code to try to understand what we were doing that could possibly be causing TBB Malloc to enter a pattern where memory was not reused effectively.

As I understand it, TBB Malloc tries to allocate memory from the TLS structures first, then from the publicly freed objects, and lastly from other threads' structures. Our application has a lot of threads running at the same time, over 100 in a typical setup. Also, memory is often allocated in one thread and freed in another. Finally, threads are created and destroyed from time to time. Is it possible that TBB Malloc favors allocating locally, but, to avoid locking when freeing from another thread, favors releasing memory to the public list? Is it possible that this list is seldom reused with our way of allocating and destroying objects, and that it becomes a huge list of allocations? Finally, is there a way to know at runtime, besides recompiling with statistics?

TBB caches memory on a per-thread AND on a per-object-size basis.

Assume you allocate 100 objects of size 128 in thread 1, then free them. Then allocate 100 objects of size 256 in thread 1, then free them. Then allocate 100 objects of size 128 in thread 2, then free them. Then allocate 100 objects of size 256 in thread 2, then free them.

With a plain non-caching allocator, peak memory consumption is 100*256.

With the TBB allocator, total memory consumption is 100*128 + 100*256 + 100*128 + 100*256, because each thread keeps its own cache for each size.

Now assume you have 100 threads and 10 object sizes. Yeah, it trades space for speed :)

Well, caching on a per-object-size basis is a pure single-threaded performance optimization, but caching on a per-thread basis is unavoidable, because it's the only way to achieve scalability.
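
To make the arithmetic concrete, here is a self-contained sketch of the pattern using the allocator's C interface:

    #include <cstddef>
    #include <thread>
    #include "tbb/scalable_allocator.h"

    // Allocate 100 objects of the given size, then free them all.
    // The freed blocks stay cached in THIS thread's bin for THIS size;
    // another thread (or another size class) cannot directly reuse them.
    void churn(std::size_t size) {
        void* p[100];
        for (int i = 0; i < 100; ++i) p[i] = scalable_malloc(size);
        for (int i = 0; i < 100; ++i) scalable_free(p[i]);
    }

    int main() {
        std::thread t1([] { churn(128); churn(256); });
        t1.join();
        std::thread t2([] { churn(128); churn(256); });
        t2.join();
        // Each (thread, size) pair keeps its own cache, so the footprint
        // approaches 100*128 + 100*256 + 100*128 + 100*256, not 100*256.
        return 0;
    }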

Dmitry_Vyukov
Valued Contributor I

For frequently modified objects, yes; also for temporary storage, etc. Read-only or infrequently modified objects can be shared without affecting scalability, though of course enough diligence should be applied to do the sharing right (from both correctness and performance perspectives) and not to overuse it.

Totally agree!

For example, if the main data structure in a server is a BIG, very frequently updated table, then it's possible to achieve scalability with ~1x memory consumption (compared to a non-scalable solution). It's a great scenario! Read-only data must be shared.

But a general rule of thumb is: if you add one more programmer to your team, buy one more computer for him! :)

RafSchietekat
Valued Contributor III

"With TBB allocator total memory consumption is 100*128 + 100*256 + 100*128 + 100*256." Memory blocks do get refurbished.

Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat

"With TBB allocator total memory consumption is 100*128 + 100*256 + 100*128 + 100*256." Memory blocks do get refurbished.

What do you mean by "refurbished" here? Is it intended to affect TBB memory consumption somehow?

RafSchietekat
Valued Contributor III
Quoting - Dmitriy V'jukov
Quoting - Raf Schietekat

"With TBB allocator total memory consumption is 100*128 + 100*256 + 100*128 + 100*256." Memory blocks do get refurbished.

What do you mean by "refurbished" here? Is it intended to affect TBB memory consumption somehow?

returnEmptyBlock()

Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat

returnEmptyBlock()

I see. Well, yes, it can improve the situation to a certain extent... or maybe not. First, a thread always caches at least 2 blocks. Second, assume that we allocate 100 objects and then free only 99 of them: a single surviving object keeps its whole block from ever being returned.

What I want to say is that memory consumption will definitely be higher.

Dmitry_Vyukov
Valued Contributor I
Quoting - danlavoie

I know that memory is never returned to the OS, as that would require locking and thus would remove most if not all of the scalability benefits. However, in our server application, I see a constant rise in memory consumption, and unfortunately, even on 64-bit systems, it can lead to an out-of-memory condition (without TBB, our application in my test setup uses 1 to 1.2 GB of memory; with TBB, it goes over 2 GB and keeps rising).

You can also check out the following thread:

http://software.intel.com/en-us/forums/showthread.php?t=61716

There I describe a pattern which can cause unbounded memory consumption in the TBB allocator.

sadbhaw
Beginner

Hi Daniel,

Did you find any solution to your problem?

We think we are also facing the same "unbounded memory growth".

Here are the details.

We just started testing TBB for our application.

We are experiencing the unbounded memory growth as explained. The application has a thread pool (around 100+ threads); it allocates memory in many threads, and all the memory pointers are stored in one place. At the end of a transaction, it goes and deletes all the memory from the stored location.

We have traffic of 4-5 transactions/sec.

When we enabled TBB, the application started at 4 GB, then grew to 30 GB and crashed. It took almost 250K transactions for it to crash.

Without TBB, the same test with the standard allocator took 15 GB and is running fine.

Thank you,

Sadbhaw

Alexey-Kukanov
Employee
Sadbhaw,

Let me ask you some questions to collect more information about your problem.

So, you are building an application that serves some requests (transactions), as frequently as 4-5 per second. It uses a thread pool for this (some custom implementation, I assume?); the pool contains 100 threads or even more.

Some questions:
- are all 100 threads alive all the time, or might some die and others be created?
- is every request served solely by one thread, by all threads in the pool, or just by some? Given the ratio of transactions to the size of the pool, I assume single-thread processing is unlikely.
- can different transactions be served simultaneously? Or, to put it another way, does a single transaction require 15 GB of memory, or is this the cumulative size for a few?
- do you know what order of magnitude a usual memory request size is? E.g., mostly small objects (1K and less), or bigger objects, or maybe really huge ones?
- are all threads allocating memory, or just some? Is it similar for most transactions, or does it vary irregularly? In other words, after a transaction completes and all memory is freed, will the same threads allocate a similar set of objects for the next transaction?
- you said that all memory allocated to serve a transaction is put into a global container and freed at once at the end of the transaction. Is deallocation done by a single thread, or by multiple threads simultaneously?
- if you have a test case that shows the problem on a smaller-size machine, I'd be glad to look at it.

Sorry for the many questions, but it is necessary to understand what might cause the problem you see.

sadbhaw
Beginner

Hi Alexey,

First of all thank you for your reply.

Here are the answers to your questions...

- are all 100 threads alive all the time, or might some die and others be created?

-- one thread is created for each transaction, which would then create tasks to be serviced by threads in the pool; also, a few threads are dedicated to background tasks, e.g. cache updates, logging, etc.

- is every request served solely by one thread, by all threads in the pool, or just by some? Given the ratio of transactions to the size of the pool, I assume single-thread processing is unlikely.

-- there is a thread dedicated to the transaction, which potentially uses all threads in the pool

- can different transactions be served simultaneously? Or, to put it another way, does a single transaction require 15 GB of memory, or is this the cumulative size for a few?

-- yes, different transactions can be served simultaneously. The 15 GB is accumulated over time (24 hrs at 4-5 trx/sec)

-- there may be up to 16 transactions active at a time

- do you know what order of magnitude a usual memory request size is? E.g., mostly small objects (1K and less), or bigger objects, or maybe really huge ones?

-- it varies a lot, from smaller ones (1K and less) to bigger ones (more than 8K), but NOT huge ones

- are all threads allocating memory, or just some? Is it similar for most transactions, or does it vary irregularly? In other words, after a transaction completes and all memory is freed, will the same threads allocate a similar set of objects for the next transaction?

-- all transactions are similar to each other

- you said that all memory allocated to serve a transaction is put into a global container and freed at once at the end of the transaction. Is deallocation done by a single thread, or by multiple threads simultaneously?

-- each thread may allocate and free memory during the course of processing a task, although the bulk of the memory allocated to process a transaction is freed at the end of the transaction. As well, processing a request can cause data to be read from the database and put in the cache for use in future transactions

Thank you,
Sadbhaw

sadbhaw
Beginner

Hi Alexey,

Adding to the above answers:

The memory is allocated in different threads (generally during the course of a transaction) and freed in a different thread (at the end of the transaction).

Thank you,

Sadbhaw

robert_jay_gould
Beginner
Quoting - sadbhaw

The memory is allocated in different threads (generally during the course of a transaction) and freed in a different thread (at the end of the transaction).

Just a guess, but this looks exactly like the common producer (factory)-consumer pattern I mentioned that could make the TBB allocator blow up. Because a different thread frees the memory, the system will never release anything, as pointed out in the "possible allocator problem" thread. The lack of any explicit mention of this is a problem, though.

Sadbhaw, could you try returning the dead objects back to the creator thread in a source-sink fashion? This way the memory will (hopefully) get properly recycled.
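
Something along these lines (a sketch with hypothetical types, untested):

    #include "tbb/concurrent_queue.h"

    struct Transaction { /* application data; placeholder type */ };

    // One recycle queue, owned by the producer (creator) thread.
    tbb::concurrent_queue<Transaction*> recycle_bin;

    // Consumer side: instead of deleting, hand the dead object back.
    void release(Transaction* t) { recycle_bin.push(t); }

    // Producer side: reuse a recycled object when one is available, so the
    // memory is effectively allocated and reclaimed by the same thread.
    Transaction* acquire() {
        Transaction* t = 0;
        if (recycle_bin.try_pop(t)) return t;  // reset/reinitialize as needed
        return new Transaction();
    }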

Alexey-Kukanov
Employee
Just a guess, but this looks exactly like the common producer (factory)-consumer pattern I mentioned that could make the TBB allocator blow up. Because a different thread frees the memory, the system will never release anything, as pointed out in the "possible allocator problem" thread.

I do not agree. If each subsequent transaction is served by roughly the same set of threads, which allocate objects of roughly the same size(s), then the memory freed in the previous step will be reused, no matter whether it was freed locally or remotely. The problem that Dmitry outlined in that "possible allocator problem" thread only happens when a thread no longer allocates memory of a given size, and so never reclaims remotely freed memory of that size. I believe this is not the case in the discussed setup, since every transaction requires roughly the same memory as the previous one, so over time memory should be reclaimed more or less fully.

Sadbhaw, could you try returning the dead objects back to the creator thread in a source-sink fashion? This way the memory will (hopefully) get properly recycled.

I might not understand what "source-sink fashion" means, but in the TBB allocator, deallocated objects are returned back to the creator thread every time. It's just up to the creator when to reclaim them. By the way, in that forum thread I proposed a possible way to further minimize the hoarding of unused blocks; nobody has commented on it yet.

As for Sadbhaw's problem, to me it looks more like the effect of the excessive padding for 8K+ sizes, which is another known issue.

RafSchietekat
Valued Contributor III

"As for Sadbhaw's problem, to me it looks more likely the effect of excessive padding for 8K+ sizes, which is another known issue." You might want to compare results with my patch in "Additions to atomic" (use version 20081012, not the latest one), where I have smuggled in some modifications to the scalable allocator that should somewhat mitigate such excessive padding.
