Re: Memory allocator efficiency?

uj · ‎12-28-2008

Andrei Alexandrescu, in Modern C++ Design, states that "For occult reasons, the default allocator is notoriously slow". :) Then, to overcome some of this inefficiency, he continues with the design of a small object allocator.

The Alexandrescu allocator (available as part of the Loki open source library) seems to work as a Singleton, so memory allocations will be global. Now, if I've got it right, this is exactly what the TBB allocator is designed to avoid; the whole purpose of the TBB allocator is to make allocations on a per-thread basis.
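To make the global versus per-thread distinction concrete, here is a minimal C++ sketch of a fixed-size pool in which every thread keeps its own free list, so the common path takes no lock. It is only an illustration under stated assumptions: `ThreadLocalPool`, `reuses_block_on_same_thread` and `count_threads_with_reuse` are hypothetical names, and this is neither Loki's nor TBB's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <thread>
#include <vector>

// Hypothetical fixed-size pool: one free list per thread, no locking.
// An illustration of the per-thread idea only, not TBB or Loki code.
template <std::size_t BlockSize>
class ThreadLocalPool {
    struct Node { Node* next; };
    static_assert(BlockSize >= sizeof(Node), "block must hold a free-list link");

    struct FreeList {
        Node* head = nullptr;
        // Return cached blocks to the global heap when the thread exits.
        ~FreeList() {
            while (head) {
                Node* n = head;
                head = n->next;
                ::operator delete(n);
            }
        }
    };

    static FreeList& local() {
        thread_local FreeList list;  // each thread gets its own instance
        return list;
    }

public:
    static void* allocate() {
        FreeList& fl = local();
        if (fl.head) {               // fast path: reuse a cached block
            Node* n = fl.head;
            fl.head = n->next;
            return n;
        }
        return ::operator new(BlockSize);  // slow path: global heap
    }

    // Simplification: a block freed on another thread just joins that
    // thread's list instead of going back to its original owner.
    static void deallocate(void* p) {
        FreeList& fl = local();
        fl.head = new (p) Node{fl.head};
    }
};

// True if a block freed on this thread is handed out again by the next
// allocation on the same thread.
bool reuses_block_on_same_thread() {
    using Pool = ThreadLocalPool<16>;
    void* a = Pool::allocate();
    Pool::deallocate(a);
    void* b = Pool::allocate();
    bool same = (a == b);
    Pool::deallocate(b);
    return same;
}

// Runs the check on n threads and returns how many of them saw reuse.
int count_threads_with_reuse(int n) {
    std::vector<int> ok(n, 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i)
        threads.emplace_back([&ok, i] { ok[i] = reuses_block_on_same_thread() ? 1 : 0; });
    for (auto& t : threads) t.join();
    int total = 0;
    for (int v : ok) total += v;
    return total;
}
```

A Singleton pool would need a mutex around the same free list, and that lock is what threads end up contending on.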

My questions are:

1. Am I right in my suspicion that if the Alexandrescu allocator is global it will effectively nullify the purpose of the TBB allocator? So using them together is quite meaningless.

2. Is it possible to say something about how the TBB allocator compares with a typical C++ standard allocator? I understand that per-thread allocation is more efficient than global allocation in a multithreading environment, but this gain means little if it's then dwarfed by general allocation inefficiencies anyway.

3. Finally, would it be unthinkable that TBB supplied a small object allocator optimized for multithreading? Not necessarily as part of the core library but as part of some accompanying utility package or something. It would be a strong additional motivation for using TBB. For example, shared smart pointers would greatly benefit from it.

Thank you.

RafSchietekat · ‎12-28-2008

Is TBB's scalable allocator not to your liking? The "Additions to atomic" patch also offers smaller objects than 8 bytes, and less waste in bigger objects.

uj · ‎12-28-2008

Thank you for the links.

No, I'm not dissatisfied with the TBB memory allocator. On the contrary. Now that I know more about it I'm very impressed, actually. In fact, to me it looks like the TBB allocator alone motivates the use of the TBB package. -:)

Again, thanks for this information. I have totally stopped worrying about inefficient heap allocations now.

RafSchietekat · ‎12-28-2008

"I have totally stopped worrying about inefficient heap allocations now." You should only stop worrying when you're actually dead.

uj · ‎12-29-2008

Well okay, let's worry some more while there's still time -:) I'm a little worried that I don't understand the relevance of atomic (as you mention in your first post) in relation to heap allocations. I don't seem able to get that connection.

RafSchietekat · ‎12-29-2008

The heap bone's connected to the atomic bone, the atomic bone's connected to the architecture bone (somebody please stop me!), ... It was just more convenient for me to keep the two together while trying to port the scalable allocator to PA-RISC/HP-UX/aCC (different architecture) with some proposals to use memory more efficiently (different allocator implementation). It's not unlikely that you might be able to use the allocator changes separately from the rest of the patch, if you so desire, but you should only need to recompile anyway.

Just note that there are some allocation sizes, especially around 8 kB, that are not... well... 100% efficient yet in TBB, and you may find that the patch helps.

uj · ‎12-29-2008

So what you're saying is that there is no connection between the TBB memory allocator and atomic other than implementation convenience? Good. I thought for a moment I had overlooked some of the small print here.

RafSchietekat · ‎12-29-2008

"So what you're saying is that there is no connection between the TBB memory allocator and atomic other than implementation convenience?" I'm saying that "it's not unlikely" (on platforms that were already supported), but it was never a goal.

uj · ‎12-30-2008

Now I'm getting worried again. Is there or isn't there a connection between the memory allocator and atomic OTHER THAN IMPLEMENTATION CONVENIENCE? Are they somehow related? Can atomic, for example, be used to allocate memory, or to make memory allocation more efficient, or under some circumstances replace memory allocation altogether? I don't think so. I think they're totally unrelated, but because you introduced atomic into this thread and won't give a clear answer you're adding an element of uncertainty. Is there some usage of atomic that makes it relevant to the topic of memory allocation? Yes or no?

RafSchietekat · ‎12-30-2008

I saw some atomics-related code in the allocator that made me wonder why I had not changed it yet (which would have tied the allocator changes to the rest of the patch), but it is not unlikely that the allocator changes can still stand on their own (unless you're using a previously unsupported platform). You just have to try, if that is what you want, and if you encounter a problem I may be able to suggest an easy fix. But I just don't know myself, and I love all of my patch equally, so...

Anyway, the patch is meant to be a drop-in replacement (plus rebuild, but no user code changes required), so you don't have to commit to all of it just to validate the allocator changes.

Alexey-Kukanov · ‎12-30-2008

Quoting - uj

Is there some usage of atomic that makes it relevant to the topic of memory allocation? Yes or no?

Let me try adding some clarity.

First, you could just use the TBB memory allocator as it is; no patch is required, and it works.
However, the patch from Raf addresses some of its shortcomings: memory is used more efficiently for allocations of 4 bytes and less, and also for allocations of 8K and more. If you think it might make a difference for you, you might try his changes.
The changes are currently maintained as a part of a much bigger patch that significantly reworks atomic operations, as well as adds support for more platforms than in vanilla TBB code. Again, if it is important for you, go ahead and try it out.
Other than that, the TBB memory allocator is separate from the rest of TBB, and can be used completely independently of anything else from TBB.

uj · ‎12-30-2008

Okay, so there is no functional connection between the TBB allocator and atomic at all, BUT there exists a patch that contains updates to both. Well, I guess the improvements to the TBB allocator in that patch will make it into the main TBB distribution eventually. I'm in no hurry, so I can wait for that.

I've learned that the TBB allocator is much more powerful than I first thought when I started this thread. And I'm happy to see that it's being improved even further. I think many, maybe most, C++ applications based on the OO paradigm would benefit from a more efficient allocator.

Maybe the TBB allocator should be marketed more aggressively in its own right, even to developers who for the time being aren't interested in multithreading. Then, when the time comes and they do get interested, it's natural to stick with TBB and just start using the parallel stuff.

jimdempseyatthecove · ‎12-30-2008

I do not think the TBB allocator should be totally separated from the TBB scheduler. During some otherwise long blocking sections it might be beneficial to perform a stolen task. Under normal circumstances you would not experience a long blocking section (that is the reason behind all the good work put in there), but you might if an otherwise short lock section trips through a page fault and the swap file is busy or has to change size. In this case stealing tasks might be advised. The code could be written for conditional compilation, depending on whether it is being used together with the TBB task scheduler or not.

Jim Dempsey

Alexey-Kukanov · ‎12-31-2008

I am not sure how a user-space memory allocator that does not have any hooks into OS kernels could check for page faults. Well, intercepting signals might work, but I feel it would be too much additional complexity for the potential benefits. Moreover, as the TBB allocator does not touch the memory it returns, the page faults will mostly happen after returning from the allocator calls. As if the above were not enough, the memory allocation interface would need to be made asynchronous, or alternatively take a function to execute if a long wait is anticipated; both are rather unlike the usual malloc(). Last but not least, TBB itself also does not want to be tightly coupled with the allocator, because some users, for whatever reasons, might want to use their preferred allocator rather than the TBB one.

Maybe I just do not know something, and modern OSes provide a relatively convenient way to execute some code while waiting for a page load or any other blocking operation in the kernel?

jimdempseyatthecove · ‎12-31-2008

Let me rephrase this.

Thread A acquires a mutex.
Thread A enters a section of code (in the memory allocator) that normally takes a short time.
Thread A hits a page fault in that section, extending its hold on the mutex to tens or hundreds of milliseconds, or more.

In the meantime, just after thread A got the mutex, thread B attempted to acquire it.

At this point in time, wouldn't it be appropriate for thread B to jump into the TBB scheduler to perform task stealing (assuming appropriate tasks were available)?

An allocator that is completely isolated from the task scheduler would not be able to enter task-stealing mode; its only recourse would be to spin/yield/sleep (none of which is productive in advancing the application towards a solution).
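The scenario above can be sketched roughly as follows: thread B polls the mutex with `try_lock` and runs pending work whenever the lock is still held, instead of spinning. This is a sketch under assumptions, not how the TBB allocator or scheduler work internally: `PendingTasks` is a hypothetical stand-in for a task scheduler, and `lock_or_work` and `demo_tasks_run_while_blocked` are made-up names.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for a task scheduler: a plain queue of pending
// tasks. The real TBB scheduler is not exposed like this.
struct PendingTasks {
    std::mutex guard;
    std::deque<std::function<void()>> tasks;

    // Runs one pending task if there is one; returns false when empty.
    bool run_one() {
        std::function<void()> task;
        {
            std::lock_guard<std::mutex> lock(guard);
            if (tasks.empty()) return false;
            task = std::move(tasks.front());
            tasks.pop_front();
        }
        task();
        return true;
    }
};

// Acquires m, doing useful work instead of idling while it is held.
// Returns the number of tasks run while waiting; the caller owns m on
// return and must unlock it.
int lock_or_work(std::mutex& m, PendingTasks& pending) {
    int ran = 0;
    while (!m.try_lock()) {
        if (pending.run_one()) ++ran;
        else std::this_thread::yield();  // nothing to do: fall back to yielding
    }
    return ran;
}

// Thread A holds the mutex for a "long" time (until three tasks have
// completed); the calling thread plays thread B.
int demo_tasks_run_while_blocked() {
    std::mutex m;
    PendingTasks pending;
    std::atomic<int> done{0};
    std::atomic<bool> held{false};
    for (int i = 0; i < 3; ++i)
        pending.tasks.push_back([&done] { ++done; });

    std::thread a([&] {
        std::lock_guard<std::mutex> lock(m);
        held = true;
        while (done.load() < 3) std::this_thread::yield();  // simulated stall
    });
    while (!held.load()) std::this_thread::yield();  // wait until A owns m

    int ran = lock_or_work(m, pending);
    m.unlock();
    a.join();
    return ran;
}
```

The sketch also shows the coupling this requires: the lock path has to know where to find runnable work.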

Jim Dempsey

uj · ‎01-01-2009

jimdempseyatthecove: "I do not think the TBB allocator should be totally separated from the TBB scheduler."

This is not necessary in order to accomplish what I suggested, namely that the allocator be marketed in its own right.

The only thing you need to do is tell people that if they're into OO programming they may benefit from using the TBB allocator even though they're not using anything else from the TBB library. And the only thing that's required of the TBB allocator is that this is true. -:)

jimdempseyatthecove · ‎01-02-2009

uj,

I agree with you completely. That is why my original post included:

>>The code could be written for conditional compilation as to if it were being used together with the TBB task scheduler or not.

i.e. if TBB were in use, the memory allocator would hook into the task scheduler. If TBB were not used, the code would perform a spin-wait of some sort --- or --- call a stub in which the programmer could place a call to do something productive.
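A conditional-compilation switch of that kind might look like the sketch below. Everything in it is an assumption made for illustration: `ALLOC_USE_SCHEDULER` is a hypothetical build flag (not a real TBB macro), `scheduler_run_stolen_task` is a hypothetical scheduler entry point, and `contention_hook`, `on_contention` and `HookedSpinLock` are invented names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

#ifdef ALLOC_USE_SCHEDULER
// Built together with a scheduler: contention is handed to it.
// scheduler_run_stolen_task is a hypothetical function that the
// scheduler would have to provide.
void scheduler_run_stolen_task();
inline void on_contention() { scheduler_run_stolen_task(); }
#else
// Standalone build: a replaceable stub that yields by default.
using ContentionHook = void (*)();
inline void yield_hook() { std::this_thread::yield(); }
inline ContentionHook& contention_hook() {
    static ContentionHook hook = &yield_hook;
    return hook;
}
inline void on_contention() { contention_hook()(); }
#endif

// Spin lock that calls on_contention() whenever it fails to acquire.
class HookedSpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire))
            on_contention();
    }
    void unlock() { flag.clear(std::memory_order_release); }
};

#ifndef ALLOC_USE_SCHEDULER
// Installs a counting hook, triggers it twice, restores the default,
// and returns how often the hook ran.
int demo_custom_hook() {
    static int calls = 0;
    calls = 0;
    contention_hook() = [] { ++calls; };  // the programmer's "something productive"
    on_contention();
    on_contention();
    contention_hook() = &yield_hook;
    return calls;
}
#endif
```

In the standalone build the stub is not synchronized, so a program would install its hook once at startup, before any other thread touches the lock.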

Jim Dempsey

uj · ‎01-03-2009

What do you think about my suggestion in general? There seems to be quite some consensus that users of the OO style of programming benefit from a specialized allocator favouring small objects. Couldn't the TBB allocator fill a gap here? Maybe it could even be promoted into the Intel C++ compiler as an optional "standard" allocator. This would raise its status and strengthen its position as a high quality product.

jimdempseyatthecove · ‎01-03-2009

My opinion on this is that the allocator should be kept with TBB and have a conditional option switch that turns integration with the TBB task scheduler on or off. TBB is Threading Building Blocks as opposed to Monolithic Threading System, i.e. the blocks can be used separately. Most of TBB is Open Source; I imagine the allocator would be part of the Open Source and therefore would be available from the threadingbuildingblocks.org website.

What might be nice is a quick lookup chart which shows dependencies such that you can easily determine if a routine is independent of the larger TBB library.

Jim Dempsey

robert_jay_gould · ‎01-05-2009

Quoting - uj

What do you think about my suggestion in general? There seems to be quite some consensus that users of the OO style of programming benefit from a specialized allocator favouring small objects. Couldn't the TBB allocator fill a gap here? Maybe it could even be promoted into the Intel C++ compiler as an optional "standard" allocator. This would raise its status and strengthen its position as a high quality product.

uj, I use TBB with Loki's Small Object allocator in some parts of my code, and it works fine (as far as I know; there might be a bit of overhead somewhere, but the benefits outweigh the loss). I haven't had any issues so far, since the allocator, as it stands in the latest version of Loki, allows for threading/non-threading policies.

uj · ‎01-12-2009

Thank you.

Well, I actually expected the Loki allocator to work with the TBB allocator. What I didn't realize when I started this thread was that the TBB allocator is a small object allocator in its own right. So there seems to be no reason to use some other small object allocator, such as the Loki one, in place of the TBB allocator just to get efficient small object allocation.

What one should realize, though, is that when the Loki allocator is used it replaces the TBB allocator, and that can have efficiency issues. This of course may mean little if one's program benefits greatly from the Loki library, as you said.

Again, I would like to put forward this idea that the TBB allocator should be marketed in its own right. Maybe even within the Intel compiler as an "official" replacement for the standard general allocator. It would preferably be used in programs which rely heavily on the OO paradigm. (Then the Loki library wouldn't even need the Loki allocator.) The purpose would be to raise the status and awareness of the TBB allocator in general, something the TBB library would also benefit from.