Re: tbb allocator questions

zach_turner · 06-29-2009

Couple of questions about allocators:

How do the alignment guarantees of scalable_allocator and cache_aligned_allocator compare with OS page alignment? If I have an OS system call that requires a page aligned buffer, can I be assured that these two allocators will satisfy those requirements?

Has there been any consideration of including a pool_allocator or fixed_block_allocator? For example, maybe I allocate / deallocate thousands of 4KB buffers, but rarely allocate buffers of other sizes. Or maybe I want to allocate 16 4KB buffers at a time, having them all in contiguous memory. I can work around this by just calculating the size in advance and allocating it manually, but I suspect other optimizations can be made if the allocator itself has this kind of extra information.

Lastly, if I know I'm always going to be using a given allocator from a single thread, what about the possibility of using a thread-local heap? Not meaning thread local storage, but just a heap whose allocate / free operations internally assume that they're being accessed only from a single thread and as such do NOT internally use malloc / free / new / delete.

robert_jay_gould · 06-29-2009

1) cache_aligned_allocator is page aligned; scalable_allocator isn't (with respect to each allocation), so you can't use scalable_allocator if you absolutely require page alignment.

2) I agree a fixed-size allocator would be nice. So far I've fulfilled this need with a Loki allocator layered on top of a cache_aligned_allocator. The only issue, as far as I can imagine, is that this feature goes beyond TBB's current scope, but it would be a welcome feature IMHO too.

3) Again, a nice feature, but TBB is task based, so in principle it has no need for a single-threaded allocator. That said, the TBB allocators do exactly this behind the scenes, so if your usage pattern is as you describe, that's the behavior you get (not sure if this is an implementation detail or if it's a documented feature).

RafSchietekat · 06-29-2009

"cache_aligned_allocator is page aligned"
No, it's cache-aligned, as the name says. :-)

Like pool allocation, the scalable allocator has zero overhead for a number of sizes, but you cannot add your own. 4096 isn't one of them (roughly a quarter of the space will be overhead for allocations of 4096 bytes), and the allocated memory won't be page-aligned.

Other than that, the scalable allocator is tailored for good intra-thread performance.

Alexey-Kukanov · 06-30-2009

> How do the alignment guarantees of scalable_allocator and cache_aligned_allocator compare with OS page alignment? If I have an OS system call that requires a page aligned buffer, can I be assured that these two allocators will satisfy those requirements?

Doing an implicit 4K alignment on every allocation would cause significant memory overhead.
The latest TBB 2.1 updates and developer releases have a special set of functions for aligned allocations, e.g. scalable_aligned_malloc, where you can specify the desired alignment.
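
As a rough illustration of the aligned-allocation route, here is a minimal sketch of requesting a page-aligned buffer. The helper names, the USE_TBB_MALLOC opt-in macro, and the 4096-byte page size are assumptions made for this example, not part of TBB. With the macro defined (and the program linked against tbbmalloc), the sketch calls scalable_aligned_malloc and scalable_aligned_free; otherwise it falls back to C++17 std::aligned_alloc so that it builds without TBB installed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

#ifdef USE_TBB_MALLOC  // hypothetical opt-in macro for this sketch, not defined by TBB
#include <tbb/scalable_allocator.h>
#endif

// Assumed page size for illustration; real code should query the OS.
constexpr std::size_t kPageSize = 4096;

// Allocate `size` bytes aligned to `alignment`, which must be a power of two.
inline void* aligned_buffer_alloc(std::size_t size, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
#ifdef USE_TBB_MALLOC
    return scalable_aligned_malloc(size, alignment);
#else
    // std::aligned_alloc expects the size to be a multiple of the alignment.
    std::size_t rounded = (size + alignment - 1) / alignment * alignment;
    return std::aligned_alloc(alignment, rounded);
#endif
}

// Release a buffer obtained from aligned_buffer_alloc.
inline void aligned_buffer_free(void* p) {
#ifdef USE_TBB_MALLOC
    scalable_aligned_free(p);
#else
    std::free(p);
#endif
}

// True if the address of `p` is a multiple of `alignment`.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A call such as aligned_buffer_alloc(100, kPageSize) then states the alignment requirement explicitly instead of depending on how the allocator happens to place a given request.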

> Has there been any consideration of including a pool_allocator or fixed_block_allocator? For example, maybe I allocate / deallocate thousands of 4KB buffers, but rarely allocate buffers of other sizes. Or maybe I want to allocate 16 4KB buffers at a time, having them all in contiguous memory. I can work around this by just calculating the size in advance and allocating it manually, but I suspect other optimizations can be made if the allocator itself has this kind of extra information.

To allocate 16 buffers of the same size in contiguous memory, you may use the scalable_calloc call. There is no special API for fixed-size pools, but the internals of the allocator should handle such use cases well.
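
A small sketch of the contiguous-block idea follows. The helper names and the USE_TBB_MALLOC opt-in macro are assumptions for this example only. With the macro defined, the block comes from scalable_calloc and is released with scalable_free; otherwise std::calloc stands in so the sketch builds without TBB. Either way the result is one zero-filled block, and each buffer is reached by pointer arithmetic. Nothing here makes the block page-aligned.

```cpp
#include <cstddef>
#include <cstdlib>

#ifdef USE_TBB_MALLOC  // hypothetical opt-in macro for this sketch, not defined by TBB
#include <tbb/scalable_allocator.h>
#endif

constexpr std::size_t kBufferSize = 4096;  // one 4KB buffer
constexpr std::size_t kBufferCount = 16;   // buffers per contiguous block

// Allocate `count` zero-filled buffers of `size` bytes each in one block.
inline unsigned char* alloc_buffer_block(std::size_t count, std::size_t size) {
#ifdef USE_TBB_MALLOC
    return static_cast<unsigned char*>(scalable_calloc(count, size));
#else
    return static_cast<unsigned char*>(std::calloc(count, size));
#endif
}

// Release a block obtained from alloc_buffer_block.
inline void free_buffer_block(unsigned char* block) {
#ifdef USE_TBB_MALLOC
    scalable_free(block);
#else
    std::free(block);
#endif
}

// Address of the i-th buffer inside the block (no bounds checking).
inline unsigned char* nth_buffer(unsigned char* block, std::size_t i, std::size_t size) {
    return block + i * size;
}
```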

> Lastly, if I know I'm always going to be using a given allocator from a single thread, what about the possibility of using a thread-local heap? Not meaning thread local storage, but just a heap whose allocate / free operations internally assume that they're being accessed only from a single thread and as such do NOT internally use malloc / free / new / delete.

For objects of <8K, the TBB allocator uses per-thread heaps optimized for use by one thread - the heap owner. Other threads free memory back to the heap it was allocated from; it does not block the owner and has minimal impact on the speed of its operations (unless deallocations by other threads prevail).

zach_turner · 06-30-2009

Thanks for the information. I did not know about the scalable_aligned_* calls. Perhaps I could suggest an implementation of scalable_aligned_calloc in future releases?

Regarding the page allocation, I've generated quite a few allocations using scalable_allocator and cache_aligned_allocator, and it seems they have always been page aligned. I assume this is just coincidence?

Alexey-Kukanov · 06-30-2009

What would be the semantics of aligned calloc? Should just the first object be aligned, or should every object be? If the first one, it's trivially doable with a single multiplication and aligned malloc, except that the space is not necessarily zero-filled. If the second one, how would the method report the addresses of each object, or at least the step that should be used instead of the object size to iterate over the aligned objects?
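
The first interpretation (only the first object aligned) could look roughly like the sketch below. The function name aligned_calloc_sketch is hypothetical and is not a TBB function. It uses C++17 std::aligned_alloc as a stand-in so that it builds without TBB; with TBB one would presumably call scalable_aligned_malloc for the allocation and scalable_aligned_free for the release. The explicit memset supplies the zero fill that an aligned malloc does not promise, and the overflow checks guard the multiplication.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: zero-filled storage for `count` objects of `size` bytes,
// with only the start of the block aligned to `alignment` (a power of two).
// Returns nullptr on overflow or allocation failure. Free with std::free.
inline void* aligned_calloc_sketch(std::size_t count, std::size_t size,
                                   std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    // The "single multiplication", guarded against overflow.
    if (size != 0 && count > SIZE_MAX / size) return nullptr;
    std::size_t total = count * size;

    // std::aligned_alloc expects a size that is a multiple of the alignment.
    if (total > SIZE_MAX - (alignment - 1)) return nullptr;
    std::size_t rounded = (total + alignment - 1) / alignment * alignment;
    if (rounded == 0) rounded = alignment;  // avoid a zero-sized request

    void* p = std::aligned_alloc(alignment, rounded);
    if (p != nullptr) std::memset(p, 0, rounded);  // aligned malloc does not zero
    return p;
}
```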

Regarding the page allocation, I would guess your experience is not pure coincidence but is possibly due to the size of the allocation requests. All allocations above ~8K are currently aligned on a page boundary. But this is an implementation detail that may change, so do not rely on it; use aligned malloc instead.

zach_turner · 06-30-2009

Hmm, that's a good question. I guess there are guarantees / assumptions the function could make that would allow the caller to calculate the step size, but maybe it's not worth it. For example, if it requires alignment to be a power of 2 (perhaps it already does; I know the MSVC version of _aligned_malloc has this requirement), then the client would have a well-defined way of calculating the step size.

On the other hand, maybe the fact that there are cases where it would end up allocating orders of magnitude more padding than actual data might make it impractical.
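
To make the step-size idea concrete, here is a minimal sketch, assuming the alignment is a power of two. The function names are hypothetical. The step is the object size rounded up to the next multiple of the alignment, and the padding is whatever that rounding adds. The last static_assert shows the worst case mentioned above: 8-byte objects aligned to 4096 bytes carry 4088 bytes of padding each, which is 511 times the data.

```cpp
#include <cstddef>

// True if x is a nonzero power of two.
constexpr bool is_power_of_two(std::size_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}

// Distance in bytes between consecutive objects when every object must start
// on an `alignment` boundary. Assumes `alignment` is a power of two.
constexpr std::size_t aligned_step(std::size_t object_size, std::size_t alignment) {
    return (object_size + alignment - 1) & ~(alignment - 1);
}

// Bytes of padding that follow each object.
constexpr std::size_t padding_per_object(std::size_t object_size, std::size_t alignment) {
    return aligned_step(object_size, alignment) - object_size;
}

static_assert(is_power_of_two(4096), "4096 is a power of two");
static_assert(aligned_step(24, 16) == 32, "24 bytes rounds up to 32");
static_assert(aligned_step(4096, 4096) == 4096, "exact multiples need no padding");
static_assert(padding_per_object(8, 4096) == 4088, "tiny objects are almost all padding");
```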

Alexey-Kukanov · 06-30-2009

Quoting - zach.turner

... if it requires alignment to be a power of 2 (perhaps it already does, I know the msvc version of _aligned_malloc has this requirement) ...

Yes, alignment has to be a power of two for scalable_aligned_* calls.