Hey all,
I'm writing some low-level code (for cache-optimized data structures), and it's critical that the pointer returned by scalable_malloc points to the beginning of a cache line. Is this the default behaviour of scalable_malloc? Is it even possible to ensure that I've actually allocated memory at the beginning of a cache line?
Thanks
AJ
So what's wrong with the cache-aligned allocator, and when will it be fixed?
"So what's wrong with the cache-aligned allocator, and when will it be fixed?"
Nothing is wrong with it, except maybe the padding added for the sake of alignment, since cache_aligned_allocator might use either scalable_malloc or "regular" malloc. Also, its interface is different from malloc's, though I would not call that an issue.
But since AJ asked about scalable_malloc, I only covered scalable_malloc behavior in my answer. And you have already suggested cache_aligned_allocator anyway.
Let me clarify what I'm doing. I'm doing some data structure research, and I will be allocating some multiple of the size of a cache-line. I get the cache-line size from CPUID. For my data structure to work, I'd have to be guaranteed that my data is allocated at the beginning of a cache-line. This was the motivation for my question, since I'm using scalable_malloc. I know that for a single element, I could have used the cache_aligned_allocator, perhaps I should look at the code for that.
Within the data block I allocate (which is some multiple of the cache-line size) I use placement new to put data inside.
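That combination of an aligned block plus placement new can be sketched as follows (a minimal, hypothetical example: std::aligned_alloc stands in for scalable_malloc so it runs without TBB, and Node is an invented element type):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <new>

struct Node {
    int value;
    explicit Node(int v) : value(v) {}
};

// Allocate `lines` cache lines aligned to `line` bytes, placement-new a Node
// at the start of the block, read it back, then tear everything down.
inline int demo_placement_new(std::size_t line, std::size_t lines) {
    void* block = std::aligned_alloc(line, lines * line);  // C++17; size must be a multiple of line
    if (!block) return -1;
    assert((reinterpret_cast<std::uintptr_t>(block) & (line - 1)) == 0);
    Node* n = new (block) Node(42);   // construct inside the pre-allocated storage
    int v = n->value;
    n->~Node();                        // placement new requires manual destruction
    std::free(block);
    return v;
}
```

Note that an allocator which already guarantees the alignment (as scalable_malloc is claimed to for small sizes) would let the `assert` above be relied on rather than checked.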
"Let me clarify what I'm doing." How big are those allocations? How many of them (how much overhead can be tolerated)? If you can tolerate overhead, go with cache_aligned_allocator (you can specify how many objects you want, it's not just for a single object), and then you shouldn't need the placement new (that's all from the documentation, I haven't used it myself yet). Otherwise, if I recall correctly from messing around with the code (see "Additions to atomic
I'm in the habit of simplifying my questions to just the specific one I have, rather than dumping the whole complex thing... however Raf, you're making me do this! :-)
I've built a fixed_size_list representation, which guarantees to put elements into cache lines (if they fit, obviously). However, it also puts a bitmask at the end to store information about each element: whether that slot is empty (junk data) or holds a live element. I've also built a sublist object which meets the TBB Range concept, and the idea is to be able to split the list into cache-line-sized chunks, so that each cache line has the info it needs for working with elements in parallel. I will be using this component to build my own implementations of other data structures. In particular, I intend to use it in an attempt to rebuild concurrent_vector to my liking. More on that later; first this has to be done :-)
In particular, I've followed the atomic float discussion, and I may extend this concept slightly in a new data structure to do fine-grained locking on each element. I'm still thinking on that.
Internally the code is extremely low-level... I store a pointer to an unsigned int*... the performance seems very good so far, very few cache misses. However, I must ensure that the bitmask is in the cache line... hence I must ensure that I start the allocation at the beginning of a cache-line.
You can browse my code online here: http://code.google.com/p/parallel-building-blocks/source/browse/trunk/include/pbb/pbb_list.h
My data structure will fail dismally if the beginning of my unsigned int* does not point to the beginning of a cache-line. Theoretically I could do an assert to ensure that I'm pointing at the beginning of a cache-line. I have to research this to see how I would do it.
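One way such a cache-line chunk might look (purely a sketch of the elements-plus-trailing-bitmask layout described above; the name LineChunk, the 64-byte line, and the slot count are all assumptions, not AJ's actual code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout: a 64-byte chunk holding 15 uint32_t slots plus a
// 32-bit occupancy bitmask in the last 4 bytes of the same cache line.
struct alignas(64) LineChunk {
    static const std::size_t kSlots = 15;
    std::uint32_t slot[kSlots];   // element storage
    std::uint32_t occupied;       // bit i set => slot[i] holds a live element

    LineChunk() : occupied(0) {}
    void put(std::size_t i, std::uint32_t v) {
        slot[i] = v;
        occupied |= (1u << i);
    }
    bool has(std::size_t i) const { return (occupied >> i) & 1u; }
};

static_assert(sizeof(LineChunk) == 64, "chunk must fill exactly one cache line");
```

The `alignas(64)` only guarantees the alignment when the compiler or an aligned allocator honours it, which is exactly why the starting address of the underlying allocation matters.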
"I'm in the habit of simplifying my questions to just the specific one I have, rather than dumping the whole complex thing... however Raf, you're making me do this! :-)" And all without answering my questions... but they were just rhetorical anyway.
"I have to research this to see how I would do it." assert(0==(((uintptr_t)mypointer)&(cache_line_size-1)));
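Raf's one-liner can be wrapped as a small reusable check. In this sketch, std::aligned_alloc stands in for scalable_malloc so the helper can be exercised without TBB; `is_cache_aligned` is an invented name:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// True iff p sits at the start of a cache line of the given size.
// cache_line_size must be a power of two for the mask trick to work.
inline bool is_cache_aligned(const void* p, std::size_t cache_line_size) {
    return (reinterpret_cast<std::uintptr_t>(p) & (cache_line_size - 1)) == 0;
}
```

A misaligned address (say, one byte past a line boundary) fails the check, which is what the assert in AJ's code would catch.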
The assert fails :-(
I've started to look through the cache-aligned allocator to see how it works.
"The assert fails :-(" What is the cache line size, and which request size failed?
Currently cache_aligned_allocator must return blocks aligned on 128 bytes. I am not sure how the assert can fail... Does it really return blocks NOT aligned on 128 bytes?
What cache line size are you getting from CPUID?
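The line size reported by the hardware can differ from the 128 bytes cache_aligned_allocator assumes, which is one way the assert could fail with a larger mask. A sketch of a runtime query (sysconf is POSIX; the _SC_LEVEL1_DCACHE_LINESIZE name is a Linux extension, hence the guarded fallback — CPUID, as AJ uses, is the other route):

```cpp
#include <cstddef>
#include <unistd.h>   // sysconf

// Query the L1 data-cache line size at run time; fall back to a common
// default when the platform does not report it.
inline std::size_t cache_line_size() {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (n > 0) return static_cast<std::size_t>(n);
#endif
    return 64;   // typical on current x86; TBB conservatively pads to 128
}
```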
Actually, malloc() is used past 8064 bytes (not 4096): scalable_malloc() inflates the requested size by up to slightly over 200% (sic) and returns a pointer at something like 24/32 modulo 16 kB (on a 32/64-bit architecture). So I'm afraid I was referring to my "Additions to atomic
Actually, as Alexey pointed out, small objects will be cache-line-aligned by scalable_malloc. So I could do a switch much like in cache_aligned_allocator: use scalable_malloc directly for small sizes, and for larger ones allocate an extra 64 bytes (or whatever the cache-line length is) and perform the alignment myself. I'll try this approach.
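The over-allocate-and-align scheme might look like the sketch below (plain malloc stands in for scalable_malloc; the names aligned_malloc/aligned_free are invented). Stashing the raw pointer just below the aligned address means deallocation does not need to reproduce the original sizing decision:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Over-allocate by one alignment unit plus a pointer slot, round the user
// pointer up to the alignment, and stash the raw pointer just below it so
// aligned_free can recover it.  alignment must be a power of two.
inline void* aligned_malloc(std::size_t size, std::size_t alignment) {
    void* raw = std::malloc(size + alignment + sizeof(void*));
    if (!raw) return nullptr;
    std::uintptr_t base = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    std::uintptr_t user = (base + alignment - 1) & ~(std::uintptr_t)(alignment - 1);
    reinterpret_cast<void**>(user)[-1] = raw;   // remember the raw allocation
    return reinterpret_cast<void*>(user);
}

inline void aligned_free(void* p) {
    if (p) std::free(reinterpret_cast<void**>(p)[-1]);
}
```

The cost is up to `alignment + sizeof(void*)` bytes of overhead per allocation, which is the padding trade-off discussed earlier in the thread.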
"I'll try this approach." I agree, if you can somehow reproduce the original decision at deallocation time, e.g., by knowing the original request size, which a malloc/free-like interface doesn't give you. BTW, maximum overhead in my code can be a lot less than 100%, depending on the settings.
