Hey all,
I'm writing some low-level code (for cache-optimized data structures), and it's critical that the pointer returned by scalable_malloc points to the beginning of a cache line. Is this the default behaviour of scalable_malloc? Is it even possible to ensure that I've actually allocated memory at the beginning of a cache line?
Thanks
AJ
So what's wrong with the cache-aligned allocator, and when will it be fixed?
"So what's wrong with the cache-aligned allocator, and when will it be fixed?"
Nothing is wrong with it, except maybe the padding added for the sake of alignment, since cache_aligned_allocator might use either scalable_malloc or "regular" malloc. Also, its interface is different from malloc's, though I would not call that an issue.
But since AJ asked about scalable_malloc, I only covered scalable_malloc behavior in my answer. And you have already suggested cache_aligned_allocator anyway.
Let me clarify what I'm doing. I'm doing some data structure research, and I will be allocating some multiple of the size of a cache-line. I get the cache-line size from CPUID. For my data structure to work, I'd have to be guaranteed that my data is allocated at the beginning of a cache-line. This was the motivation for my question, since I'm using scalable_malloc. I know that for a single element, I could have used the cache_aligned_allocator, perhaps I should look at the code for that.
Within the data block I allocate (which is some multiple of the cache-line size) I use placement new to put data inside.
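That combination of an aligned block plus placement new can be sketched as follows (a minimal, hypothetical example: std::aligned_alloc stands in for scalable_malloc so it runs without TBB, and Node is an invented element type):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <new>

struct Node {
    int value;
    explicit Node(int v) : value(v) {}
};

// Allocate `lines` cache lines aligned to `line` bytes, placement-new a Node
// at the start of the block, read it back, then tear everything down.
inline int demo_placement_new(std::size_t line, std::size_t lines) {
    void* block = std::aligned_alloc(line, lines * line);  // C++17; size must be a multiple of line
    if (!block) return -1;
    assert((reinterpret_cast<std::uintptr_t>(block) & (line - 1)) == 0);
    Node* n = new (block) Node(42);   // construct inside the pre-allocated storage
    int v = n->value;
    n->~Node();                        // placement new requires manual destruction
    std::free(block);
    return v;
}
```

Note that an allocator which already guarantees the alignment (as scalable_malloc is claimed to for small sizes) would let the `assert` above be relied on rather than checked.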
"Let me clarify what I'm doing." How big are those allocations? How many of them (how much overhead can be tolerated)? If you can tolerate overhead, go with cache_aligned_allocator (you can specify how many objects you want, it's not just for a single object), and then you shouldn't need the placement new (that's all from the documentation, I haven't used it myself yet). Otherwise, if I recall correctly from messing around with the code (see "Additions to atomic
I'm in the habit of simplifying my questions to just the specific one I have, rather than dumping the whole complex thing... however Raf, you're making me do this! :-)
I've built a fixed_size_list representation, which guarantees to put elements into cache lines (if they fit, obviously). However, it also puts a bitmask at the end to store information about each element: whether that slot is empty (junk data) or holds a live element. I've also built a sublist object which meets the TBB Range concept, and the idea is to be able to split the list into cache-line-sized chunks, so that each cache line has the info it needs for working with elements in parallel. I will be using this component to build my own implementations of other data structures. In particular, I intend to use it in an attempt to rebuild concurrent_vector to my liking. More on that later; first this has to be done :-)
In particular, I've followed the atomic float discussion, and I may extend this concept slightly in a new data structure to do fine-grained locking on each element. I'm still thinking on that.
Internally the code is extremely low-level... I store a pointer to an unsigned int*... the performance seems very good so far, very few cache misses. However, I must ensure that the bitmask is in the cache line... hence I must ensure that I start the allocation at the beginning of a cache-line.
You can browse my code online here: http://code.google.com/p/parallel-building-blocks/source/browse/trunk/include/pbb/pbb_list.h
My data structure will fail dismally if the beginning of my unsigned int* does not point to the beginning of a cache-line. Theoretically I could do an assert to ensure that I'm pointing at the beginning of a cache-line. I have to research this to see how I would do it.
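One way such a cache-line chunk might look (purely a sketch of the elements-plus-trailing-bitmask layout described above; the name LineChunk, the 64-byte line, and the slot count are all assumptions, not AJ's actual code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout: a 64-byte chunk holding 15 uint32_t slots plus a
// 32-bit occupancy bitmask in the last 4 bytes of the same cache line.
struct alignas(64) LineChunk {
    static const std::size_t kSlots = 15;
    std::uint32_t slot[kSlots];   // element storage
    std::uint32_t occupied;       // bit i set => slot[i] holds a live element

    LineChunk() : occupied(0) {}
    void put(std::size_t i, std::uint32_t v) {
        slot[i] = v;
        occupied |= (1u << i);
    }
    bool has(std::size_t i) const { return (occupied >> i) & 1u; }
};

static_assert(sizeof(LineChunk) == 64, "chunk must fill exactly one cache line");
```

The `alignas(64)` only guarantees the alignment when the compiler or an aligned allocator honours it, which is exactly why the starting address of the underlying allocation matters.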
"I'm in the habit of simplifying my questions to just the specific one I have, rather than dumping the whole complex thing... however Raf, you're making me do this! :-)" And all without answering my questions... but they were just rhetorical anyway.
"I have to research this to see how I would do it." assert(0==(((uintptr_t)mypointer)&(cache_line_size-1)));
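Raf's one-liner can be wrapped as a small reusable check. In this sketch, std::aligned_alloc stands in for scalable_malloc so the helper can be exercised without TBB; `is_cache_aligned` is an invented name:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// True iff p sits at the start of a cache line of the given size.
// cache_line_size must be a power of two for the mask trick to work.
inline bool is_cache_aligned(const void* p, std::size_t cache_line_size) {
    return (reinterpret_cast<std::uintptr_t>(p) & (cache_line_size - 1)) == 0;
}
```

A misaligned address (say, one byte past a line boundary) fails the check, which is what the assert in AJ's code would catch.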
The assert fails :-(
I've started to look through the cache-aligned allocator to see how it works.
"The assert fails :-(" What is the cache line size, and which request size failed?
Currently cache_aligned_allocator must return blocks aligned on 128 bytes. I am not sure how the assert can fail... Does it really return blocks NOT aligned on 128 bytes?
What cache line size are you getting from CPUID?
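The line size reported by the hardware can differ from the 128 bytes cache_aligned_allocator assumes, which is one way the assert could fail with a larger mask. A sketch of a runtime query (sysconf is POSIX; the _SC_LEVEL1_DCACHE_LINESIZE name is a Linux extension, hence the guarded fallback — CPUID, as AJ uses, is the other route):

```cpp
#include <cstddef>
#include <unistd.h>   // sysconf

// Query the L1 data-cache line size at run time; fall back to a common
// default when the platform does not report it.
inline std::size_t cache_line_size() {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (n > 0) return static_cast<std::size_t>(n);
#endif
    return 64;   // typical on current x86; TBB conservatively pads to 128
}
```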
Actually, malloc() is used past 8064 bytes (not 4096): scalable_malloc() inflates the requested size by up to slightly over 200% (sic) and returns a pointer at something like 24/32 modulo 16 kB (on a 32/64-bit architecture). So I'm afraid I was referring to my "Additions to atomic
Actually, as Alexey pointed out, small objects will be cache-line-aligned by scalable_malloc. So I could do a switch much like in cache_aligned_allocator: use scalable_malloc directly for small sizes, and for larger ones allocate an extra 64 bytes (or whatever the cache-line length is) and perform the alignment myself. I'll try this approach.
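The over-allocate-and-align scheme might look like the sketch below (plain malloc stands in for scalable_malloc; the names aligned_malloc/aligned_free are invented). Stashing the raw pointer just below the aligned address means deallocation does not need to reproduce the original sizing decision:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Over-allocate by one alignment unit plus a pointer slot, round the user
// pointer up to the alignment, and stash the raw pointer just below it so
// aligned_free can recover it.  alignment must be a power of two.
inline void* aligned_malloc(std::size_t size, std::size_t alignment) {
    void* raw = std::malloc(size + alignment + sizeof(void*));
    if (!raw) return nullptr;
    std::uintptr_t base = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    std::uintptr_t user = (base + alignment - 1) & ~(std::uintptr_t)(alignment - 1);
    reinterpret_cast<void**>(user)[-1] = raw;   // remember the raw allocation
    return reinterpret_cast<void*>(user);
}

inline void aligned_free(void* p) {
    if (p) std::free(reinterpret_cast<void**>(p)[-1]);
}
```

The cost is up to `alignment + sizeof(void*)` bytes of overhead per allocation, which is the padding trade-off discussed earlier in the thread.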
"I'll try this approach." I agree, if you can somehow reproduce the original decision at deallocation time, e.g., by knowing the original request size, which a malloc/free-like interface doesn't give you. BTW, maximum overhead in my code can be a lot less than 100%, depending on the settings.
