Intel® oneAPI Threading Building Blocks

Cache aligned allocator alignment

Ritwik_D_1
Beginner
I am seeing the following code in cache_aligned_allocator. Does TBB blindly use a cache line size of 128, or does it actually check the size at runtime?
I ask because when I call GetLogicalProcessorInformation on Windows I get a cache line size of 64. Does this mean that using cache_aligned_allocator wastes more memory than it should?
 
// TODO: use CPUID to find actual line size, though consider backward compatibility
static size_t NFS_LineSize = 128;
size_t NFS_GetLineSize() {
    return NFS_LineSize;
}
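
For reference, this is roughly how I query the line size on Windows; a minimal sketch with most error handling omitted:

#include <windows.h>
#include <vector>
#include <iostream>

// Query the L1 data cache line size via GetLogicalProcessorInformation.
DWORD CacheLineSize() {
    DWORD len = 0;
    // First call fails with ERROR_INSUFFICIENT_BUFFER and reports the required buffer size.
    GetLogicalProcessorInformation(nullptr, &len);
    std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
        len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
    if (!GetLogicalProcessorInformation(info.data(), &len))
        return 0;
    for (const auto& entry : info) {
        if (entry.Relationship == RelationCache && entry.Cache.Level == 1)
            return entry.Cache.LineSize;   // reports 64 on my machine
    }
    return 0;
}

int main() {
    std::cout << "Cache line size: " << CacheLineSize() << std::endl;
    return 0;
}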
Alexei_K_Intel
Employee

Hi Ritwik,

cache_aligned_allocator always uses 128-byte alignment. There are several reasons:

  • Historical: the Intel Itanium 2 architecture has a 128-byte cache line for its L2 and L3 caches;
  • Prefetching behavior (in my opinion, the most important reason): Intel architectures have a so-called "Spatial Prefetcher", which strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk.

In many cases we use cache_aligned_allocator to improve performance (e.g. to avoid false sharing or to benefit from aligned memory accesses), so it can waste some memory in order to provide the most efficient memory address. However, if you have a huge number of small objects that you want to allocate independently, you probably do not need cache_aligned_allocator, because it can waste more memory than the performance it buys back.
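
To illustrate the false-sharing point, here is a minimal sketch (the Counter struct and thread count are just placeholders): each per-thread counter gets its own 128-byte aligned block from tbb::cache_aligned_allocator, so threads updating neighbouring counters never share a cache line or a 128-byte prefetched pair.

#include "tbb/cache_aligned_allocator.h"
#include <vector>

struct Counter { long value; };

int main() {
    tbb::cache_aligned_allocator<Counter> alloc;
    const int nthreads = 8;
    std::vector<Counter*> counters(nthreads);
    for (int i = 0; i < nthreads; ++i) {
        counters[i] = alloc.allocate(1);   // each counter in its own 128-byte aligned block
        counters[i]->value = 0;
    }
    // ... each thread increments only counters[tid]->value ...
    for (int i = 0; i < nthreads; ++i)
        alloc.deallocate(counters[i], 1);
    return 0;
}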

Regards,
Alex

 

Alexei_K_Intel
Employee

Hi Ritwik,

I played a bit with different alignments (64 vs. 128) and measured about 1.7x memory-consumption overhead for small objects on my machine. I reworked tbb::cache_aligned_allocator into an aligned_allocator<T, ALIGNMENT> template, but it now depends on tbbmalloc.dll (while tbb::cache_aligned_allocator can be used without it). So if the memory overhead matters in your case, feel free to use my implementation:

#include "tbb/concurrent_hash_map.h"
#include "tbb/cache_aligned_allocator.h"
#include "tbb/scalable_allocator.h"

#include <iostream>

#include "Windows.h"
#include "Psapi.h"

// Committed (pagefile-backed) memory of the current process, in MB.
SIZE_T MemoryUsage() {
    PROCESS_MEMORY_COUNTERS mem;
    GetProcessMemoryInfo(GetCurrentProcess(), &mem, sizeof(mem));
    const SIZE_T MB = 1024 * 1024;
    return mem.PagefileUsage / MB;
}

template<typename T, size_t ALIGNMENT>
class aligned_allocator {
public:
    typedef T value_type;
    typedef value_type* pointer;
    typedef const value_type* const_pointer;
    typedef value_type& reference;
    typedef const value_type& const_reference;
    typedef size_t size_type;
    typedef ptrdiff_t difference_type;
    template<typename U> struct rebind {
        typedef aligned_allocator <U, ALIGNMENT> other;
    };

    aligned_allocator () throw() {}
    aligned_allocator ( const aligned_allocator & ) throw() {}
    template<typename U, size_t A> aligned_allocator (const aligned_allocator <U, A>&) throw() {}

    pointer address(reference x) const {return &x;}
    const_pointer address(const_reference x) const {return &x;}

    //! Allocate space for n objects, starting on an ALIGNMENT-byte boundary.
    pointer allocate( size_type n, const void* = 0 ) {
        pointer p = pointer(scalable_aligned_malloc(n * sizeof(value_type), ALIGNMENT));
        if (!p) throw std::bad_alloc();
        return p;
    }

    //! Free block of memory that starts on a cache line
    void deallocate( pointer p, size_type ) {
        scalable_aligned_free(p);
    }

    //! Largest value for which method allocate might succeed.
    size_type max_size() const throw() {
        return (~size_t(0)-ALIGNMENT)/sizeof(value_type);
    }

    //! Copy-construct value at location pointed to by p.
    template<typename U, typename... Args>
    void construct(U *p, Args&&... args)
        { ::new((void *)p) U(std::forward<Args>(args)...); }

    //! Destroy value at location pointed to by p.
    void destroy( pointer p ) {p->~value_type();}
};

int main() {
    typedef std::pair<int, int> elem_t;

    //typedef std::allocator<elem_t> allocator_t;
    //typedef aligned_allocator<elem_t, 128> allocator_t;
    //typedef tbb::cache_aligned_allocator<elem_t> allocator_t;
    typedef aligned_allocator<elem_t, 64> allocator_t;

    tbb::concurrent_hash_map<int, int, tbb::tbb_hash_compare<int>, allocator_t> m;

    for (int i = 0; i < 10 * 1000 * 1000; ++i)
        m.insert(std::make_pair(i, i));

    std::cout << "Memory usage: " << MemoryUsage() << "MB" << std::endl;

    return 0;
}
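
To reproduce the comparison, build the program once with the aligned_allocator<elem_t, 64> typedef and once with the 128-byte variant (or tbb::cache_aligned_allocator), then compare the printed memory usage; that is how I got the ~1.7x figure above.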

Regards,
Alex

Ritwik_D_1
Beginner

Thanks very much for sharing this code and for your earlier reply. This helps!
