Intel® oneAPI Threading Building Blocks

Scalable allocation of large (8MB) memory regions on NUMA architectures

Tobias_M_
Beginner

I am currently using a TBB flow graph in which a) a parallel filter processes an array (in parallel, with offsets) and puts the processed results into an intermediate vector (allocated on the heap; typically the vector grows up to 8 MB). These vectors are then passed to nodes which postprocess the results based on their characteristics (determined in a)). Because of synchronized resources, there can be only one such node per characteristic. The prototype we wrote works well on UMA architectures (tested on single-CPU Ivy Bridge and Sandy Bridge machines). However, the application does not scale on our NUMA architecture (4-CPU Nehalem-EX). We pinned the problem down to memory allocation and created a minimal example: a parallel pipeline that just allocates memory from the heap (via malloc of an 8 MB chunk, followed by a memset of the 8 MB region, similar to what the initial prototype would do) up to a certain amount of memory; a sketch of this benchmark follows the findings below. Our findings are:

- On a UMA architecture the application scales up linearly with the number of threads used by the pipeline (set via task_scheduler_init)

- On the NUMA architecture when we pin the application to one socket (using numactl) we see the same linear scale-up

- On the NUMA architecture, when we use more than one socket, the runtime of our application increases with the number of sockets (a negative linear "scale-up")
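
As mentioned above, here is a minimal sketch of the benchmark (reconstructed from the description; the chunk count, the command-line thread argument, and the immediate free of each chunk are assumptions, not the original code):

```cpp
// Minimal sketch of the allocation benchmark: a TBB pipeline whose parallel
// stage mallocs an 8 MB chunk, memsets it, and frees it.
#include <tbb/pipeline.h>
#include <tbb/task_scheduler_init.h>
#include <cstdlib>
#include <cstring>

int main(int argc, char* argv[]) {
    const std::size_t chunk_size   = 8 * 1024 * 1024; // 8 MB per allocation
    const std::size_t total_chunks = 1024;            // the "certain amount of memory"
    int nthreads = (argc > 1) ? std::atoi(argv[1])
                              : tbb::task_scheduler_init::default_num_threads();
    tbb::task_scheduler_init init(nthreads);          // thread count, as in the original test

    std::size_t issued = 0;
    tbb::parallel_pipeline(nthreads,
        // Serial input stage: hand out one 8 MB chunk per token.
        tbb::make_filter<void, void*>(tbb::filter::serial_in_order,
            [&](tbb::flow_control& fc) -> void* {
                if (issued == total_chunks) { fc.stop(); return nullptr; }
                ++issued;
                return std::malloc(chunk_size);
            }) &
        // Parallel stage: touch every page, then release the chunk.
        tbb::make_filter<void*, void>(tbb::filter::parallel,
            [=](void* p) {
                std::memset(p, 0, chunk_size); // first touch faults the pages in
                std::free(p);                  // the real prototype keeps its vectors
            }));
    return 0;
}
```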

For us this smells like heap contention. What we have tried so far is substituting Intel's TBB scalable allocator for the glibc allocator. However, initial performance on a single socket is worse than with glibc; on multiple sockets performance does not get worse, but it does not get any better either. We observed the same effect with tcmalloc and the Hoard allocator.
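
For anyone reproducing this, a minimal sketch of the explicit-call route through TBB's scalable allocator (the unmodified binary can alternatively be started with LD_PRELOAD=libtbbmalloc_proxy.so to redirect all malloc/free calls without recompiling):

```cpp
// Sketch: explicitly routing the 8 MB allocations through TBB's scalable
// allocator instead of glibc malloc (link with -ltbbmalloc).
#include <tbb/scalable_allocator.h>
#include <cstring>

int main() {
    const std::size_t chunk_size = 8 * 1024 * 1024;
    void* p = scalable_malloc(chunk_size); // drop-in replacement for malloc
    std::memset(p, 0, chunk_size);         // touch the pages, as in the benchmark
    scalable_free(p);                      // drop-in replacement for free
    return 0;
}
```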

My question is whether anyone has experienced similar issues. Stack allocation is not an option for us, as we want to keep the heap-allocated vectors even after the pipeline has run.

Update: I attached perf stats for the various executions with numactl. Interleaving/localalloc has no effect whatsoever (the QPI bus is not the bottleneck; we verified this with PCM, which shows QPI link load at 1%).
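
For reference, a sketch (an assumption for illustration, not from the original post) of requesting the same placement policies programmatically via libnuma (link with -lnuma) that numactl --interleave=all and numactl --localalloc apply externally:

```cpp
// Sketch: libnuma equivalents of the numactl placement policies tried above.
#include <numa.h>
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) { std::puts("no NUMA support"); return 1; }
    const std::size_t bytes = 8 * 1024 * 1024;

    void* interleaved = numa_alloc_interleaved(bytes); // pages spread round-robin over all nodes
    std::memset(interleaved, 0, bytes);
    numa_free(interleaved, bytes);

    void* local = numa_alloc_local(bytes);             // pages placed on the calling thread's node
    std::memset(local, 0, bytes);
    numa_free(local, bytes);
    return 0;
}
```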

Update 2: I also added a chart depicting the results for glibc, tbbmalloc, and tcmalloc.

SergeyKostrov
Valued Contributor II
I'd like to understand how much memory you actually need to process your array of data.

>>...There is no difference between glibc malloc and other scalable allocators such as Intel's TBB scalable_malloc...

This is really strange, because by design they have different implementations and scalable_malloc should be faster. I haven't had a chance to look at the Linux sources for scalable_malloc, and I think you need to take a closer look at them. I finally decided to use malloc because scalable_malloc didn't give me any performance advantage when allocating memory blocks larger than 1.5GB (and this is what I need). My actual goal is even larger blocks (up to 2.8GB, with the Win32 API function VirtualAlloc) on 32-bit Windows platforms that support Address Windowing Extensions (AWE), since that technology allows a 32-bit application to allocate up to 3GB of memory. Unfortunately, it is not applicable in your case since you're using Linux.

>>...When we memset the memory, this triggers an exception (mind the kernel page is read-only) and a page fault...

Absolutely the same happens on Windows platforms, and it doesn't surprise me.

>>...Address this scalability issue in the Linux kernel...

I can't say anything here.
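
To make the page-fault point concrete, a minimal sketch (an illustration, not code from the thread): malloc of a large block returns almost immediately because the kernel only reserves address space, while the first memset faults every page in.

```cpp
// Sketch: malloc only reserves address space, so it returns quickly; the
// first memset triggers a page fault per page, which dominates the cost.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
    const std::size_t bytes = 8 * 1024 * 1024;
    auto t0 = std::chrono::steady_clock::now();
    void* p = std::malloc(bytes);          // fast: no physical pages yet
    auto t1 = std::chrono::steady_clock::now();
    std::memset(p, 0, bytes);              // slow: faults in every page on first touch
    auto t2 = std::chrono::steady_clock::now();
    long long alloc_us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    long long touch_us = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    std::printf("malloc: %lld us, first memset: %lld us\n", alloc_us, touch_us);
    std::free(p);
    return 0;
}
```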
jimdempseyatthecove
Honored Contributor III
Sergey,

Large blocks (in your case 1.5GB) tend to be allocated once, and you keep such a block around for as long as a block of that size will be used (or later re-used) by your application. For allocations like this you want to bypass the scalable allocator, because there is no performance benefit in using it in this circumstance. Note that once the block is allocated (on a memory-limited system), you would not want to return it, as the memory may later get fragmented. You will want to defer returning this buffer until a buffer of this size will never be required again by the application.

A typical scalable allocator has a slab size wired in (some may have several slab sizes), where a slab is a large block (4MB or larger) that is allocated by a non-scalable allocator (e.g. VirtualAlloc) and then, subsequent to that allocation, may be subdivided on demand into pools of specific-sized nodes (16, 32, 64, ... bytes), with the creation of each pool deferred until there is demand for a node of that size. Any allocation larger than the slab size will either fall into a higher slab-size category or bypass the scalable allocator and go directly to the non-scalable allocator.

In your case the 1.5GB allocation bypassed the scalable allocator's memory (i.e., it did not come out of a slab) and went "directly" to VirtualAlloc. "Directly" still includes additional scalable-allocator overhead to classify the size of the request and to package it for eventual return. Your application's allocations of these large blocks are more efficiently performed by specific code that bypasses the scalable allocator.

Jim Dempsey
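
A minimal sketch of this bypass idea (the function names and the 4MB threshold are illustrative, not TBB's actual internal values; on Windows, mmap/munmap would be VirtualAlloc/VirtualFree):

```cpp
// Sketch: requests at or above an assumed slab-size threshold go straight to
// the OS, everything else takes the pooled, scalable path.
#include <tbb/scalable_allocator.h>
#include <sys/mman.h>
#include <cstddef>

static const std::size_t kHugeThreshold = 4 * 1024 * 1024; // assumed slab size

void* big_alloc(std::size_t bytes) {
    if (bytes >= kHugeThreshold) {
        // Large request: map pages directly, no slab or classification overhead.
        void* p = mmap(0, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return (p == MAP_FAILED) ? 0 : p;
    }
    return scalable_malloc(bytes); // small request: pooled, scalable path
}

void big_free(void* p, std::size_t bytes) {
    if (bytes >= kHugeThreshold) munmap(p, bytes);
    else scalable_free(p);
}
```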
Bernard
Valued Contributor I
>>>What if these CPUs simply "fight" for access to the same local memory? Could it be the case?>>>

Probably the NUMA-related memory distances, coupled with a thread being executed on different nodes and forced to access non-local memory, could be responsible for the performance degradation of memory accesses. When the number of nodes is greater than 1, some performance penalty is to be expected. IIRC the penalty is measured in units of "NUMA distance", normalized so that every access to local memory has a cost of 10 (i.e., 1.0). When a process accesses off-node (remote) memory, a penalty is added for the overhead of moving data over the NUMA interconnect: accessing a neighbouring node can add up to 0.4, so the total penalty can reach 1.4. More information can be found in the ACPI documentation.
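
To see these distance values on a concrete machine, a small sketch using libnuma's numa_distance() (an addition for illustration, not part of the original reply):

```cpp
// Sketch: printing the ACPI SLIT distance matrix via libnuma (link with
// -lnuma). 10 means local access; larger values are remote-node penalties,
// e.g. 14 corresponds to the 1.4 relative cost mentioned above.
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) { std::puts("no NUMA support"); return 1; }
    int nodes = numa_max_node() + 1;
    for (int i = 0; i < nodes; ++i)
        for (int j = 0; j < nodes; ++j)
            std::printf("distance(node %d -> node %d) = %d\n",
                        i, j, numa_distance(i, j));
    return 0;
}
```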
SergeyKostrov
Valued Contributor II
>>...More information can be found in the ACPI documentation...

Iliya, please post the link to these docs. Thanks in advance.
Bernard
Valued Contributor I
>>>Iliya, please post the link to these docs. Thanks in advance.>>>

I searched through the ACPI standard, but the information contained there is very scarce. I have only found the structure describing the distance table and some explanation regarding the NUMA-distance calculation. Look for the SLIT and SRAT data structures. Here is the link: http://www.acpi.info/spec.htm

Here is a NUMA topology diagram, albeit for the Magny-Cours architecture: http://code.google.com/p/likwid-topology/wiki/AMD_MagnyCours8

Very interesting information regarding NUMA performance degradation: http://communities.vmware.com/thread/391284

Follow also this link: https://docs.google.com/viewer?a=v&q=cache:K06wsPrSIFYJ:cs.nyu.edu/~lerner/spring10/projects/NUMA.pdf+&hl=pl&gl=pl&pid=bl&srcid=ADGEESifVHDcWcZwmY4R0uJ1LCP56lti2Imj0DQQ-oF7I4dNcDsuuGN0fFH6dxu6LJoFgV2x6O44Z6PsLShuyHRkbZXySP1PypGCxrNV96P0DS_z8ePW4kQcGdklYffIeuKfMvRWAzGE&sig=AHIEtbRyrlk1n8pQHsYNonceYEiTmOwAYA

Read also this article: http://kevinclosson.wordpress.com/2009/08/14/intel-xeon-5500-nehalem-ep-numa-versus-interleaved-memory-aka-suma-there-is-no-difference-a-forced-confession/

Later today I will search through the chipset specification, which I believe could contain more valuable information. I will post the links.
SergeyKostrov
Valued Contributor II
Got it. Thanks, Iliya.
Bernard
Valued Contributor I
>>>Got it. Thanks, Iliya.>>>

No problem :) I did a preliminary search through the chipset documentation, and sadly NUMA is mentioned only once. Here is the link: http://www.intel.com/content/www/us/en/chipsets/server-chipsets/server-chipset-5500.html

It seems that no valuable information regarding the NUMA API implementation can be found beyond the Linux NUMA header files.