Intel® oneAPI Threading Building Blocks

Scalable allocation of large (8MB) memory regions on NUMA architectures

Tobias_M_
Beginner

I am currently using a TBB flow graph in which (a) a parallel filter processes an array (in parallel with offsets) and puts the processed results into an intermediate vector (allocated on the heap; mostly the vector will grow up to 8MB). These vectors are then passed to nodes that postprocess the results based on their characteristics (determined in (a)). Because of synchronized resources, there can only be one such node per characteristic. The prototype we wrote works well on UMA architectures (tested on single-CPU Ivy Bridge and Sandy Bridge systems). However, the application does not scale on our NUMA architecture (4-CPU Nehalem-EX). We pinned the problem down to memory allocation and created a minimal example in which a parallel pipeline just allocates memory from the heap (via malloc of an 8MB chunk, then a memset of the 8MB region; similar to what the initial prototype would do) up to a certain amount of memory. Our findings are:

- On a UMA architecture the application scales up linearly with the number of threads used by the pipeline (set via task_scheduler_init)

- On the NUMA architecture when we pin the application to one socket (using numactl) we see the same linear scale-up

- On the NUMA architecture when we use more than one socket, the runtime of our application increases with the number of sockets (negative linear scale-"up")

For us this smells like heap contention. What we have tried so far is to substitute Intel's TBB scalable allocator for the glibc allocator. However, the initial performance on a single socket is worse than with glibc, and on multiple sockets performance does not get worse, but it does not get any better either. We saw the same effect with tcmalloc and the Hoard allocator.

My question is whether anyone has experienced similar issues. Stack allocation is not an option for us because we want to keep the heap-allocated vectors even after the pipeline has run.

Update: I attached perf stats for the various executions with numactl. Interleaving/localalloc has no effect whatsoever (the QPI bus is not the bottleneck; we verified that with PCM, QPI link load is at 1%).

Update 2: I also added a chart depicting the results for glibc, tbbmalloc, and tcmalloc.
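
For reference, the minimal example is structured roughly like this (a simplified sketch, not the attached prototype.cpp; the chunk count, thread count, and filter structure are illustrative):

#include <tbb/pipeline.h>
#include <tbb/task_scheduler_init.h>
#include <cstdlib>
#include <cstring>
#include <vector>

int main() {
    const size_t chunkSize  = 8 * 1024 * 1024;  // 8MB per allocation
    const size_t numChunks  = 1024;             // total memory touched
    const int    numThreads = 8;                // varied via task_scheduler_init

    tbb::task_scheduler_init init(numThreads);
    std::vector<char*> chunks(numChunks);
    size_t produced = 0;

    tbb::parallel_pipeline(
        numThreads,  // maximum number of tokens in flight
        // serial input filter: hands out chunk indices
        tbb::make_filter<void, size_t>(tbb::filter::serial_in_order,
            [&](tbb::flow_control& fc) -> size_t {
                if (produced == numChunks) { fc.stop(); return 0; }
                return produced++;
            }) &
        // parallel filter: allocate an 8MB chunk and touch it
        tbb::make_filter<size_t, void>(tbb::filter::parallel,
            [&](size_t i) {
                char* p = static_cast<char*>(std::malloc(chunkSize));
                std::memset(p, 0, chunkSize);   // first touch triggers the page faults
                chunks[i] = p;                  // kept alive after the pipeline
            })
    );
    return 0;
}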

Vladimir_P_1234567890
Hello, was the tbbmalloc taken from 4.1 update 1? --Vladimir
Tobias_M_
Beginner
The perf stats are for 4.0; just ran the prototype with tbb41_20121003oss (4.1 update 1) and got the same result though.
SergeyKostrov
Valued Contributor II
>>...My question is if someone experienced similar issues. Stack-allocation is not an option for us as we want to keep the heap-allocated vectors even after the pipeline ran...

I reported some problems with 'scalable_allocator' a couple of months ago and got no clear response from the TBB team. Take a look at this thread:

Forum topic: The memory manager cannot access sufficient memory to initialize; exiting
Web-link: http://software.intel.com/en-us/forums/topic/277187
Vladimir_P_1234567890
Tobias M. wrote:

The perf stats are for 4.0; just ran the prototype with tbb41_20121003oss (4.1 update 1) and got the same result though.

OK, thanks. We'll take a look and come back. --Vladimir
Tobias_M_
Beginner
Thanks. If you need the code sample, ping me.
Alexandr_K_Intel1
Tobias, yes, could you please send the reproducer?
Tobias_M_
Beginner
Attached prototype.cpp (as .txt; the forum does not accept .cpp). Compiled with g++-4.7 -std=c++11 -march=native -O3.
Alexandr_K_Intel1
Thanks, got it! Just to check my understanding: are you interested in raw allocation and memory-touching performance, without subsequent releasing? And during performance measurement you use something like "numactl --membind=0 -- ./foo", i.e. without CPU binding?
Tobias_M_
Beginner
Yes, I'm interested in multithreaded raw allocation and memory touching of say 8MB chunks without releasing the memory within the pipeline (the memory is freed only when the application is closed). For performance measurements I used numactl --cpunodebind=0 ... numactl --cpunodebind=0-3 (we have 4 CPUs). Memory binding, interleaving (--interleave=all) and --localalloc have no effect here. With CPU binding we have the following finding: 1 CPU (speed x), 2 CPUs (speed ~ x/2), 3 CPUs (speed ~ x/3) [speed=the higher the better]. So what we see is that performance worsens with the number of CPUs.
Tobias_M_
Beginner
Our system configuration for reference: 4 socket Nehalem-EX (4 X7560 CPUs) with a 5520/5500/X58 chipset and 1 TB main memory (256 GiB per socket).
jimdempseyatthecove
Honored Contributor III
A few comments on your test program:

First off, the program is only performing allocation. Scalable allocators are tuned for reallocation of nodes previously returned after allocation. IOW, the first allocation may suffer some additional overhead at the benefit of reduced return and reallocation times.

Secondly, in parallel_pipeline designs you pass a token, containing context, meaning not only can buffer pointers be passed through the stages of the pipeline, but they can also persist through recycling of the token (with context) from the back end of the pipeline to the front end. IOW, once allocated, keep the 8MB buffer (pointer) inside the token. Only when a larger buffer is required would your code free and then reallocate.

Thirdly, run your performance tests longer than 1.66 seconds.

Jim Dempsey
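
One way to approximate that recycling idea with the parallel_pipeline interface is sketched below (the free list, names, and structure are purely illustrative, not a definitive design):

#include <tbb/pipeline.h>
#include <tbb/concurrent_queue.h>
#include <cstdlib>
#include <cstring>

const size_t chunkSize = 8 * 1024 * 1024;

// Item flowing through the pipeline: owns one reusable 8MB buffer.
struct WorkItem {
    char*  buffer;
    size_t size;
};

// Free list of items; buffers are allocated once and then recycled.
tbb::concurrent_bounded_queue<WorkItem*> freeList;

void run(size_t numChunks, int numTokens) {
    size_t produced = 0;
    tbb::parallel_pipeline(
        numTokens,
        tbb::make_filter<void, WorkItem*>(tbb::filter::serial_in_order,
            [&](tbb::flow_control& fc) -> WorkItem* {
                if (produced == numChunks) { fc.stop(); return nullptr; }
                ++produced;
                WorkItem* item;
                if (!freeList.try_pop(item)) {              // allocate only on first use
                    item = new WorkItem;
                    item->buffer = static_cast<char*>(std::malloc(chunkSize));
                    item->size   = chunkSize;
                }
                return item;
            }) &
        tbb::make_filter<WorkItem*, WorkItem*>(tbb::filter::parallel,
            [](WorkItem* item) -> WorkItem* {
                std::memset(item->buffer, 0, item->size);   // stand-in for the real work
                return item;
            }) &
        tbb::make_filter<WorkItem*, void>(tbb::filter::serial_out_of_order,
            [&](WorkItem* item) {
                freeList.push(item);                        // recycle instead of freeing
            })
    );
}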
RafSchietekat
Valued Contributor III
I didn't feel that I had sufficient information to say something sensible, as Jim may have (depending on what the actual details are). The original question is about a flow graph, but the example code is about a parallel_pipeline, which is not quite the same. Will the data be scattered, or can good cache locality be expected? If the latter, I would allocate at the start and deallocate at the end (here my advice differs from Jim's second comment, because the tasks that hold the tokens may be stolen, so keeping the memory regions with the tokens may be counterproductive) and let the scalable allocator automatically deliver excellent performance after an initial warm-up period that should be ignored in any useful benchmark (here I concur with Jim's first and third comments). This depends on how well these characteristics are now available for allocations of about 8 MB, compared with an earlier implementation that merely delegated them to malloc(); otherwise it might be better to roll your own pool based on TLS, but I haven't investigated this (yet). And again, I don't see the full picture yet.
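
Just to illustrate the idea (I have not tried this), such a TLS-based pool could be as simple as a per-thread free list of fixed-size blocks, e.g. using tbb::enumerable_thread_specific; everything below is a rough, untested sketch:

#include <tbb/enumerable_thread_specific.h>
#include <cstdlib>
#include <vector>

// Per-thread pool of fixed-size 8MB blocks: each thread reuses the blocks it
// released earlier, so repeated allocations stay on the local NUMA node
// (assuming the allocating thread does the first touch and is not migrated).
class ThreadLocalPool {
    static const size_t blockSize = 8 * 1024 * 1024;
    tbb::enumerable_thread_specific<std::vector<void*> > freeBlocks;
public:
    void* allocate() {
        std::vector<void*>& local = freeBlocks.local();
        if (!local.empty()) {
            void* p = local.back();
            local.pop_back();
            return p;
        }
        return std::malloc(blockSize);   // slow path: fall through to the underlying allocator
    }
    void deallocate(void* p) {
        // Simplification: the block goes back to the *calling* thread's list,
        // even if another thread originally allocated it.
        freeBlocks.local().push_back(p);
    }
    ~ThreadLocalPool() {
        for (std::vector<void*>& local : freeBlocks)
            for (size_t i = 0; i < local.size(); ++i)
                std::free(local[i]);
    }
};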
jimdempseyatthecove
Honored Contributor III
Raf,

>>because the tasks that hold the tokens may be stolen, so keeping the memory regions with the tokens may be counterproductive

Not so, Raf. TBB does not have preemption; tasks run to completion. Threads (and the tasks they are currently running) can get preempted by threads of other applications (or oversubscribed threads of the same application). When the last filter completes in a parallel_pipeline, the token containing the buffer pointer(s) gets recycled and is either immediately re-consumed by the thread exiting the pipeline or enqueued and made available for any available thread to use (the choice depends on the implementation).

I do agree with you that unless the parallel_pipeline is using all threads, the buffer pointers should NOT be in TLS (i.e., place them inside the token).

Jim Dempsey
RafSchietekat
Valued Contributor III
Hmm, it's been a while since I've been fiddling with the implementation of a pipeline (back when tasks still had depth!). Perhaps something changed in this regard as well, perhaps I just forgot? I've had another look (release 4.1), and after the last stage a stage_task does bypass the scheduler to wrap around to the beginning, so locality would be preserved. However, unless it's documented, a new implementation might change that again. And also now I'm confused about how you would even "keep the 8MB buffer (pointer) inside the token", because a token is just an invisible and therefore inaccessible implementation issue, isn't it? You set the maximum number of tokens, but that's it as far as tokens go. I'll have another look later, because now I'm really starting to get worried. :-)
SergeyKostrov
Valued Contributor II
I see that the problem is identical to what I experienced a couple of months ago. There is a performance problem with scalable_malloc after some number of allocations has been completed by the function: as soon as at least 1.5 GB of memory is allocated, performance decreases and a regular CRT-based malloc (heap based) outperforms scalable_malloc.

Here are a couple more notes related to the test case provided.

>>...However, the application does not scale on our NUMA architecture (4-CPU Nehalem-EX). We pinned the problem down to memory allocation and created a minimal example in which a parallel pipeline just allocates memory from the heap (via malloc of an 8MB chunk, then a memset of the 8MB region...

For example, on a Windows platform scalable_malloc doesn't allocate memory from the heap; it uses the Win32 API function VirtualAlloc. VirtualAlloc is called with the MEM_COMMIT flag, which initializes the allocated memory to 0. So a call to:

memset( buffer, 0, chunkSize );

looks useless and I would try not to do it. What if memset is "responsible" for that performance decrease? Just comment it out and run your tests again. Raf, if you're going to look at it, could you verify what happens when the test is executed on a Linux platform? Does it allocate the memory from the heap?

>>...For us this smells like heap contention...

As I already mentioned, on a Windows platform scalable_malloc doesn't allocate memory from the heap. To clear up as many uncertainties as possible, I recommend creating a test case that simply verifies the scalability of scalable_malloc without using any additional TBB features/classes or calls to any CRT mem-functions. What if there is actually a problem with the parallel_pipeline functionality?
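
Something like the following would do as a minimal standalone check (thread and chunk counts are arbitrary; keep or remove the memset depending on what you want to isolate):

// Standalone scalability check for scalable_malloc: N threads, each allocating
// and touching 8MB chunks, with no parallel_pipeline involved.
#include <tbb/scalable_allocator.h>
#include <cstring>
#include <thread>
#include <vector>

int main() {
    const size_t chunkSize       = 8 * 1024 * 1024;
    const size_t chunksPerThread = 128;
    const unsigned numThreads    = std::thread::hardware_concurrency();

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([=] {
            std::vector<void*> chunks;
            for (size_t i = 0; i < chunksPerThread; ++i) {
                void* p = scalable_malloc(chunkSize);
                std::memset(p, 0, chunkSize);   // comment this out to test the memset hypothesis
                chunks.push_back(p);
            }
            for (size_t i = 0; i < chunks.size(); ++i)
                scalable_free(chunks[i]);
        });
    }
    for (size_t i = 0; i < workers.size(); ++i)
        workers[i].join();
    return 0;
}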
jimdempseyatthecove
Honored Contributor III
>>And also now I'm confused about how you would even "keep the 8MB buffer (pointer) inside the token", because a token is just an invisible and therefore inaccessible implementation issue, isn't it?

The token would be hidden in your base class; the buffer or buffer pointers would be inside the derived class. Be mindful that the number of buffers == number of tokens. Also, if (when) using affinity pinning, perform the allocation and first touch inside the parallel compute filter. This may require a priming NULL buffer to pass once through the (serial) input filter and a NOP through the remainder of the pipeline on the first pass.

Jim Dempsey
RafSchietekat
Valued Contributor III
"The token would be hidden in your base class, the buffer or buffer pointers would be inside the derived class." Base class and derived class of what?
SergeyKostrov
Valued Contributor II
Tobias, could you provide some update on the status of the problem, please? Also, I don't think it is possible for many of us to reproduce the problem on a NUMA system; I personally don't have access to any.

>>...
>>- On the NUMA architecture when we use more than one socket, the runtime of our application increases with the number of sockets (negative linear scale-"up")
>>...

Is there any possibility that the test application always allocates the 8MB memory blocks from memory local to the CPUs selected for processing?

>>...2 CPUs (speed ~ x/2), 3 CPUs (speed ~ x/3)...

What if these CPUs simply "fight" for access to the same local memory? Could that be the case? Does TBB have any control over NUMA-related features, like access to local or remote memory? I could be wrong with my assumptions, but it looks like the performance drop could be caused by a different problem not related to TBB, and you need to take that into consideration.

Best regards, Sergey
Tobias_M_
Beginner
A short status update and a partial answer to the described issue: the calls to malloc or scalable_malloc are not the bottleneck; the bottleneck is rather the page faults triggered by memsetting the allocated memory. There is no difference between glibc malloc and other scalable allocators such as Intel's TBB scalable_malloc: for allocations larger than a specific threshold (usually 1MB if nothing is freed; can be defined by madvise) the memory will be allocated by an anonymous mmap. Initially all pages of the mapping point to a kernel-internal page that is pre-zeroed and read-only. When we memset the memory, this triggers an exception (mind that the kernel page is read-only) and a page fault, and a new page is zeroed at that time. Small pages are 4KB, so this happens 2048 times for the 8MB buffer we allocate and write. What I measured is that these page faults are not so expensive on single-socket machines but get more and more expensive on NUMA machines with multiple CPUs.

Solutions I have come up with so far:

- Use huge pages: helps, but only delays the problem
- Use a preallocated and pre-faulted (either memset or mmap + MAP_POPULATE) memory region (a memory pool) and allocate from there: helps, but one does not necessarily want to do that (see the sketch below)
- Address this scalability issue in the Linux kernel
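
To illustrate the second workaround, here is a minimal sketch of a preallocated, pre-faulted region on Linux (the bump-pointer allocation, the missing error handling, and the lack of a free path are for illustration only):

#include <sys/mman.h>
#include <cstddef>

// Map one large region up front with MAP_POPULATE so all page faults happen
// once, then hand out 8MB slices from it with a simple bump pointer.
class PrefaultedPool {
    char*  base;
    size_t capacity;
    size_t offset;
public:
    explicit PrefaultedPool(size_t bytes) : capacity(bytes), offset(0) {
        base = static_cast<char*>(mmap(nullptr, bytes,
                                       PROT_READ | PROT_WRITE,
                                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                                       -1, 0));        // pages are faulted in here (check for MAP_FAILED)
    }
    void* allocate(size_t bytes) {
        if (offset + bytes > capacity) return nullptr; // pool exhausted
        void* p = base + offset;
        offset += bytes;                               // not thread-safe; make offset atomic or lock
        return p;
    }
    ~PrefaultedPool() { munmap(base, capacity); }
};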