Thread heap allocation on a NUMA architecture leads to decreased performance

mmmmm__hamed · ‎12-24-2014

hi

I have a server with 80 logical cores (model: DL580 G7). I'm running a single thread per core.

Each thread does MKL FFTs and convolutions, plus many heap allocations and deallocations with malloc.

I previously had a server with 16 logical cores, where there was no problem: each thread ran on its own core at 100% CPU usage.

When I moved my application from that 16-core server to this 80-core server with a NUMA architecture, the first thread I created ran at 100% (kernel time 0%). With each additional thread, the performance of the other threads decreased, so that with 80 threads CPU usage dropped to 40% (39% kernel time).

Because kernel time increased, I think the cause is the heap's serialized access and the heap lock: as the demand for memory allocation grows, the waiting time for each request increases.

I used HeapCreate() in each thread to eliminate waiting on the heap lock, but HeapAlloc can only allocate up to 512 KB, which is insufficient for me.

I also tried VirtualAlloc, but that decreased thread performance.

What should I do to solve this problem?

jimdempseyatthecove · ‎12-26-2014

Look at TBB's scalable allocator as a starting point.

Jim Dempsey

mmmmm__hamed · ‎02-18-2015

Dear jimdempseyatthecove,

Hi.

Excuse me for the long delay. I was busy with something else.

I looked further into my code and found that my low CPU usage is not related to the heap's serialized access.

In each thread, I allocate memory only once, when the thread starts, and then run the MKL convolution in a loop on that memory for a long time (10000 executions).

Even though I don't have any memory allocation inside the thread's loop, my CPU usage is still low (10%). From this test I concluded that my problem is not caused by memory allocation.

After this observation I suspected the memory accesses in the threads, thinking they could lead to remote memory accesses between NUMA nodes. So I commented out the MKL convolution in the thread's loop and did only memory accesses in the loop. When I ran the application, all cores were at 100% CPU usage. From this test I concluded that my problem is not caused by memory access.

Thus only executing the MKL convolution inside the thread's loop leads to the drop in CPU usage.

I think MKL has an internal memory manager that either does not work correctly on a NUMA memory architecture or needs to be configured for NUMA.

Do you have a solution for my problem?

To recap: my server has 80 logical cores and runs 80 Windows threads. Each thread has a loop, and inside the loop I execute the MKL convolution.

Thank you.

McCalpinJohn · ‎02-18-2015

You can't use CPU utilization to determine if the processor is stalled on memory accesses (either local or remote): a core stalled on a memory reference is "busy" from the point of view of the OS (and most of the hardware) in exactly the same way as a core that is completing instructions every cycle.

On Linux systems it is easy to estimate the impact of non-local memory accesses by using "numactl" to force either local or remote placement and looking at the difference in execution time. For example:

time numactl --membind=0 --cpunodebind=0 a.out
time numactl --membind=1 --cpunodebind=0 a.out

I don't know if there is an equivalent mechanism in Windows.

jimdempseyatthecove · ‎02-21-2015

First, seeing logical processors run at 100% is an almost useless indicator. The reasons for this are:

a) You won't know if the logical processor is computing or burning up time inside a barrier or just after (or between) parallel regions.
b) You won't know if the logical processor is stalled waiting for data (from a more distant cache level, local RAM, or another NUMA node).
c) You won't know if the logical processor is stalled waiting for a mutex (critical section) or an atomic operation (CAS loop).
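Since utilization cannot separate these cases, one alternative is to time a fixed amount of work and compare the rate as threads are added. The following is a minimal sketch of that idea, not code from this thread: `stream_sum` is a hypothetical stand-in for the real per-thread work (for example one MKL convolution), and `iterations_per_second` is an illustrative helper name. It uses the C11 `timespec_get` call, which is wall-clock time and not guaranteed monotonic, so a platform timer may be a better fit in real measurements.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Wall-clock time in seconds, via the C11 timespec_get() call. */
double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in for the real per-thread work:
   reads every element of a buffer and sums it. */
double stream_sum(const double *buf, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += buf[i];
    return sum;
}

/* Runs the work `reps` times and returns completed iterations per second.
   The accumulated result is stored in *sink so the compiler cannot
   discard the loop. Returns 0 if the elapsed time was too short to measure. */
double iterations_per_second(const double *buf, size_t n, int reps, double *sink)
{
    assert(reps > 0);
    double acc = 0.0;
    double t0 = wall_seconds();
    for (int r = 0; r < reps; ++r)
        acc += stream_sum(buf, n);
    double elapsed = wall_seconds() - t0;
    *sink = acc;
    return elapsed > 0.0 ? (double)reps / elapsed : 0.0;
}
```

If the per-thread rate falls as more threads are started while utilization stays near 100%, the threads are most likely stalled or contending rather than computing.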

Second, some MKL functions will call malloc and free for temporary arrays. Normally, malloc obtains the allocation of memory from virtual memory space. When that address space has never been used, accessing that area of memory for the first time will cause a trap to the O/S (this is called first touch), and the O/S will then perform the actual mapping of the virtual address space to RAM (usually on the local NUMA node). After allocation, when the memory is freed, just the virtual address space is returned as a free node data structure to the heap (these spaces may or may not be consolidated). The RAM is not actually freed from the process. A subsequent allocation that returns a block within the range previously used (already first touched) points to the memory wherever it used to reside. This will be on the NUMA node (assuming NUMA) of the former allocation. There is an exception to this if the memory was paged out.
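The first-touch behaviour described above can be put to use with a small helper that each worker thread calls on its own freshly obtained buffer before using it. This is only a sketch under stated assumptions: `first_touch` and `ASSUMED_PAGE_SIZE` are illustrative names, the 4096-byte page size is an assumption (real code should query the O/S), and whether the pages actually land on the caller's NUMA node depends on the O/S placement policy. It has no effect on memory that was already touched earlier, such as a block recycled by the heap.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size. 4096 is only a common default, not a guarantee. */
#define ASSUMED_PAGE_SIZE 4096u

/* Writes one zero byte at the start of every page-sized stride of buf, so
   that the calling thread is the first to touch those pages. Overwrites
   data, so it is meant for buffers that have not been filled yet.
   Assumes buf is page-aligned; for an unaligned buffer the final
   partial page may be missed. Returns the number of strides touched. */
size_t first_touch(void *buf, size_t bytes)
{
    assert(buf != NULL || bytes == 0);
    volatile unsigned char *p = (volatile unsigned char *)buf;
    size_t touched = 0;
    for (size_t off = 0; off < bytes; off += ASSUMED_PAGE_SIZE) {
        p[off] = 0;
        ++touched;
    }
    return touched;
}
```

The point of the design is that the thread that will use the buffer performs the touch, rather than the thread that created the workers.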

So what can you do about this with multiple threads on multiple NUMA nodes all sharing the same virtual address space and heap? A temporary array may be first allocated on NODE0 (and first touched by a thread on NODE0), then freed and subsequently reallocated by another thread on a different node. It seems like you have no control over this.

In the MKL User's Guide you will find a section titled "Redefining Memory Functions". You can use this to replace the standard memory functions with your own (or someone else's) that are NUMA aware. Taking this route will require you to do some more searching and reading. "Think of the experience you will gain"
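As a rough sketch of what that section describes: MKL ships a header, `i_malloc.h`, that declares the function pointers `i_malloc`, `i_calloc`, `i_realloc` and `i_free`, and these are to be assigned before the first MKL call. Everything else below is an assumption for illustration: the `node_*` functions are placeholders that simply forward to the C runtime (a NUMA-aware version would replace their bodies, for example with an allocator built on `VirtualAllocExNuma`), and `USE_MKL_I_MALLOC` is a hypothetical build switch, not an MKL macro, so the sketch also compiles without MKL installed. When linking MKL dynamically on Windows, check the guide, which describes additional pointer variants for that case.

```c
#include <stddef.h>
#include <stdlib.h>

#ifdef USE_MKL_I_MALLOC   /* hypothetical switch defined by your own build */
#include "i_malloc.h"     /* MKL header declaring i_malloc, i_calloc, ... */
#endif

/* Placeholder allocators. They only forward to the C runtime here;
   a NUMA-aware implementation would go in these bodies. */
void *node_malloc(size_t size)            { return malloc(size); }
void *node_calloc(size_t n, size_t size)  { return calloc(n, size); }
void *node_realloc(void *p, size_t size)  { return realloc(p, size); }
void  node_free(void *p)                  { free(p); }

/* Call once, before any MKL function is used.
   Returns 1 if the MKL hooks were installed, 0 if built without them. */
int install_mkl_allocators(void)
{
#ifdef USE_MKL_I_MALLOC
    i_malloc  = node_malloc;
    i_calloc  = node_calloc;
    i_realloc = node_realloc;
    i_free    = node_free;
    return 1;
#else
    return 0;
#endif
}
```

Calling `install_mkl_allocators()` at the very start of `main`, before any worker thread is created, keeps MKL from mixing blocks from two different allocators.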

Jim Dempsey