CPati2
New Contributor III

Flat Mode - Memory Allocation


Hi All,

The Xeon Phi 7210 I am using has the following setup:

Cluster mode: Quadrant or All-2-All
Memory mode: Flat

With this configuration, I have two nodes: node 0 (CPU + DDR4) and node 1 (MCDRAM). I want the application to allocate memory on DDR4, not on MCDRAM. For this I use the numactl command as follows:

numactl -m 0 ./sbench

Judging by the memory being consumed, allocation is happening on MCDRAM (node 1) instead of DDR4. Can anyone suggest why this is the case?

P.S.: In chapter 23 of the book "Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition", the authors perform a similar experiment with a very small data footprint of 24 MiB (the benchmark I am using needs 8 GB), and they show a speedup depending on whether allocation occurs on DDR4 or MCDRAM. However, based on my analysis, in flat mode the allocation always occurs on MCDRAM irrespective of the binding.

Thanks.


6 Replies
jimdempseyatthecove
Black Belt

Chetan,

Run numactl -H to confirm your assumption about which node (0 or 1) has the MCDRAM (use the size to disambiguate).

Jim Dempsey

jimdempseyatthecove
Black Belt

Also, I think there is a BIOS setting relating to where the OS is placed. This may affect C/C++ malloc/new as well.

Have you tried using

void* numa_alloc_onnode(size_t size, int node);

This will remove any questions as to what malloc/new (ALLOCATE) is doing.

Jim Dempsey

CPati2
New Contributor III

Hi Jim,

jimdempseyatthecove wrote:

Run numactl -H to confirm your assumption about which node (0 or 1) has the MCDRAM (use the size to disambiguate).

I do run "numactl -H" in a parallel terminal to see which memory is being consumed, and all I see is the MCDRAM getting consumed, as below:

node 0 size: 98178 MB
node 0 free: 94201 MB
node 1 cpus:
node 1 size: 16384 MB 
node 1 free: 12 MB

jimdempseyatthecove wrote:

Also, I think there is a BIOS setting relating to where the OS is placed. This may affect C/C++ malloc/new as well.
Have you tried using
void* numa_alloc_onnode(size_t size, int node);
This will remove any questions as to what malloc/new (ALLOCATE) is doing.

For Flat mode, "Memory Hot-Pluggable" is set, and the OS is supposed to use MCDRAM for user programs and DDR4 for the kernel.

Thanks.

jimdempseyatthecove
Black Belt

>>"Hot-Pluggable" is set and the OS is supposed to use MCDRAM for user programs and DDR4 for the kernel

Apparently your testing shows it did not. Have you modified your test program to use numa_alloc_onnode?

Note, you can overload C++ new for specific types to use a different memory allocator than the default allocator.

struct Enable_Use_qt_malloc
{
    bool b;
    Enable_Use_qt_malloc()
    {
        if (qt::tlsThreadContext)
            b = qt::tlsThreadContext->set_Use_qt_malloc(true);
        else
            b = false;
    }
    ~Enable_Use_qt_malloc()
    {
        if (qt::tlsThreadContext)
            qt::tlsThreadContext->set_Use_qt_malloc(b);
    }
};

// main thread's state (lives for duration of app)
Enable_Use_qt_malloc Enable_Use_qt_malloc_now;

void* operator new(size_t cb)
{
    if (cb == 0) cb = 1; // allocate at least 1 byte
    if (qt::tlsThreadContext && qt::tlsThreadContext->Use_qt_malloc)
        return qt::qt_malloc(cb);
    return malloc(cb);
}

void* operator new[](size_t cb)
{
    if (cb == 0) cb = 1; // allocate at least 1 byte
    if (qt::tlsThreadContext && qt::tlsThreadContext->Use_qt_malloc)
        return qt::qt_malloc(cb);
    return malloc(cb);
}

void __CRTDECL operator delete(void* p) _THROW0()
{
    if (!p) return;
    if (qt::tlsThreadContext && qt::tlsThreadContext->Use_qt_malloc)
        return qt::qt_free(p);
    free(p);
}

void __CRTDECL operator delete[](void* p) _THROW0()
{
    if (!p) return;
    if (qt::tlsThreadContext && qt::tlsThreadContext->Use_qt_malloc)
        return qt::qt_free(p);
    free(p);
}

The above is from my QuickThread parallel programming library. Feel free to rename and repurpose the code.

*** You would modify this to replace qt_malloc/qt_free with numa_alloc_onnode, etc...
*** and extend Enable_Use_qt_malloc (renamed to Enable_Use_numa_malloc)
*** to not only enable/disable the NUMA allocator, but also to specify (per thread) a preferred/required node

The purpose of the Enable_Use_qt_malloc (you rename this) is to switch between the default C/C++ allocator and your NUMA allocator.

Jim Dempsey

CPati2
New Contributor III

Hi Jim,

jimdempseyatthecove wrote:

>>"Hot-Pluggable" is set and the OS is supposed to use MCDRAM for user programs and DDR4 for the kernel

*** You would modify this to replace qt_malloc/qt_free with numa_alloc_onnode, etc...
*** and extend Enable_Use_qt_malloc (renamed to Enable_Use_numa_malloc)
*** to not only enable/disable the NUMA allocator, but also to specify (per thread) a preferred/required node

The purpose of the Enable_Use_qt_malloc (you rename this) is to switch between the default C/C++ allocator and your NUMA allocator.

Sorry, I am a bit lost here.

When "Hot-Pluggable" is ON, the application (user program) goes to MCDRAM, which is correct as per the setting. I am not using any test program; I am using "numactl" to bind the application's allocations to either DDR4 or MCDRAM.

Also, as of now my goal is not to write code (or change code) that queries the device and then allocates on either DDR4 or MCDRAM. I simply want the MKL benchmarks to allocate based on the node I give numactl. If Intel and other resources put so much emphasis on using numactl directly to allocate or pin a benchmark to specific memory, then I should be able to validate this approach.

What you suggested above is useful when code is written from scratch, and even then only when it uses the "memkind" library. Please correct me if I am wrong.

Thanks.

CPati2
New Contributor III

Hi All,

The following worked for me:

export MEMKIND_HBW_NODES=0
numactl -m 0 ./sbench

After digging a lot, I found that I need to set the "MEMKIND_HBW_NODES" environment variable to override memkind.

I didn't know that the benchmarks were using memkind, and I am still a bit confused about this. Since I am using DeepBench and the Intel® Optimized LINPACK Benchmark for Linux, even if I don't use numactl, MEMKIND_HBW_NODES will divert the application's allocations to either MCDRAM or DDR4.

Thanks.
