Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

verifying first-touch memory allocation

Black Belt

Is anyone aware of a basic tool for verifying first-touch memory allocation on a NUMA platform such as Xeon EP?

The usual expectation is that pinning each MPI process to a single CPU should make this happen automatically (barring running out of memory, etc.), unless a non-NUMA BIOS option has been selected.

Likewise, in OpenMP, initializing data with a parallel access pattern consistent with the way the data will later be used should result in allocation local to each CPU rather than in remote memory.

For this to work, apparently, MPI or OpenMP libraries have to coordinate with the BIOS.

It seems there might be a way to determine the address ranges which are local to each CPU on a shared memory platform and perform tests to see where each thread is placing its first touch allocation.

As you might guess, I'm looking for verification of suspected performance problems which seem to indicate threads within MPI ranks pinned to certain CPUs consistently using remote memory.
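For what it's worth, on Linux one way to see where a process's pages actually landed is /proc/&lt;pid&gt;/numa_maps, where each mapping line carries Nk=count fields (pages resident on node k). A minimal sketch that totals those counts per node — the helper name and the sample lines are made up for illustration:

```shell
# numa_pages: sum the per-node page counts (Nk=... fields) from
# /proc/<pid>/numa_maps text supplied on stdin.
numa_pages() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^N[0-9]+=[0-9]+$/) {   # Nk=count: pages on node k
                split($i, kv, "=")
                total[kv[1]] += kv[2]
            }
    }
    END { for (n in total) print n, total[n] }' | sort
}

# Made-up sample lines in numa_maps format.
# Real use: numa_pages < /proc/<pid>/numa_maps
numa_pages <<'EOF'
7f3a4c000000 default anon=512 dirty=512 N0=384 N1=128 kernelpagesize_kB=4
7f3a4d000000 default heap anon=16 dirty=16 N0=16 kernelpagesize_kB=4
EOF
```

A thread that first-touched its data locally should show nearly all of its pages on its own node; a lopsided split toward another node is the remote-placement symptom you describe.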

4 Replies

Hey Tim,

Have you looked at NumaTop? Assuming you are using Linux...

It seems like, if you have each thread malloc a big array and repeatedly run through it, something like numatop should be able to show the local vs. remote stats pretty easily. I've never actually used NumaTop myself; I'm just pretty sure the guys who created it know what they are doing.
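A rough shell stand-in for that test (a sketch only: awk plays the part of the malloc'd array, and the N/PASSES knobs are made up; numatop itself would be watching from another terminal, as root):

```shell
#!/bin/sh
# Sketch: allocate a big array, stream through it repeatedly, and watch the
# local vs. remote split in numatop from another terminal. To pin the run:
#   numactl --cpunodebind=0 --membind=0 sh thisscript.sh
# (point --membind at a different node to force remote accesses on purpose).
N=${1:-1000000}   # elements; an awk array this size is on the order of 100 MB
PASSES=${2:-3}    # raise this for a longer-lived load numatop can sample

awk -v n="$N" -v passes="$PASSES" 'BEGIN {
    for (i = 0; i < n; i++) a[i] = i          # first touch: pages placed here
    for (p = 0; p < passes; p++)
        for (i = 0; i < n; i++) s += a[i]     # streaming reads
    print s
}'
```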


Black Belt

Thanks, that looks like an interesting option.  It requires building a custom kernel with PEBS latency counters, with the step "build kernel as usual" (it really says that, verbatim, in the man page) looking a bit daunting.

I was able to build a running kernel (with working access to this forum page) by following the numatop instructions as best I understood them.  However, numatop reports "CPU is not supported."  Not surprisingly, at a minimum, numatop would need to be rebuilt with support for the Intel(R) Xeon Phi(TM) for it to run there.

GUI tools such as Red Hat System Monitor are still present but show fewer cores than they did under Red Hat (where they didn't see all the cores either).

/proc/cpuinfo still looks OK.

Black Belt

The developers confirmed that it's sufficient to add the CPU model number to the list in order to make numatop accept it.

Firefox has particularly bad memory locality; probably no surprise there.

My application saw around 50% remote memory accesses when running just 1 MPI process (OpenMP-threaded across all cores), but shows good locality when running an even number of processes.  Must look elsewhere for problems.

Black Belt

Standard Linux systems track whether they were able to provide pages according to the NUMA policy requested.

You can dump the stats before and after your run using
      cat /sys/devices/system/node/node0/numastat

The output looks like:
      numa_hit 672421856
      numa_miss 632409
      numa_foreign 185449
      interleave_hit 269407187
      local_node 672420899
      other_node 633366

I find the naming a bit confusing, and typically have to run test cases using numactl with various processor and memory binding options to remind myself what they mean.
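To make the before/after comparison mechanical, a small helper can subtract one snapshot from the other (a sketch: the function name and file names are made up, and the input is assumed to be the "name value" pairs shown above):

```shell
# numastat_delta BEFORE AFTER: print per-counter deltas between two
# /sys/devices/system/node/nodeN/numastat snapshots ("name value" lines).
numastat_delta() {
    awk 'NR == FNR { before[$1] = $2; next }        # first file is the baseline
         { printf "%s %d\n", $1, $2 - before[$1] }' "$1" "$2"
}

# Typical use around a run (node 0 shown; repeat for each node):
#   cat /sys/devices/system/node/node0/numastat > before.txt
#   ./my_pinned_run
#   cat /sys/devices/system/node/node0/numastat > after.txt
#   numastat_delta before.txt after.txt
```

With local first-touch working, the numa_hit and local_node deltas should dwarf numa_miss and other_node for the nodes the run was bound to.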