Is anyone aware of a basic tool for verifying first-touch memory allocation on a NUMA platform such as Xeon EP?
The usual expectation is that pinning MPI processes each to a single CPU should make this happen automatically (barring running out of memory on the local node, etc.), unless a non-NUMA (interleaved) BIOS option has been selected.
Likewise, with OpenMP, initializing data using the same parallel access pattern that will be used later in the computation should result in allocation local to each CPU rather than on remote memory.
For this to work, apparently, MPI or OpenMP libraries have to coordinate with the BIOS.
It seems there might be a way to determine the address ranges which are local to each CPU on a shared memory platform and perform tests to see where each thread is placing its first touch allocation.
As you might guess, I'm looking for verification of suspected performance problems which seem to indicate threads within MPI ranks pinned to certain CPUs consistently using remote memory.
Have you looked at NumaTop (https://01.org/numatop)? Assuming you are using Linux...
It seems like, if you have each thread malloc a big array and repeatedly run through it, something like numatop should be able to show the local vs. remote stats pretty easily. I've never actually used NumaTop myself; I'm just pretty sure the people who created it know what they are doing.
Thanks, that looks like an interesting option. It requires building a custom kernel with the PEBS latency counters enabled, and the step "build kernel as usual" (it says that verbatim in the man page) looks a bit daunting.
I was able to build a running kernel (it boots and I can even access this forum page from it) following the numatop instructions as best I understood them. However, numatop says "CPU is not supported." Not surprisingly, at a minimum, numatop would need to be rebuilt with the Intel(R) Xeon Phi(TM) added to its supported-CPU list before it could run there.
GUI tools such as the Red Hat system monitor are still present under the new kernel, but they show even fewer cores than they did under the stock Red Hat kernel (where they already didn't see all the cores).
/proc/cpuinfo still looks OK.
The developers confirmed that it's sufficient to add the CPU model number to the list in order to make numatop accept it.
Firefox has particularly bad memory locality, probably no surprise there.
My application showed around 50% remote memory accesses when running just 1 MPI process (OpenMP threaded across all cores), but good locality when running an even number of processes. I must look elsewhere for the problem.
Standard Linux systems track whether they were able to provide pages according to the NUMA policy requested.
You can dump the stats before and after your run using numastat(8) (which reads the per-node numa_hit/numa_miss counters, also visible in /sys/devices/system/node/node*/numastat).
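For reference, numastat output has roughly this shape on a two-node system: one column per node and one row per counter, counting pages since boot. The numbers below are purely illustrative:

```
                           node0           node1
numa_hit              1572322698      1529946429
numa_miss                      0               0
numa_foreign                   0               0
interleave_hit             27379           27407
local_node            1572305106      1529921915
other_node                 17592           24514
```

Nonzero numa_miss/numa_foreign counts after your run indicate pages that could not be placed on the requested node.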
I find the naming a bit confusing, and typically have to run test cases using numactl with various processor and memory binding options to remind myself what they mean.