From my experimentation with libnuma on the MTL systems, I think there are (were) two issues, but your comment does not quite line up with my experience. The MTL system uses a RHEL distro that is behind the revision of the Ubuntu distro, so you may have some versioning issues.
A few weeks ago, my experimentation led me to conclude that the BIOS was set to interleave memory across the NUMA nodes (IOW, memory performs like one large node). There is a different BIOS setting that presents each NUMA node as a separate block of physical memory. My observation was that the interleave setting is effectively non-NUMA... meaning UMA (access times average out to uniform across random addresses).
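You can check what the OS actually sees from libnuma; with BIOS interleave enabled you would typically get a single configured node. A minimal sketch (link with -lnuma, nothing MTL-specific):

/* Quick check of how many NUMA nodes the OS exposes.  With BIOS memory
 * interleave enabled you would typically see just one node here.
 * Build: gcc check_nodes.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() == -1) {
        printf("kernel reports no NUMA support\n");
        return 0;
    }
    printf("configured nodes: %d\n", numa_num_configured_nodes());
    return 0;
}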
An additional problem, not fully researched by me: I suspect some playing around with libnuma set the default memory allocation policy to "first touch". While this policy may work well with non-NUMA-aware applications, it can also get in the way when your application attempts to manage data placement itself. I cannot attest that "first touch" is enabled. There may be an API to specify your default preference.
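I do not know what was actually done on the MTL boxes, but as a point of reference, here is a minimal sketch of querying and overriding the calling thread's default policy through libnuma and <numaif.h> (the preferred node number is just an example; MPOL_DEFAULT corresponds to the kernel's local-on-first-touch allocation):

/* Query, then override, the calling thread's default allocation policy.
 * MPOL_DEFAULT is the kernel default: local allocation on first touch.
 * Sketch only; the preferred node (0) is an arbitrary example.
 * Build: gcc policy.c -lnuma */
#include <numa.h>
#include <numaif.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() == -1)
        return 1;

    int mode = -1;
    /* addr == NULL and flags == 0 -> report this thread's default policy */
    if (get_mempolicy(&mode, NULL, 0, NULL, 0) == 0)
        printf("default policy: %d (MPOL_DEFAULT is %d)\n", mode, MPOL_DEFAULT);

    /* Prefer node 0 for this thread's future allocations */
    numa_set_preferred(0);
    return 0;
}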
RE 4x slowdown: the access-time difference on MTL systems between the local node and a remote node (assuming NUMA is not set to interleave) is not 4x. I am not quite sure what it is, but I have seen some benchmarks on AnandTech that lead me to assume it is more like 1.3x (30% slower). If you are seeing 4x, then I suspect something quite different is going on.

Note, from the little working experience I have with "first touch", there seem to be additional tuning options. Not knowing the proper name at this time, call it "subsequent touch". "First touch" maps (assigns) virtual memory pages (4KB or 4MB) to the node of the logical processor that first touches the memory. The "subsequent touch" option (whatever its real name) says, in effect: mark this section of memory (e.g. an array) as not being accessible by any thread; then, upon the next touch by any thread, one of two things happens: a) if touched by a thread on the same node, a simple re-mapping to the existing position on that physical NUMA node; or b) if touched by a thread on a different NUMA node, the data is slurp-copied to the other node, the memory on the prior node is released, and the virtual memory is re-mapped onto the new NUMA node.
With this type of behind-the-scenes shenanigans you could see a 4x slowdown (+/- depending on how often those 4KB/4MB data slurps occur).
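If you want to watch "first touch" do its thing, here is a rough sketch (not MTL-specific; assumes a 2-node box and stock libnuma): two threads, each pinned to a different node, first-touch their half of a shared buffer, and move_pages() with a NULL nodes argument merely queries, reporting which node each page landed on.

/* First-touch demo: two threads, each pinned to its own node, touch their
 * half of one page-aligned buffer; move_pages() with a NULL 'nodes' arg
 * only queries, reporting the node each page landed on.
 * Sketch; assumes >= 2 NUMA nodes.  Build: gcc ft.c -lnuma -lpthread */
#include <numa.h>
#include <numaif.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGES_PER_HALF 4

static long page_sz;
static char *buf;

static void *toucher(void *arg)
{
    long node = (long)arg;
    numa_run_on_node((int)node);                /* pin thread to node's CPUs */
    memset(buf + node * PAGES_PER_HALF * page_sz, 1,
           PAGES_PER_HALF * page_sz);           /* first touch -> local pages */
    return NULL;
}

int main(void)
{
    if (numa_available() == -1 || numa_max_node() < 1) {
        fprintf(stderr, "need at least 2 NUMA nodes\n");
        return 1;
    }
    page_sz = sysconf(_SC_PAGESIZE);
    if (posix_memalign((void **)&buf, page_sz, 2 * PAGES_PER_HALF * page_sz))
        return 1;

    pthread_t t[2];
    for (long n = 0; n < 2; n++)
        pthread_create(&t[n], NULL, toucher, (void *)n);
    for (int n = 0; n < 2; n++)
        pthread_join(t[n], NULL);

    /* nodes == NULL -> move_pages() reports placement instead of moving */
    void *pages[2 * PAGES_PER_HALF];
    int status[2 * PAGES_PER_HALF];
    for (int i = 0; i < 2 * PAGES_PER_HALF; i++)
        pages[i] = buf + i * page_sz;
    move_pages(0, 2 * PAGES_PER_HALF, pages, NULL, status, 0);
    for (int i = 0; i < 2 * PAGES_PER_HALF; i++)
        printf("page %d on node %d\n", i, status[i]);

    free(buf);
    return 0;
}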
When allocating/binding data to NUMA nodes, it is almost a prerequisite that you affinity-pin your threads as well.
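For example (just a sketch, the node number is an assumption): pin the thread first, then allocate on the same node, so the compute stays next to the data.

/* Pin first, then allocate on the same node, so compute stays next to data.
 * Sketch; node 1 is an assumed example.  Build: gcc pin.c -lnuma */
#include <numa.h>

int main(void)
{
    if (numa_available() == -1)
        return 1;

    int node = 1;                        /* assumed target node */
    numa_run_on_node(node);              /* restrict thread to node's CPUs */
    double *a = numa_alloc_onnode(1 << 20, node);   /* 1 MiB on that node */
    if (!a)
        return 1;
    /* ... work on 'a' from this pinned thread ... */
    numa_free(a, 1 << 20);
    return 0;
}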
Jim Dempsey