Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

NUMA Examples

duncanhopkins
Beginner
Hi,

I am trying to play with libnuma on the Intel MTL machines.

So far I have managed to set policy on threads and memory allocations, but I am seeing a large 4x slowdown when I try this.

Can anybody suggest some good examples or practical walk-throughs on using libnuma?
I have been able to find loads of copies of the man pages but very little in the way of actual discussion and working examples.

Thanks.
jimdempseyatthecove
Honored Contributor III
From my experimentation with libnuma on the MTL systems, I think there are (were) two issues, but your comment does not quite line up with my experience. The MTL system uses the RHEL distro, which is behind the Ubuntu distro in revision. You may have some versioning issues.

A few weeks ago, my experimentation led me to conclude that the BIOS settings were set to interleave memory from the NUMA nodes (IOW memory performs like one larger node). There is a different setting in the BIOS to represent each NUMA node as a separate block of physical memory. My observation was that the setting was interleaved, which is non-NUMA... meaning UMA (access times average out to uniform across random addresses).

An additional problem, not fully researched by me: I suspect some playing around with libnuma has set the default memory allocation policy to "first touch". While this feature may work well with non-NUMA-aware applications, it can also get in the way when your application attempts to manage data placement itself. I cannot attest that "first touch" is enabled. There may be an API to specify your default preference.

RE the 4x slowdown: the access-time difference on the MTL systems between the local node and a remote node (assuming NUMA is not set to interleave) is not 4x. I am not quite sure what it is, but I have seen some benchmarks on AnandTech that lead me to assume it is more like 1.3x (30% slower). If you are seeing 4x, then I suspect something quite different is going on.

Note, from the little working experience I have with "first touch", there seem to be additional tuning options; not knowing the name at this time, I'll call it "subsequent touch". "First touch" maps (assigns) virtual memory pages (4KB or 4MB) to the node of the logical processor that first touches the memory. The (whatever the name) "subsequent touch" says, in effect: mark this section of memory (e.g. an array) as not being accessible by any thread. Then, upon the next touch by any thread, one of two things happens: a) if touched by a thread on the same node, a simple remapping to the existing position on that physical NUMA node; or b) if touched by a thread on a different NUMA node, the data is copied wholesale to the other node, the memory on the prior node is released, and the virtual memory is remapped on the new NUMA node.

With this type of behind-the-scenes shenanigans you could see a 4x slowdown (+/- depending on how often the 4KB/4MB data copies occur).

When allocating/binding data to NUMA nodes it is almost a prerequisite that you affinity-pin your threads as well.
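For illustration, a minimal sketch of the pin-then-allocate pattern with the v2 libnuma API (the node number and buffer size are placeholders; link with -lnuma):

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }

    int node = 0;                      /* placeholder: pick your target node */
    size_t bytes = 64 * 1024 * 1024;   /* placeholder size */

    /* Pin the calling thread to the node first... */
    if (numa_run_on_node(node) != 0) {
        perror("numa_run_on_node");
        return 1;
    }

    /* ...then allocate; numa_alloc_local() now lands on the same node. */
    double *buf = numa_alloc_local(bytes);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_local failed\n");
        return 1;
    }

    /* ... use buf from threads pinned to this node ... */

    numa_free(buf, bytes);
    return 0;
}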

Jim Dempsey
jimdempseyatthecove
Honored Contributor III
FWIW

The man pages are absolutely horrible. Googling on the internet is worse, from the perspective that you do not know the quality of the information you receive, nor whether the information you find relates to the version of the library contained on your system. I consider myself a relatively smart programmer, with over 40 years of system-level programming (including authoring multiple operating systems with shared memory and clusters). Finding an authoritative set of programming guides for your specific distribution of libnuma is very hard. I am still not satisfied with the ones (several) I use: conflicting information, no good (lack of or bad) examples, lack of full documentation. Most of the documentation simply translates the function signature into a sentence. Hardly helpful.

If you want, I would be willing to take a cursory look at your code and make some comments. You can email me at

jim (dot) (zero) (zero) (dot) dempsey (at) gmail (dot) com

or at my business address

Jim Dempsey
jim@quickthreadprogramming.com
TimP
Honored Contributor III
Not understanding exactly what question is being asked, I'll nevertheless jump in.
NUMA BIOS options typically mean replacing a default setting, under which consecutive cache lines rotate among memory segments (interleaved, as Jim said), with a setting under which each memory block contains a contiguous set of cache lines. Such non-NUMA default settings, on a 2-socket platform, typically give you at least 50% of peak memory performance regardless of any efforts to set affinity. When you select the NUMA BIOS option, you take on the responsibility for setting affinity, as with KMP_AFFINITY for Intel OpenMP (GOMP_CPU_AFFINITY for gcc OpenMP). Even a single-threaded task may require running under taskset or an equivalent to maintain adequate performance. With MPI, openmpi has good facilities for NUMA affinity, if you invoke them in accordance with its FAQ. Intel MPI attempts to set useful defaults, at least for Intel platforms, but depending on the defaults may require you to upgrade when supporting a new multi-core CPU. Recent Intel MPI versions have full support for affinity on clusters of Intel NUMA nodes.
NUMA BIOS options first became relevant on Intel platforms with the Xeon Nehalem and Itanium Montecito multiple CPU server introductions.
As Jim mentioned, good OpenMP implementations (and presumably the Intel C++ specific threading libraries) support a first-touch scheme for automatically placing NUMA allocations according to where they are first used. Many practical applications don't fit the model supported by first touch, but you will be hard put to find any documentation of what can be expected in such cases.
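To illustrate first touch, a minimal sketch in C with OpenMP (the array size is arbitrary; this assumes the threads are pinned, e.g. via KMP_AFFINITY or GOMP_CPU_AFFINITY, or the placement is meaningless):

#include <omp.h>
#include <stdlib.h>

#define N (1 << 24)   /* arbitrary size for illustration */

int main(void)
{
    double *a = malloc(N * sizeof *a);

    /* First touch: each thread initializes the chunk it will later
       work on, so those pages are mapped on that thread's NUMA node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* Later compute loops should use the same schedule(static)
       chunking so threads access the pages they first touched. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = a[i] * 2.0 + 1.0;

    free(a);
    return 0;
}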
Those environment variable settings I just mentioned are simply (possibly) convenient ways of invoking what taskset does on Linux, or the equivalent facilities supported in the latest Windows versions. They will have no effect for pthreads (or Windows threading) unless a dummy OpenMP parallel region (including an OpenMP function call such as omp_get_num_threads) is set up so as to invoke the affinity facility.
Intel C++ specific threading facilities (Cilk+, TBB, CEAN) have their own affinity scheme.
I'm not catching on to any specific meaning of your MTL terminology.
duncanhopkins
Beginner
Hi Guys,
First off thanks for the help.

Yes, I have noticed the differences between libnuma on Ubuntu and RHEL. I do most of my development on Ubuntu (numa v2) and occasionally get a chance to test it on the MTL machines (numa v1), so I am playing with both versions of the API!

The 4x slowdown, instead of ~30%, I think is due to something I was trying: one data store per NUMA node, just for the threads on that node. So I now have four data stores on the go, which I think is why it is 4x.
With just one data store, the slowdown I see when I turn on NUMA support is closer to ~30%.

What I have been doing to use NUMA is as follows.

* Replace a number of new()/delete() calls with numa_alloc_onnode() and numa_alloc_local().
* When a thread starts up, use the following calls on it (see the sketch below):
. numa_run_on_node()
. numa_set_preferred()
or
. numa_bind()
Using these always seems to produce a slowdown. So to double-check that I have done these things correctly, I have been using numa_get_run_node_mask() and get_mempolicy() to obtain the threads' policies, which seem to be correct.
I have not been able to use get_mempolicy() or move_pages() to get the node location of any memory.
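For reference, the location check I have been attempting looks roughly like this (a sketch against the v2 headers; this is the part that has not worked for me on the MTL machines):

#include <numaif.h>
#include <stdio.h>

/* Report which node currently backs the page containing 'addr'. */
static int node_of_address(void *addr)
{
    int node = -1;
    /* With MPOL_F_NODE | MPOL_F_ADDR, get_mempolicy() writes the ID of
       the node holding the page at 'addr' into the first argument. */
    if (get_mempolicy(&node, NULL, 0, addr, MPOL_F_NODE | MPOL_F_ADDR) != 0) {
        perror("get_mempolicy");
        return -1;
    }
    return node;
}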

My suspicion as to why things are going slow is that the stack memory (and some other memory) is on a different node. Is that a normal problem?

How would I force pthread_create() to create a thread that only runs on a particular NUMA node and has its stack and other resources allocated on that node?
Or which methods should I use to move memory pages onto a particular node after allocation?

Thanks.
jimdempseyatthecove
Honored Contributor III
Check your runtime library API for a function to change the stack size. After you affinity-pin your thread, attempt to change your stack size (larger). This should cause an allocation. Then, depending on the allocation scheme in effect, it _may_ allocate on the current NUMA node (no guarantee from me on this). A more assured route would be to allocate a block and then do your own stack-swap magic. (However, I suggest digging around in the scant libnuma docs, as you may find that the API to release the association and remap takes a pointer and a byte count. This is the "first retouch" (or whatever it is called) that I referred to.) With this in place you can force your new stack allocations to come from the desired NUMA node.
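An alternate route that avoids the stack-swap magic (a sketch, untested by me; the stack size and node are placeholders): allocate the stack yourself on the target node with numa_alloc_onnode() and hand it to pthreads via pthread_attr_setstack():

#include <numa.h>
#include <pthread.h>
#include <stdint.h>

#define STACK_BYTES (8 * 1024 * 1024)   /* placeholder stack size */

static void *worker(void *arg)
{
    int node = (int)(intptr_t)arg;
    numa_run_on_node(node);   /* pin to the node the stack lives on */
    /* ... thread body: locals now sit in node-local stack pages ... */
    return NULL;
}

/* Create a thread whose stack is allocated on 'node'. */
static int create_thread_on_node(pthread_t *tid, int node)
{
    void *stack = numa_alloc_onnode(STACK_BYTES, node);
    if (stack == NULL)
        return -1;

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstack(&attr, stack, STACK_BYTES);

    int rc = pthread_create(tid, &attr, worker, (void *)(intptr_t)node);
    pthread_attr_destroy(&attr);
    /* NB: caller must numa_free(stack, STACK_BYTES) after joining */
    return rc;
}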

*** CAUTION ***

You may find it problematic to mark the virtual memory page the current stack pointer points into for first retouch. The reason is that the page fault for accessing the current (near) top of stack will double-fault (you use the stack, it faults, then the page-fault handler itself faults). So your strategy might be:

1) set your affinities
2) from main/PROGRAM, issue "first retouch" (or whatever it is called) for the address range:
(address of a current-scope local variable) - page size - your working stack size
through
(address of a current-scope local variable) - page size
3) then call a function, or enter a scope, that consumes a page-size object off the stack
4) then call mainSurrogate (code that now manages your application)

As I indicated earlier, an API might be available that would make the above code unnecessary.
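One candidate worth trying, if your kernel and libnuma are new enough (I have not verified this on the MTL's RHEL): move_pages() migrates already-allocated pages to a node of your choosing after the fact. A sketch, with the range and node as placeholders:

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Migrate the pages backing [addr, addr + len) to 'node'. */
static int migrate_range(void *addr, size_t len, int node)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    unsigned long npages = (len + pagesz - 1) / pagesz;

    void **pages  = malloc(npages * sizeof *pages);
    int   *nodes  = malloc(npages * sizeof *nodes);
    int   *status = malloc(npages * sizeof *status);

    for (unsigned long i = 0; i < npages; i++) {
        pages[i] = (char *)addr + i * pagesz;
        nodes[i] = node;
    }

    /* pid 0 = this process; status[i] gets the page's new node,
       or a negative errno if that page could not be moved */
    long rc = move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
    if (rc != 0)
        perror("move_pages");

    free(pages); free(nodes); free(status);
    return rc == 0 ? 0 : -1;
}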

P.S. If you identify the API, please report back here with your experience
e.g. "This works on NUMA v2.1.2.3"

Jim Dempsey