Solved: How can I manually allocate memory within a socket (NUMA node) on a multi-socket system?

Soh__Mingyun · ‎01-25-2018

I have a quad-socket system running RHEL, and I'm testing a job with a runtime of about one day.

It's important to bind the process to specific sockets, because NUMA placement causes runtime variation.

The problem is that once the job has used about 50% of node 0's memory, the OS starts giving the job memory from node 3.

I used "taskset -c", but the result is the same.

Can I make the job use all of node 0's memory before it uses another node's memory?

McCalpinJohn · ‎01-26-2018

The default allocation policy on Linux is "local", so if the job is bound to socket 0, it will attempt to allocate its pages on socket 0. When the free pages on socket 0 are depleted, pages will be allocated on other nodes. The command "numactl --localalloc" does not actually do anything, since local allocation is the default. As far as I can tell, the command "numactl --preferred=0" also does nothing (assuming that node 0 is the local node).

The "numactl" command does have an option to force pages to be allocated to a specified socket (or group of sockets) -- "numactl --membind=0". With this command, when the free pages on socket 0 are depleted, the OS will work very hard to free up more pages on socket 0. This includes writing back dirty filesystem cache pages, dropping clean filesystem cache pages, etc. If, after going to all this extra effort, the OS is unable to allocate a page on socket 0, the user job will be aborted.

What Linux is missing is the middle ground between these two policies -- one that would try hard to free up pages on the target node, but which would then allocate remotely if no more pages can be freed on the target node. Since Linux has no such policy, all you can do is attempt to emulate the steps taken by "numactl --membind". The easiest and most important of these is to drop the filesystem caches before starting your job. While this does not completely eliminate the excess OS memory usage on socket 0, it does significantly reduce it.
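As a rough sketch of the three policies discussed above, the helper below only prints the numactl command line for each one, so it can be inspected before anything is run. The function name numa_cmd and the job name ./my_job are hypothetical; the options --cpunodebind, --localalloc, --preferred, and --membind are the standard numactl ones.

```shell
#!/bin/sh
# Hypothetical helper (not part of numactl): print the numactl command line
# that pins a job's CPUs to one node and selects a memory policy for it.
numa_cmd() {
    policy=$1
    node=$2
    shift 2
    case "$policy" in
        # Default behavior: allocate on the node the thread runs on, spill when full.
        local)     echo "numactl --cpunodebind=$node --localalloc $*" ;;
        # Start with the given node, fall back to other nodes when it is full.
        preferred) echo "numactl --cpunodebind=$node --preferred=$node $*" ;;
        # Strict: reclaim hard on the node, and fail the allocation if that is not enough.
        bind)      echo "numactl --cpunodebind=$node --membind=$node $*" ;;
        *)         echo "unknown policy: $policy" >&2; return 1 ;;
    esac
}
```

For example, "numa_cmd bind 0 ./my_job" prints the strict-binding command, which can then be run by hand once the cache usage on node 0 has been checked.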

Soh__Mingyun · ‎01-30-2018

Thank you very much for your teaching.

Could you let me know how to drop the filesystem caches?

McCalpinJohn · ‎01-31-2018

Dropping caches in Linux can be done using two equivalent interfaces, and can be done at three different levels of aggressiveness.

Reference: https://www.kernel.org/doc/Documentation/sysctl/vm.txt (search for "drop_caches" in the page).

The two (equivalent) approaches (which must be run as root) are:

sync; echo 1 > /proc/sys/vm/drop_caches
sync; /sbin/sysctl -w vm.drop_caches=1

In our batch production environment, we run this in the epilog of every job. This has dramatically reduced the problems that we have seen with NUMA allocation failures on socket 0 due to excessive OS memory usage there.
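The three levels of aggressiveness mentioned above correspond, per the kernel's vm.txt, to writing 1 (free the page cache), 2 (free dentries and inodes), or 3 (free both). A minimal sketch of an epilog-style helper follows; the function name drop_caches and the DROP_CACHES_FILE override are assumptions added here so that the function can be tried on a scratch file without root.

```shell
#!/bin/sh
# Sketch of a cache-dropping helper for a job prolog/epilog.
# Levels (kernel vm.txt): 1 = page cache, 2 = dentries and inodes, 3 = both.
# DROP_CACHES_FILE is a hypothetical override used for dry runs; by default
# the real kernel interface is written, which requires root.
drop_caches() {
    level=${1:-1}
    target=${DROP_CACHES_FILE:-/proc/sys/vm/drop_caches}
    case "$level" in
        1|2|3) ;;
        *) echo "level must be 1, 2, or 3" >&2; return 2 ;;
    esac
    # Dirty pages cannot be dropped, so write them back first.
    sync
    echo "$level" > "$target"
}
```

Calling "drop_caches 1" as root matches the two commands shown above; a non-root caller gets a non-zero return status from the failed write.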

Soh__Mingyun · ‎02-15-2018

That helped a lot, thanks.
I run a test that takes about 24 hours on 4 cores, but NUMA causes runtime variation.
I bound the cores and the variation was reduced, but runtime variation still occurs sometimes, and I suspect memory is the cause.
numactl --membind causes problems when memory is insufficient.
So I'm thinking of using numactl --physcpubind --preferred. Is there anything wrong with my thinking? Is this the right way to control memory?
It seems that the file system cache is counted as used memory, so the job cannot use enough of the node's memory and spills over to another node. Is that right?
Or is the job using the memory of the bound node, with the data there then being moved to the memory of another node? It's a hard world..

jimdempseyatthecove · ‎02-15-2018

>>It seems that the file system cache is included in used memory...

You should experiment with reducing/restricting the file system cache.

Jim Dempsey
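One way to check how much of a node's memory is held by the file system cache is to read the per-node meminfo file that Linux exposes under /sys/devices/system/node. The sketch below assumes the usual "Node 0 MemFree: ... kB" line format of that file; the function name node_mem_summary is hypothetical.

```shell
#!/bin/sh
# Summarize free memory and file-cache usage for one NUMA node.
# Reads a per-node meminfo file on stdin, with lines such as
#   Node 0 MemFree:   123456 kB
#   Node 0 FilePages:  65432 kB
node_mem_summary() {
    awk '$3 == "MemFree:"   { free = $4 }
         $3 == "FilePages:" { cache = $4 }
         END { printf "free_kB=%d filecache_kB=%d\n", free, cache }'
}
```

Typical use would be "node_mem_summary < /sys/devices/system/node/node0/meminfo", run before and after dropping caches to see how much memory on node 0 was returned.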

Soh__Mingyun · ‎02-19-2018

I checked, and the file system cache is included in used memory.

But I'm trying to get a good result without dropping the file system cache.

McCalpinJohn · ‎02-20-2018

It is definitely a good idea to use thread binding to prevent the working processes and threads from moving away from their data. The "numactl" and "taskset" commands are both suitable for cases where the process's threads do not span multiple NUMA nodes. Binding processes to nodes (with the "--cpunodebind" option) is sufficient to ensure NUMA affinity. Binding to specific logical processors is typically only required if you are running with HyperThreading enabled, and you want to run only one process (or thread) per physical core, and your OS is not smart enough to schedule your processes on separate physical processors.

As far as I can tell (e.g., from https://www.kernel.org/doc/Documentation/vm/numa_memory_policy.txt), Linux supports only four policies:

Default (almost always "local, preferred")
Bind (specific target(s), mandatory)
Preferred (like "local", but starting with specified target node(s), rather than the local node)
Interleaved

Although I can't find it discussed in the Linux documentation, my experience has been that only the "Bind" option goes to the extra effort of freeing pages in order to satisfy the policy.
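For the HyperThreading case mentioned above, one logical processor per physical core can be picked from the parseable output of lscpu. This is a sketch under the assumption that the input comes from "lscpu -p=CPU,CORE,SOCKET,NODE"; the function name first_thread_per_core is hypothetical.

```shell
#!/bin/sh
# Print a comma-separated list holding the first logical CPU of every
# physical core on the given NUMA node.
# Expects "lscpu -p=CPU,CORE,SOCKET,NODE" output on stdin.
first_thread_per_core() {
    node=$1
    awk -F, -v node="$node" '
        /^#/ { next }                         # skip the lscpu header comments
        $4 == node && !seen[$3 "," $2]++ {    # first CPU seen for this socket/core pair
            printf "%s%s", sep, $1
            sep = ","
        }
        END { print "" }'
}
```

The resulting list can be passed to numactl, for example: numactl --physcpubind="$(lscpu -p=CPU,CORE,SOCKET,NODE | first_thread_per_core 0)" --preferred=0 ./my_job (where ./my_job is a placeholder).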

Soh__Mingyun · ‎04-06-2018

Dear Dr. McCalpin,

I really appreciate your explanations.

They were very helpful to my work.

I solved my problem with the "numad" service daemon.

But one problem occurred.

I tested on Intel V3, V5, and E5-4655 (4 CPUs, 24 cores) machines.

On V3 and V5, there's no problem using numad with a multi-process job.

But on the E5-4655 machine, the processors did not run as I intended.

For example, when I ran an 8-CPU job, only the processors of one NUMA node (6 CPUs) were used.

13-CPU and 16-CPU jobs showed the same behavior (I used -u 100/110).

It's really puzzling (I turned off hyperthreading).

Do you have any idea about this?

Your thoughts would be highly appreciated.

Respectfully,

Min-Gyun Soh

jimdempseyatthecove · ‎04-06-2018

The motherboard BIOS has a setting that permits all memory to be viewed as one node (often this is called Interleaved) or as multiple nodes. Note, I have observed some BIOS manuals use the term interleaved backwards.

Jim Dempsey

McCalpinJohn · ‎04-06-2018

I don't know anything about "numad" -- we don't seem to have it installed on our systems -- so I don't know whether the binding problem you are seeing on the Xeon E5-4655 system is related to "numad" or to something else.

Gandharv_K_ · ‎09-10-2018

Hi Guys,

This is great information, thank you! I do have a follow-up question. What are my options on Windows to achieve the correct NUMA mapping?

I found the following from this link:

On Windows* OS, there isn’t a command equivalent to numactl. When NUMA is enabled on Windows* OS, the only memory allocation policy is “local”. For applications that need interleaved memory mapping across nodes on a multi-socket machine, NUMA has to be disabled.

Is this really the case, or are there any new tools like numactl that I can use on Windows to control NUMA mapping? I also came across a NUMA API for Windows, but I don't think it is supported on Linux. It is critical that I find a way of controlling NUMA mapping in an OS-agnostic way. Any help will be much appreciated.

Thanks,

Gandharv

McCalpinJohn · ‎09-10-2018

There is an extensive library of NUMA APIs for Linux systems. See https://linux.die.net/man/3/numa

Gandharv_K_ · ‎09-10-2018

Linux is not the issue. I want something for Windows and specifically something that works on both Windows and Linux.

anh__ngoc · ‎09-12-2018

If you create a Multi-Vendor Team, you must manually verify that the RSS settings

Umesh__Deepthi · ‎09-22-2018

Hello John McCalpin et al..

This is my first post on the dev forum, so pardon me if my question is very silly, but I need your kind help. I am working on a research project on NUMA-targeted optimization for scheduling threads. I am trying to figure out how to determine, per thread and for each memory access, which NUMA node issued the access and which NUMA node served it.

I am using Intel VTune and running matrix multiplication and PARSEC benchmark programs. According to VTune, the percentage of remote accesses is >40%. I would like to know the entire thread life cycle and memory footprint. Could you kindly give me some input on how to capture such information?

Is using perf_event_open one way to do it? Kindly help.

Thank you in advance

Deepthi

jimdempseyatthecove · ‎10-20-2018

From my experience with the PARSEC benchmark (a few years ago), memory allocation is via malloc or new. On a NUMA structured machine, at process launch, a Virtual Machine is established for the process and Virtual Address space is established, but no physical RAM is allocated, nor are page file pages provided to back up the not yet allocated physical RAM. As memory is used to load the application, physical RAM and page file pages are allocated as memory is touched (written or read). The physical RAM will (usually) be allocated from the memory node of the thread performing the touch. The portion of the application heap that has not been touched (assuming the heap is not wiped at startup, and excluding pages such as the one holding the node header) is not assigned to any memory node. As malloc/new allocations are made from previously unallocated memory, the pages will be mapped from the node of the thread performing the first touch. Note that the heap allocation allocates Virtual Address space, whereas the actual memory allocation occurs later, upon first touch.

With the above in mind, when you wish to optimize NUMA node access (by thread), it behooves you to ensure that the first touch of each memory area is performed by the (affinity-pinned) thread that will perform the majority of the accesses to it. As an example, for the matrix multiplication, construct an analog of the multiplication that performs the initialization of the arrays. IOW, do not perform the initialization in the main thread. Instead, perform the initialization using the same parallel construct as for the multiplication.

Additionally, while huge memory pages will reduce the TLB load frequency, the use of huge pages increases the page granularity, and thus may make it difficult to reduce inter-node accesses. IOW, for NUMA-efficient use, use small page sizes (YMMV).

Jim Dempsey
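To check where first touch actually placed a process's pages, Linux exposes per-mapping node counts in /proc/<pid>/numa_maps as fields of the form N0=..., N1=..., and so on. The sketch below sums those fields per node; the function name pages_per_node is hypothetical, and the counts are in pages of each mapping's own page size, so mappings backed by huge pages are not directly comparable to small-page mappings.

```shell
#!/bin/sh
# Sum the per-node page counts (fields such as N0=123) over all mappings
# in a numa_maps file read from stdin. Prints one "N<k> <pages>" line per node.
pages_per_node() {
    awk '{
            for (i = 1; i <= NF; i++)
                if ($i ~ /^N[0-9]+=[0-9]+$/) {
                    split($i, kv, "=")
                    pages[kv[1]] += kv[2]
                }
         }
         END { for (n in pages) print n, pages[n] }' | sort
}
```

Running "pages_per_node < /proc/<pid>/numa_maps" after the initialization phase shows whether the arrays ended up spread across the nodes of the threads that touched them, or concentrated on the main thread's node.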

McCalpinJohn · ‎10-22-2018

The VTune user guide provides guidance on what views to look at to help understand NUMA issues -- https://software.intel.com/en-us/vtune-amplifier-help-memory-usage-view

For whole-program measurements, the Linux "perf mem" uses the Intel Load Latency facility to randomly sample loads and report where the load found its data. The report gives the virtual address being loaded (which can be compared to the virtual addresses of the major data structures) and reports the location where the data was found. (Note that cache hits don't guarantee that the data was actually local -- the data may have been moved into the cache by a hardware prefetch operation, and the counter only records where the "load" instruction found the data, not whether a hardware prefetch had moved that cache line recently.)