<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic &gt;&gt;It seems that the file in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129054#M7655</link>
    <description>&lt;P&gt;&amp;gt;&amp;gt;&lt;EM&gt;It seems that the file system cache is included in used memory...&lt;/EM&gt;&lt;/P&gt;

&lt;P&gt;You should experiment with reducing/restricting the file system cache.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
    <pubDate>Thu, 15 Feb 2018 15:47:26 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2018-02-15T15:47:26Z</dc:date>
    <item>
      <title>How can I allocate memory within a socket (NUMA node) manually in a multi-socket system?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129049#M7650</link>
      <description>&lt;P&gt;I have a quad-socket system running RHEL, and I'm testing a job that takes about one day to run.&lt;/P&gt;

&lt;P&gt;So it's important to bind the process to a socket, because crossing NUMA nodes causes runtime variation.&lt;/P&gt;

&lt;P&gt;The problem is that when the job has used about 50% of node 0's memory, the OS starts giving the job node 3's memory.&lt;/P&gt;

&lt;P&gt;I used "taskset -c", but it shows the same result.&lt;/P&gt;

&lt;P&gt;Can I make the job use all of node 0's memory first, and only then use another node's memory?&lt;/P&gt;</description>
      <pubDate>Fri, 26 Jan 2018 06:50:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129049#M7650</guid>
      <dc:creator>Soh__Mingyun</dc:creator>
      <dc:date>2018-01-26T06:50:40Z</dc:date>
    </item>
    <item>
      <title>The default allocation policy</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129050#M7651</link>
      <description>&lt;P&gt;The default allocation policy on Linux is "local", so if the job is bound to socket 0, it will attempt to allocate its pages on socket 0.&amp;nbsp; When the free pages on socket 0 are depleted, pages will be allocated on other nodes.&amp;nbsp; The command "numactl --localalloc" does not actually do anything, since local allocation is the default.&amp;nbsp; As far as I can tell, the command "numactl --preferred=0" also does nothing (assuming that node 0 is the local node).&lt;/P&gt;

&lt;P&gt;The "numactl" command does have an option to &lt;STRONG&gt;force&lt;/STRONG&gt; pages to be allocated to a specified socket (or group of sockets) -- "numactl --membind=0".&amp;nbsp; With this command, when the free pages on socket 0 are depleted, the OS will work very hard to free up more pages on socket 0. This includes writing back dirty filesystem cache pages, dropping clean filesystem cache pages, etc.&amp;nbsp;&amp;nbsp; If, after going to all this extra effort, the OS is unable to allocate a page on socket 0, the user job will be aborted.&lt;/P&gt;

&lt;P&gt;What Linux is missing is the middle ground between these two policies -- one that would try hard to free up pages on the target node, but which would then allocate remotely if no more pages can be freed on the target node.&amp;nbsp;&amp;nbsp; Since Linux has no such policy, all you can do is attempt to emulate the steps taken by "numactl --membind".&amp;nbsp; The easiest and most important of these is to drop the filesystem caches before starting your job.&amp;nbsp; While this does not completely eliminate the excess OS memory usage on socket 0, it does significantly reduce it.&lt;/P&gt;</description>
      <pubDate>Fri, 26 Jan 2018 15:23:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129050#M7651</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-01-26T15:23:21Z</dc:date>
    </item>
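To make the distinction above concrete, here is a minimal sketch (a hypothetical helper, in Python) that assembles the numactl invocations discussed: --preferred starts allocating on the chosen node but silently spills elsewhere, while --membind is mandatory and can get the job aborted when the node runs out of pages.

```python
def numactl_argv(node, argv, strict=False):
    """Build an argv that runs `argv` with CPUs and memory bound to one
    NUMA node. Hypothetical helper; the flags are standard numactl options.

    strict=False uses --preferred: start on `node`, silently spill elsewhere.
    strict=True  uses --membind:   mandatory; allocation failure aborts the job.
    """
    mem = "--membind" if strict else "--preferred"
    return ["numactl", "--cpunodebind=%d" % node, "%s=%d" % (mem, node)] + list(argv)

# Example: best-effort binding of ./job to node 0.
cmd = numactl_argv(0, ["./job"])
```

The --cpunodebind part keeps the threads on the same socket as the memory, which is what makes the "local" default (or --preferred) land pages where you want them.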
    <item>
      <title>Thank you very much for your</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129051#M7652</link>
      <description>&lt;P&gt;Thank you very much for the explanation.&lt;/P&gt;

&lt;P&gt;Could you let me know how to drop the filesystem caches?&lt;/P&gt;

</description>
      <pubDate>Wed, 31 Jan 2018 05:11:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129051#M7652</guid>
      <dc:creator>Soh__Mingyun</dc:creator>
      <dc:date>2018-01-31T05:11:25Z</dc:date>
    </item>
    <item>
      <title>Dropping caches in Linux can</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129052#M7653</link>
      <description>&lt;P&gt;Dropping caches in Linux can be done using two equivalent interfaces, and can be done at three different levels of aggressiveness.&lt;/P&gt;

&lt;P&gt;Reference: &lt;A href="https://www.kernel.org/doc/Documentation/sysctl/vm.txt" target="_blank"&gt;https://www.kernel.org/doc/Documentation/sysctl/vm.txt&lt;/A&gt; (search for "drop_caches" in the page).&lt;/P&gt;

&lt;P&gt;The two (equivalent) approaches (which must be run as root) are:&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;sync; echo 1 &amp;gt; /proc/sys/vm/drop_caches&lt;/LI&gt;
	&lt;LI&gt;sync; /sbin/sysctl -w vm.drop_caches=1&lt;/LI&gt;
&lt;/UL&gt;

&lt;P&gt;In our batch production environment, we run this in the epilog of every job.&amp;nbsp; This has dramatically reduced the problems that we have seen with NUMA allocation failures on socket 0 due to excessive OS memory usage there.....&lt;/P&gt;</description>
      <pubDate>Wed, 31 Jan 2018 16:15:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129052#M7653</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-01-31T16:15:37Z</dc:date>
    </item>
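Per the kernel document linked above, drop_caches accepts three levels of aggressiveness: 1 frees the page cache, 2 frees reclaimable slab objects (dentries and inodes), and 3 frees both. Since actually running the commands requires root, here is a small sketch that only assembles the command lines for the two equivalent interfaces:

```python
# Levels documented in Documentation/sysctl/vm.txt (drop_caches).
PAGECACHE, SLAB, BOTH = 1, 2, 3

def drop_caches_cmdline(level=PAGECACHE, use_sysctl=False):
    """Return the shell command (must be run as root) for one of the two
    equivalent interfaces. `sync` runs first so dirty pages are written
    back and become droppable."""
    if level not in (1, 2, 3):
        raise ValueError("drop_caches level must be 1, 2 or 3")
    if use_sysctl:
        return "sync; /sbin/sysctl -w vm.drop_caches=%d" % level
    return "sync; echo %d > /proc/sys/vm/drop_caches" % level
```

Running this in a batch epilog, as described above, amounts to executing drop_caches_cmdline(PAGECACHE) between jobs.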
    <item>
      <title>A lot has helped, thanks.</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129053#M7654</link>
      <description>&lt;P&gt;That has helped a lot, thanks.&lt;BR /&gt;
	I run a test that takes about 24 hours on 4 cores, but NUMA causes runtime variation.&lt;BR /&gt;
	So I bound the cores and the variation was reduced, but runtime variation still occurs sometimes, and I suspect memory is the cause.&lt;BR /&gt;
	numactl --membind&amp;nbsp;causes problems when memory is insufficient.&lt;BR /&gt;
	So I'm thinking of using numactl --physcpubind --preferred. Is there anything wrong with my reasoning? Is that the right way to control memory placement?&lt;BR /&gt;
	It seems that the file system cache is counted as used memory, so the job cannot use enough of the node's memory and spills over to another node. Is that right?&lt;BR /&gt;
	Or is the job using the binding node's memory, and the OS is then migrating that data to another node's memory? It's a hard world..&lt;/P&gt;

</description>
      <pubDate>Thu, 15 Feb 2018 14:44:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129053#M7654</guid>
      <dc:creator>Soh__Mingyun</dc:creator>
      <dc:date>2018-02-15T14:44:28Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;It seems that the file</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129054#M7655</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&lt;EM&gt;It seems that the file system cache is included in used memory...&lt;/EM&gt;&lt;/P&gt;

&lt;P&gt;You should experiment with reducing/restricting the file system cache.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 15 Feb 2018 15:47:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129054#M7655</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-02-15T15:47:26Z</dc:date>
    </item>
    <item>
      <title>I checked file system cache</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129055#M7656</link>
      <description>&lt;P&gt;I confirmed that the file system cache is included in used memory.&lt;/P&gt;

&lt;P&gt;But I'm trying to get a good result without dropping the file system cache.&lt;/P&gt;

</description>
      <pubDate>Tue, 20 Feb 2018 01:23:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129055#M7656</guid>
      <dc:creator>Soh__Mingyun</dc:creator>
      <dc:date>2018-02-20T01:23:44Z</dc:date>
    </item>
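One way to see how much of each node's memory the page cache occupies, without dropping anything, is to read the per-node files /sys/devices/system/node/nodeN/meminfo. A minimal parser sketch, assuming the Linux sysfs line layout (e.g. "Node 0 FilePages: 123456 kB"):

```python
import re

def parse_node_meminfo(text):
    """Parse per-node meminfo text into {field: kilobytes}.
    Lines without a trailing 'kB' unit (e.g. HugePages counts) are skipped."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"Node\s+\d+\s+(\w+):\s+(\d+)\s*kB", line.strip())
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields

# Illustrative sample (made-up numbers in the sysfs format):
sample = """Node 0 MemTotal:       65794368 kB
Node 0 MemFree:        12345678 kB
Node 0 FilePages:       8000000 kB"""
info = parse_node_meminfo(sample)
```

Comparing MemFree and FilePages across nodes before a run shows whether the cache is concentrated on node 0, which is the situation the drop_caches advice above addresses.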
    <item>
      <title>It is definitely a good idea</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129056#M7657</link>
      <description>&lt;P&gt;It is definitely a good idea to use thread binding to prevent the working processes and threads from moving away from their data.&amp;nbsp;&amp;nbsp; The "numactl" and "taskset" commands are both suitable for cases where the process's threads do not span multiple NUMA nodes.&amp;nbsp; Binding processes to nodes (with the "--cpunodebind" option) is sufficient to ensure NUMA affinity.&amp;nbsp; Binding to specific logical processors is typically only required if you are running with HyperThreading enabled, and you want to run only one process (or thread) per physical core, and your OS is not smart enough to schedule your processes on separate physical processors.&lt;/P&gt;

&lt;P&gt;As far as I can tell (e.g., from &lt;A href="https://www.kernel.org/doc/Documentation/vm/numa_memory_policy.txt" target="_blank"&gt;https://www.kernel.org/doc/Documentation/vm/numa_memory_policy.txt&lt;/A&gt;), Linux supports only four policies:&lt;/P&gt;

&lt;OL&gt;
	&lt;LI&gt;Default (almost always "local, preferred")&lt;/LI&gt;
	&lt;LI&gt;Bind (specific target(s), mandatory)&lt;/LI&gt;
	&lt;LI&gt;Preferred (like "local", but starting with specified target node(s), rather than the local node)&lt;/LI&gt;
	&lt;LI&gt;Interleaved&lt;/LI&gt;
&lt;/OL&gt;

&lt;P&gt;Although I can't find it discussed in the Linux documentation, my experience has been that only the "Bind" option goes to the extra effort of freeing pages in order to satisfy the policy.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Feb 2018 17:30:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129056#M7657</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-02-20T17:30:20Z</dc:date>
    </item>
    <item>
      <title>Hi, Dr. McCalpin.</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129057#M7658</link>
      <description>&lt;P&gt;Dear Dr. McCalpin,&lt;/P&gt;

&lt;P&gt;I really appreciate your guidance.&lt;/P&gt;

&lt;P&gt;It was very helpful to my work.&lt;/P&gt;

&lt;P&gt;I solved my problem with the "numad" service daemon.&lt;/P&gt;

&lt;P&gt;But one problem occurred.&lt;/P&gt;

&lt;P&gt;I tested on Intel v3, v5, and E5-4655 (4 CPUs, 24 cores) machines.&lt;/P&gt;

&lt;P&gt;On v3 and v5, there's no problem using numad with a multi-process job.&lt;/P&gt;

&lt;P&gt;But on the E5-4655 machine, the processes did not run as I intended.&lt;/P&gt;

&lt;P&gt;For example, when I ran an 8-CPU job, the processes ran in only 1 NUMA node (6 CPUs).&lt;/P&gt;

&lt;P&gt;13-CPU and 16-CPU jobs showed the same phenomenon (I used -u 100/110).&lt;/P&gt;

&lt;P&gt;It's really puzzling (I turned off hyperthreading).&lt;/P&gt;

&lt;P&gt;Do you have any idea about this?&lt;/P&gt;

&lt;P&gt;Any ideas would be highly appreciated.&lt;/P&gt;

&lt;P&gt;Respectfully,&lt;/P&gt;

&lt;P&gt;Min-Gyun Soh&lt;/P&gt;

</description>
      <pubDate>Fri, 06 Apr 2018 16:12:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129057#M7658</guid>
      <dc:creator>Soh__Mingyun</dc:creator>
      <dc:date>2018-04-06T16:12:00Z</dc:date>
    </item>
    <item>
      <title>The motherboard BIOS has a</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129058#M7659</link>
      <description>&lt;P&gt;The motherboard BIOS has a setting that permits all memory to be viewed as one node (often this is called Interleaved) or as multiple nodes. Note, I have observed some BIOS manuals use the term interleaved backwards.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Fri, 06 Apr 2018 16:51:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129058#M7659</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-04-06T16:51:11Z</dc:date>
    </item>
    <item>
      <title>I don't know anything about</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129059#M7660</link>
      <description>&lt;P&gt;I don't know anything about "numad" -- we don't seem to have it installed on our systems -- so I don't know whether the binding problem you are seeing on the Xeon E5-4655 system is related to "numad" or to something else....&lt;/P&gt;</description>
      <pubDate>Fri, 06 Apr 2018 18:45:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129059#M7660</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-04-06T18:45:03Z</dc:date>
    </item>
    <item>
      <title>Hi Guys,</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129060#M7661</link>
      <description>&lt;P&gt;Hi Guys,&lt;/P&gt;

&lt;P&gt;This is great information, thank you! I do have a follow-up question. What are my options on Windows to achieve the correct NUMA mapping?&amp;nbsp;&lt;/P&gt;

&lt;P&gt;I found the following at this &lt;A href="https://software.intel.com/en-us/articles/intel-mkl-numa-notes"&gt;link&lt;/A&gt;:&lt;/P&gt;

&lt;P&gt;On Windows* OS, there isn’t a command equivalent to numactl. When NUMA is enabled on Windows* OS, the only memory allocation policy is “local”. For applications that need interleaved memory mapping across nodes on a multi-socket machine, NUMA has to be disabled.&lt;/P&gt;

&lt;P&gt;Is this really the case, or are there any newer tools like numactl&amp;nbsp;that I can use on Windows to control NUMA mapping? I also came across a &lt;A href="https://docs.microsoft.com/en-us/windows/desktop/procthread/numa-support"&gt;NUMA API&lt;/A&gt;&amp;nbsp;for Windows, but I don't think it is supported on Linux. It is critical that I find a way of controlling NUMA mapping in an OS-agnostic way. Any help will be much appreciated.&lt;/P&gt;

&lt;P&gt;Thanks,&lt;/P&gt;

&lt;P&gt;Gandharv&lt;/P&gt;</description>
      <pubDate>Mon, 10 Sep 2018 15:28:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129060#M7661</guid>
      <dc:creator>Gandharv_K_</dc:creator>
      <dc:date>2018-09-10T15:28:34Z</dc:date>
    </item>
    <item>
      <title>There is an extensive library</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129061#M7662</link>
      <description>&lt;P&gt;There is an extensive library of NUMA APIs for Linux systems.&amp;nbsp; See &lt;A href="https://linux.die.net/man/3/numa" target="_blank"&gt;https://linux.die.net/man/3/numa&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Sep 2018 17:37:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129061#M7662</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-09-10T17:37:23Z</dc:date>
    </item>
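For programmatic control from within a process, the libnuma API referenced above can also be reached from other languages. A minimal ctypes probe sketch (hypothetical helper name; numa_available() is the real libnuma entry point and returns -1 when NUMA is unsupported):

```python
import ctypes
import ctypes.util

def numa_is_usable():
    """Best-effort probe: True only if libnuma is installed and the kernel
    reports NUMA support. Wraps libnuma's numa_available()."""
    path = ctypes.util.find_library("numa")
    if path is None:
        return False  # libnuma is not installed on this system
    libnuma = ctypes.CDLL(path)
    return libnuma.numa_available() != -1
```

On a NUMA-capable box one would go on to call, e.g., numa_alloc_onnode() or numa_run_on_node(), both documented in the numa(3) man page linked above.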
    <item>
      <title>Linux only is not the issue.</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129062#M7663</link>
      <description>&lt;P&gt;Linux is not the issue. I want something for Windows and specifically something that works on both Windows and Linux.&lt;/P&gt;</description>
      <pubDate>Mon, 10 Sep 2018 22:27:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129062#M7663</guid>
      <dc:creator>Gandharv_K_</dc:creator>
      <dc:date>2018-09-10T22:27:00Z</dc:date>
    </item>
    <item>
      <title> If you create a Multi-Vendor</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129063#M7664</link>
      <description>&lt;P&gt;If you create a Multi-Vendor Team, you must manually verify that the RSS settings&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 13 Sep 2018 03:56:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129063#M7664</guid>
      <dc:creator>anh__ngoc</dc:creator>
      <dc:date>2018-09-13T03:56:40Z</dc:date>
    </item>
    <item>
      <title> </title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129064#M7665</link>
      <description>&lt;P&gt;Hello John McCalpin et al.&lt;/P&gt;

&lt;P&gt;This is my first post on the dev forum, and pardon me if my question is very silly, but I need your kind help. I am working on a research project on NUMA-targeted optimization for scheduling threads. I am trying to figure out how to know, per thread and for each memory access, from which NUMA node the access was made and to which NUMA node it went.&lt;/P&gt;

&lt;P&gt;I am using Intel VTune and running matrix multiplication and PARSEC benchmark programs. According to VTune the percentage of remote accesses is &amp;gt;40%. I would like to know the entire thread life cycle and memory footprint. Could you kindly give me some input on how to capture such information?&lt;/P&gt;

&lt;P&gt;Is using perf_event_open one way to do it? Kindly help.&lt;/P&gt;

&lt;P&gt;Thank you in advance,&lt;/P&gt;

&lt;P&gt;Deepthi&lt;/P&gt;</description>
      <pubDate>Sat, 22 Sep 2018 07:05:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129064#M7665</guid>
      <dc:creator>Umesh__Deepthi</dc:creator>
      <dc:date>2018-09-22T07:05:51Z</dc:date>
    </item>
    <item>
      <title>From my experience of the</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129065#M7666</link>
      <description>&lt;P&gt;From my experience with the PARSEC benchmark (a few years ago), memory allocation is via malloc or new. On a NUMA-structured machine, at process launch, a virtual machine is established for the process and virtual address space is established, but no physical RAM is allocated, nor are page file pages&amp;nbsp;provided to back the not-yet-allocated physical RAM. As memory is used to load the application, physical RAM and page file pages are allocated as the memory is touched (written or read). The physical RAM will (usually) be allocated from the memory node of the thread performing the touch. The application heap (assuming it is not touched, e.g. wiped), for the portion that is not touched (e.g. not the page of the node header), is not assigned to any memory node. As malloc/new allocations are made in previously unallocated memory, the pages will be mapped&amp;nbsp;from the node of the thread performing the first touch. Note, a heap allocation reserves virtual address space, whereas the actual memory allocation occurs later, upon first touch.&lt;/P&gt;

&lt;P&gt;Now then, with the above in mind, when you wish to optimize NUMA node access (by thread), it behooves you to assure that the first touch of the appropriate memory areas are performed by the (affinity pinned) thread that will perform the majority of the memory accesses. As an example, for the matrix multiplication example, construct an analog of the multiplication that performs the initialization of the arrays. IOW do not perform the initialization via the main thread. Instead, using the same parallel construct as for the multiplication, perform the initialization.&lt;/P&gt;

&lt;P&gt;Additionally, while huge memory pages will reduce the TLB load frequency, the use of huge pages increases the page granularity, and thus may make it difficult to reduce inter-node accesses. IOW for NUMA efficient use, use small page sizes (YMMV).&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Sat, 20 Oct 2018 15:20:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129065#M7666</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-10-20T15:20:14Z</dc:date>
    </item>
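The first-touch pattern described above, i.e. initialize each array region with the same thread partitioning the compute phase will use, can be sketched as follows. Python threads are not NUMA-pinned, so this only illustrates the shape of the pattern; in C/OpenMP you would pin the threads and let each one fault in its own slice.

```python
import mmap
from concurrent.futures import ThreadPoolExecutor

def first_touch_init(buf, nworkers):
    """Zero each worker's slice from within that worker, so each page is
    first faulted (and therefore, on Linux, NUMA-placed) by the thread
    that owns it. Returns the number of bytes touched."""
    n = len(buf)
    chunk = -(-n // nworkers)  # ceiling division
    def touch(i):
        lo = i * chunk
        hi = min(lo + chunk, n)
        if hi - lo > 0:
            buf[lo:hi] = b"\x00" * (hi - lo)
        return max(hi - lo, 0)
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        return sum(pool.map(touch, range(nworkers)))

# Example: fault in an anonymous 1 MiB mapping with 4 workers before computing.
buf = mmap.mmap(-1, 1048576)
touched = first_touch_init(buf, 4)
```

The key point matches the post: do not initialize from the main thread; use the same parallel decomposition for initialization as for the computation.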
    <item>
      <title>The VTune user guide provides</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129066#M7667</link>
      <description>&lt;P&gt;The VTune user guide provides guidance on what views to look at to help understand NUMA issues -- &lt;A href="https://software.intel.com/en-us/vtune-amplifier-help-memory-usage-view" target="_blank"&gt;https://software.intel.com/en-us/vtune-amplifier-help-memory-usage-view&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;For whole-program measurements, the Linux "perf mem" command uses the Intel Load Latency facility to randomly sample loads and report where each load found its data.&amp;nbsp; The report gives the virtual address being loaded (which can be compared to the virtual addresses of the major data structures) and reports the location where the data was found.&amp;nbsp;&amp;nbsp; (Note that cache hits don't guarantee that the data was actually local -- the data may have been moved into the cache by a hardware prefetch operation, and the counter only records where the "load" instruction found the data, not whether a hardware prefetch had moved that cache line recently.)&lt;/P&gt;</description>
      <pubDate>Mon, 22 Oct 2018 16:59:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/How-can-allocate-memory-within-socket-NUMA-node-manually-in/m-p/1129066#M7667</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-10-22T16:59:51Z</dc:date>
    </item>
  </channel>
</rss>

