<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hi, in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bandwidth-on-a-NUMA-system/m-p/1095837#M5726</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;Low-core-count parts have lower remote memory bandwidth in the default configuration, which uses early snoop mode. If you change the snoop mode to "Home snoop" in the BIOS, you will see much higher remote memory bandwidth. Latency will increase slightly, but bandwidth will improve.&lt;/P&gt;

&lt;P&gt;Vish&lt;/P&gt;</description>
    <pubDate>Thu, 03 Dec 2015 17:07:24 GMT</pubDate>
    <dc:creator>Krishnaswa_V_Intel</dc:creator>
    <dc:date>2015-12-03T17:07:24Z</dc:date>
    <item>
      <title>Memory bandwidth on a NUMA system</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bandwidth-on-a-NUMA-system/m-p/1095836#M5725</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I'm looking into memory performance results on a Xeon E5-2620 v3 system with 2 NUMA nodes and 2 QPI links in between. With NUMA enabled in the BIOS, the Memory Latency Checker tool reports 44GB/s local throughput and 6GB/s remote, which looks too low.&lt;/P&gt;

&lt;PRE style="padding: 1em; border: 1px dashed rgb(47, 111, 171); color: rgb(0, 0, 0); line-height: 1.1em; text-align: justify; background-color: rgb(249, 249, 249);"&gt;                Numa node
Numa node            0       1
       0        44266.2  6004.0
       1         5980.9 44311.9&lt;/PRE&gt;

&lt;P&gt;With NUMA disabled (which results in cache line interleaving AFAIU), the combined throughput is ~40GB/s. PCM shows increased QPI traffic in this mode. So I would expect the NUMA-off figure to land somewhere between the 44GB/s and 6GB/s measured with NUMA on.&lt;/P&gt;

&lt;PRE style="padding: 1em; border: 1px dashed rgb(47, 111, 171); color: rgb(0, 0, 0); line-height: 1.1em; text-align: justify; background-color: rgb(249, 249, 249);"&gt;        Memory node
 Socket      0       1
     0  39537.2 39588.7
     1  39515.2 39527.0&lt;/PRE&gt;

&lt;P&gt;Any ideas?&lt;BR /&gt;
	I'm also curious to know how the tool (mlc) measures the bandwidth? Does it rely on PMU counters, or does it just count the memory ops from the standpoint of a client?&lt;/P&gt;

&lt;P&gt;Thanks,&lt;BR /&gt;
	Ilya&lt;/P&gt;</description>
      <pubDate>Thu, 03 Dec 2015 16:52:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bandwidth-on-a-NUMA-system/m-p/1095836#M5725</guid>
      <dc:creator>Ilya_M_1</dc:creator>
      <dc:date>2015-12-03T16:52:58Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bandwidth-on-a-NUMA-system/m-p/1095837#M5726</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;Low-core-count parts have lower remote memory bandwidth in the default configuration, which uses early snoop mode. If you change the snoop mode to "Home snoop" in the BIOS, you will see much higher remote memory bandwidth. Latency will increase slightly, but bandwidth will improve.&lt;/P&gt;

&lt;P&gt;Vish&lt;/P&gt;</description>
      <pubDate>Thu, 03 Dec 2015 17:07:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bandwidth-on-a-NUMA-system/m-p/1095837#M5726</guid>
      <dc:creator>Krishnaswa_V_Intel</dc:creator>
      <dc:date>2015-12-03T17:07:24Z</dc:date>
    </item>
    <item>
      <title>I also saw unusually low</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bandwidth-on-a-NUMA-system/m-p/1095838#M5727</link>
      <description>&lt;P&gt;I also saw unusually low numbers for remote accesses on a low-core-count Haswell EP (Xeon E5-2603 v3), but I no longer have access to the system to check the BIOS snoop configuration.&amp;nbsp; The 2603 is even slower than the 2620 -- lower core clock rates and lower DRAM clock rates, with the Intel Memory Latency Checker delivering only 4.7 GB/s between sockets for the default (all reads) case.&lt;/P&gt;

&lt;P&gt;The performance varies a bit with access pattern, but not by huge amounts, so the non-NUMA results you obtained are almost certainly not possible.&amp;nbsp;&amp;nbsp; If the data were actually interleaved between the sockets, the 39.5 GB/s you got in the second case would mean almost 20 GB/s from the local memory plus almost 20 GB/s across QPI from the remote memory.&amp;nbsp; This is more than 3x the value that was measured directly, so it seems implausible.&lt;/P&gt;
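
&lt;P&gt;The arithmetic behind that comparison can be sketched in a few lines of Python. This is only a back-of-the-envelope check using the figures quoted in this thread; the even 50/50 split between local and remote traffic is an assumption about ideal cache-line interleaving, and the function name is made up for illustration.&lt;/P&gt;

```python
def implied_remote_share(total_gbs, remote_fraction=0.5):
    """Bandwidth that has to cross QPI if remote_fraction of all traffic is remote."""
    return total_gbs * remote_fraction

# Figures from the thread: ~39.5 GB/s with NUMA off, ~6.0 GB/s remote with NUMA on.
interleaved_total = 39.5
measured_remote = 6.0

needed_over_qpi = implied_remote_share(interleaved_total)
ratio = needed_over_qpi / measured_remote

print(f"needed over QPI: {needed_over_qpi:.2f} GB/s")
print(f"ratio to measured remote bandwidth: {ratio:.2f}x")
```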

&lt;P&gt;You will have to check your BIOS documentation to be sure, but I don't think that disabling NUMA generates cache-line interleaving. You are much more likely to get no NUMA information, which will lead to uncontrollable pseudo-random page placement.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;It is difficult to tell exactly what the Intel Memory Latency Checker is doing for each test.&amp;nbsp; I know that it internally changes the CPU frequency to the maximum value and enables the hardware prefetchers if they are disabled, so it might also be obtaining NUMA information from the hardware that the OS is not aware of.&amp;nbsp;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;To test performance for interleaved memory, I recommend using a code that is more transparent, such as the STREAM benchmark (http://www.cs.virginia.edu/stream/).&amp;nbsp;&amp;nbsp; If you have the Intel compilers you can configure this to use streaming stores or to avoid streaming stores.&amp;nbsp; With gcc you will not get streaming stores (except at "-O3" where the compiler will replace the "Copy" kernel with an optimized library routine that uses streaming stores).&lt;/P&gt;

&lt;P&gt;A useful set of numbers might come from:&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;icc -O3 -xAVX2 -openmp -ffreestanding -DSTREAM_ARRAY_SIZE=80000000 -opt-streaming-stores always stream.c -o stream.nta.exe&lt;/P&gt;

	&lt;P&gt;export OMP_NUM_THREADS=6&lt;/P&gt;

	&lt;P&gt;export KMP_AFFINITY=verbose,compact &amp;nbsp; # change to "verbose,compact,1" if HyperThreading is enabled&lt;/P&gt;

	&lt;P&gt;numactl --membind=0 ./stream.nta.exe &amp;nbsp; # all local accesses on socket 0, streaming stores enabled&lt;/P&gt;

	&lt;P&gt;numactl --membind=1 ./stream.nta.exe &amp;nbsp; # all threads on socket 0, all data on socket 1, with streaming stores&lt;/P&gt;

	&lt;P&gt;export OMP_NUM_THREADS=12&lt;/P&gt;

	&lt;P&gt;export KMP_AFFINITY=verbose,scatter&lt;/P&gt;

	&lt;P&gt;./stream.nta.exe &amp;nbsp; # use both sockets, all accesses should be local, with streaming stores&lt;/P&gt;

	&lt;P&gt;numactl --interleave=0,1 ./stream.nta.exe &amp;nbsp; # use both sockets, memory alternates between sockets by 4KiB page&lt;/P&gt;

	&lt;P&gt;&amp;nbsp;&lt;/P&gt;

	&lt;P&gt;icc -O3 -xAVX2 -openmp -ffreestanding -DSTREAM_ARRAY_SIZE=80000000 -opt-streaming-stores never stream.c -o stream.alloc.exe&lt;/P&gt;

	&lt;P&gt;export OMP_NUM_THREADS=6&lt;/P&gt;

	&lt;P&gt;export KMP_AFFINITY=verbose,compact &amp;nbsp; # change to "verbose,compact,1" if HyperThreading is enabled&lt;/P&gt;

	&lt;P&gt;numactl --membind=0 ./stream.alloc.exe &amp;nbsp; # all local accesses on socket 0, no streaming stores&lt;/P&gt;

	&lt;P&gt;numactl --membind=1 ./stream.alloc.exe &amp;nbsp; # all threads on socket 0, all data on socket 1, no streaming stores&lt;/P&gt;

	&lt;P&gt;export OMP_NUM_THREADS=12&lt;/P&gt;

	&lt;P&gt;export KMP_AFFINITY=verbose,scatter&lt;/P&gt;

	&lt;P&gt;./stream.alloc.exe &amp;nbsp; # use both sockets, all accesses should be local, no streaming stores&lt;/P&gt;

	&lt;P&gt;numactl --interleave=0,1 ./stream.alloc.exe &amp;nbsp; # use both sockets, memory alternates between sockets by 4KiB page&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;If my testing on the Xeon E5-2603 v3 is any indication, your results using all the cores on a single socket and memory interleaved across the two chips should be somewhere in the range of 10 GB/s -- about 5 GB/s from local memory and about 5 GB/s over QPI from the other socket. This assumes a slightly lower QPI bandwidth than the 6 GB/s you reported, based on my observation that streaming stores are unusually slow on these low-frequency parts. On my Xeon E5-2603 v3, the Intel Memory Latency Checker showed a pattern like:&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;Using Read-only traffic type&lt;/P&gt;

&lt;PRE&gt;        Memory node
 Socket        0        1
      0  26893.0   4713.0
      1   4688.6  27011.7&lt;/PRE&gt;

	&lt;P&gt;Using "-W5" (one read and one write)&lt;/P&gt;

&lt;PRE&gt;        Memory node
 Socket        0        1
      0  34972.4   5613.6    &amp;lt;-- remote is about 20% faster than all reads
      1   5627.9  34708.2&lt;/PRE&gt;

	&lt;P&gt;Using "-W8" (one read and one non-temporal write)&lt;/P&gt;

&lt;PRE&gt;        Memory node
 Socket        0        1
      0  22906.1   3740.9    &amp;lt;-- remote is about 20% slower than all reads
      1   3765.2  22984.6&lt;/PRE&gt;
&lt;/BLOCKQUOTE&gt;
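
&lt;P&gt;The "about 20%" annotations in the tables can be checked with a short Python sketch. It only uses the socket 0 to node 1 figures quoted in the tables; the variable and function names are illustrative.&lt;/P&gt;

```python
# Remote bandwidth (MB/s) from socket 0 to memory node 1 for each traffic type.
remote_mbs = {
    "read-only": 4713.0,
    "W5 (read + write)": 5613.6,
    "W8 (read + non-temporal write)": 3740.9,
}

def percent_vs_baseline(value, baseline):
    """Relative difference of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0

baseline = remote_mbs["read-only"]
relative = {name: percent_vs_baseline(mbs, baseline) for name, mbs in remote_mbs.items()}

for name, pct in relative.items():
    print(f"{name}: {pct:+.1f}% relative to read-only")
```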

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 03 Dec 2015 20:28:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bandwidth-on-a-NUMA-system/m-p/1095838#M5727</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2015-12-03T20:28:56Z</dc:date>
    </item>
    <item>
      <title>Dr. McCalpin,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bandwidth-on-a-NUMA-system/m-p/1095839#M5728</link>
      <description>&lt;P&gt;Dr. McCalpin,&lt;/P&gt;

&lt;P&gt;MLC does not change the frequency. It only enables all the hardware prefetchers for bandwidth measurements and disables them when latency is measured. To measure cross-socket bandwidth, it uses all the cores on one socket and allocates memory on the other socket. Basically, a thread attaches itself to a core on the other socket, allocates memory, does a first touch (to ensure the memory comes from that socket), and then pins itself to cores on the first socket and accesses the allocated memory. This results in all references going to the remote socket, assuming NUMA is enabled.&lt;/P&gt;
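
&lt;P&gt;The allocate, first-touch, then re-pin pattern described above can be illustrated roughly as follows. This is not MLC's code, just a minimal Linux-only Python sketch; it assumes the kernel's default policy of placing a page on the node of the CPU that first writes it, and the function name and the choice of CPUs are hypothetical.&lt;/P&gt;

```python
import mmap
import os

PAGE = mmap.PAGESIZE

def first_touch_then_access(alloc_cpu, access_cpu, size_bytes):
    """Touch every page while pinned to alloc_cpu, then read it back from access_cpu."""
    original = os.sched_getaffinity(0)
    try:
        os.sched_setaffinity(0, {alloc_cpu})       # run where the memory should live
        buf = mmap.mmap(-1, size_bytes)            # anonymous mapping, not yet backed by pages
        for offset in range(0, size_bytes, PAGE):
            buf[offset] = 1                        # first touch of each page
        os.sched_setaffinity(0, {access_cpu})      # move to the CPU that does the accesses
        touched = sum(buf[offset] for offset in range(0, size_bytes, PAGE))
        buf.close()
        return touched
    finally:
        os.sched_setaffinity(0, original)          # restore the caller's affinity mask

cpus = sorted(os.sched_getaffinity(0))
# Hypothetical CPU choice: the last and first visible CPUs are not guaranteed to sit
# on different sockets; check the real topology (for example with lscpu) first.
print("pages touched:", first_touch_then_access(cpus[-1], cpus[0], 64 * PAGE))
```

&lt;P&gt;A real bandwidth test would use many threads and much larger buffers; the sketch only shows the order of the pinning and touching steps.&lt;/P&gt;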

&lt;P&gt;However, if NUMA is disabled, the system does 64-byte interleaving across sockets. That is, address x would be on socket0 while address x+64 would be on socket1. MLC does not do anything differently in this case. It still has threads pinned to the other socket, allocating memory and doing a first touch. However, the allocated memory comes equally from both sockets because NUMA is disabled.&lt;/P&gt;
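
&lt;P&gt;As a rough illustration of 64-byte interleaving, the Python sketch below maps an address to a socket using plain modulo arithmetic. The modulo mapping is an assumption: it is the simplest scheme consistent with the description above, and real hardware may hash address bits differently. The function name is made up.&lt;/P&gt;

```python
CACHE_LINE_BYTES = 64
NUM_SOCKETS = 2

def home_socket(address):
    """Socket owning this address under simple modulo interleaving (an assumption)."""
    return (address // CACHE_LINE_BYTES) % NUM_SOCKETS

base = 0x1000
for offset in (0, 64, 128, 192):
    print(hex(base + offset), "-> socket", home_socket(base + offset))

# Count how the cache lines of a 1 MiB buffer starting at base split across sockets.
lines_per_socket = [0] * NUM_SOCKETS
for address in range(base, base + (1024 * 1024), CACHE_LINE_BYTES):
    lines_per_socket[home_socket(address)] += 1
print("cache lines per socket:", lines_per_socket)
```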

&lt;P&gt;The result that CodeMonkey is reporting for NUMA disabled is still possible, because the resources that track requests are allocated differently depending on whether NUMA is on or off. Basically, the caching agent has resources that are used to track outstanding requests (misses). Requests to remote memory take longer than requests to local memory, and we typically want to provide more local memory bandwidth. Based on various considerations, the resources are divided differently between local and remote requests, and this division differs between NUMA enabled and NUMA disabled.&lt;/P&gt;

&lt;P&gt;Vish&lt;/P&gt;</description>
      <pubDate>Thu, 03 Dec 2015 21:51:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bandwidth-on-a-NUMA-system/m-p/1095839#M5728</guid>
      <dc:creator>Krishnaswa_V_Intel</dc:creator>
      <dc:date>2015-12-03T21:51:57Z</dc:date>
    </item>
    <item>
      <title>Vish, John,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bandwidth-on-a-NUMA-system/m-p/1095840#M5729</link>
      <description>&lt;P&gt;Vish, John,&lt;/P&gt;

&lt;P&gt;Thanks for your detailed replies!&lt;BR /&gt;
	I disabled Early Snooping in the BIOS and the remote bandwidth skyrocketed, at the cost of around a 10% latency penalty, as Vish noted.&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;Intel(R) Memory Latency Checker - v3.0&lt;BR /&gt;
		Measuring idle latencies (in ns)...&lt;BR /&gt;
		&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Numa node&lt;BR /&gt;
		Numa node &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0 &amp;nbsp; &amp;nbsp; &amp;nbsp; 1&lt;BR /&gt;
		&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;85.7 &amp;nbsp; 128.9&lt;BR /&gt;
		&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;1 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 128.0 &amp;nbsp; &amp;nbsp;85.6&lt;/P&gt;

	&lt;P&gt;Measuring Peak Memory Bandwidths for the system&lt;BR /&gt;
		Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)&lt;BR /&gt;
		Using all the threads from each core if Hyper-threading is enabled&lt;BR /&gt;
		Using traffic with the following read-write ratios&lt;BR /&gt;
		ALL Reads &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;: &amp;nbsp; &amp;nbsp; &amp;nbsp;86216.5&lt;BR /&gt;
		3:1 Reads-Writes : &amp;nbsp; &amp;nbsp; &amp;nbsp;86969.6&lt;BR /&gt;
		2:1 Reads-Writes : &amp;nbsp; &amp;nbsp; &amp;nbsp;88117.0&lt;BR /&gt;
		1:1 Reads-Writes : &amp;nbsp; &amp;nbsp; &amp;nbsp;87380.4&lt;BR /&gt;
		Stream-triad like: &amp;nbsp; &amp;nbsp; &amp;nbsp;80178.1&lt;/P&gt;

	&lt;P&gt;Measuring Memory Bandwidths between nodes within system&lt;BR /&gt;
		Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)&lt;BR /&gt;
		Using all the threads from each core if Hyper-threading is enabled&lt;BR /&gt;
		Using Read-only traffic type&lt;BR /&gt;
		&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Numa node&lt;BR /&gt;
		Numa node &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0 &amp;nbsp; &amp;nbsp; &amp;nbsp; 1&lt;BR /&gt;
		&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;43272.7 25336.9&lt;BR /&gt;
		&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;1 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;25346.7 43259.0&lt;/P&gt;

	&lt;P&gt;Measuring Loaded Latencies for the system&lt;BR /&gt;
		Using all the threads from each core if Hyper-threading is enabled&lt;BR /&gt;
		Using Read-only traffic type&lt;BR /&gt;
		Inject &amp;nbsp;Latency Bandwidth&lt;BR /&gt;
		Delay &amp;nbsp; (ns) &amp;nbsp; &amp;nbsp;MB/sec&lt;BR /&gt;
		==========================&lt;BR /&gt;
		&amp;nbsp;00000 &amp;nbsp;150.63 &amp;nbsp; &amp;nbsp;86501.6&lt;BR /&gt;
		&amp;nbsp;00002 &amp;nbsp;150.88 &amp;nbsp; &amp;nbsp;86481.6&lt;BR /&gt;
		&amp;nbsp;00008 &amp;nbsp;149.23 &amp;nbsp; &amp;nbsp;86652.2&lt;BR /&gt;
		&amp;nbsp;00015 &amp;nbsp;148.95 &amp;nbsp; &amp;nbsp;86548.9&lt;BR /&gt;
		&amp;nbsp;00050 &amp;nbsp;115.06 &amp;nbsp; &amp;nbsp;73755.5&lt;BR /&gt;
		&amp;nbsp;00100 &amp;nbsp;104.84 &amp;nbsp; &amp;nbsp;56534.0&lt;BR /&gt;
		&amp;nbsp;00200 &amp;nbsp; 96.78 &amp;nbsp; &amp;nbsp;31569.2&lt;BR /&gt;
		&amp;nbsp;00300 &amp;nbsp; 93.22 &amp;nbsp; &amp;nbsp;21907.8&lt;BR /&gt;
		&amp;nbsp;00400 &amp;nbsp; 91.18 &amp;nbsp; &amp;nbsp;16864.0&lt;BR /&gt;
		&amp;nbsp;00500 &amp;nbsp; 90.30 &amp;nbsp; &amp;nbsp;13768.8&lt;BR /&gt;
		&amp;nbsp;00700 &amp;nbsp; 89.12 &amp;nbsp; &amp;nbsp;10153.8&lt;BR /&gt;
		&amp;nbsp;01000 &amp;nbsp; 88.35 &amp;nbsp; &amp;nbsp; 7381.2&lt;BR /&gt;
		&amp;nbsp;01300 &amp;nbsp; 87.46 &amp;nbsp; &amp;nbsp; 5870.2&lt;BR /&gt;
		&amp;nbsp;01700 &amp;nbsp; 87.36 &amp;nbsp; &amp;nbsp; 4674.6&lt;BR /&gt;
		&amp;nbsp;02500 &amp;nbsp; 87.43 &amp;nbsp; &amp;nbsp; 3419.3&lt;BR /&gt;
		&amp;nbsp;03500 &amp;nbsp; 86.27 &amp;nbsp; &amp;nbsp; 2664.8&lt;BR /&gt;
		&amp;nbsp;05000 &amp;nbsp; 86.09 &amp;nbsp; &amp;nbsp; 2091.0&lt;BR /&gt;
		&amp;nbsp;09000 &amp;nbsp; 85.86 &amp;nbsp; &amp;nbsp; 1494.9&lt;BR /&gt;
		&amp;nbsp;20000 &amp;nbsp; 85.89 &amp;nbsp; &amp;nbsp; 1082.7&lt;/P&gt;

	&lt;P&gt;Measuring cache-to-cache transfer latency (in ns)...&lt;BR /&gt;
		Local Socket L2-&amp;gt;L2 HIT &amp;nbsp;latency &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;31.8&lt;BR /&gt;
		Local Socket L2-&amp;gt;L2 HITM latency &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;36.5&lt;BR /&gt;
		Remote Socket LLC-&amp;gt;LLC HITM latency (data address homed in writer socket)&lt;BR /&gt;
		&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Reader Numa Node&lt;BR /&gt;
		Writer Numa Node &amp;nbsp; &amp;nbsp; 0 &amp;nbsp; &amp;nbsp; &amp;nbsp; 1&lt;BR /&gt;
		&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 0 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;- &amp;nbsp; &amp;nbsp;77.8&lt;BR /&gt;
		&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 1 &amp;nbsp; &amp;nbsp; 78.0 &amp;nbsp; &amp;nbsp; &amp;nbsp; -&lt;BR /&gt;
		Remote Socket LLC-&amp;gt;LLC HITM latency (data address homed in reader socket)&lt;BR /&gt;
		&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Reader Numa Node&lt;BR /&gt;
		Writer Numa Node &amp;nbsp; &amp;nbsp; 0 &amp;nbsp; &amp;nbsp; &amp;nbsp; 1&lt;BR /&gt;
		&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 0 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;- &amp;nbsp; &amp;nbsp;76.8&lt;BR /&gt;
		&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 1 &amp;nbsp; &amp;nbsp; 77.0 &amp;nbsp; &amp;nbsp; &amp;nbsp; -&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
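
&lt;P&gt;For comparison with the first run in this thread (early snoop), the change in the read-only socket 0 figures can be worked out with a short Python sketch. The numbers are copied from the two MLC outputs posted here; the names are illustrative.&lt;/P&gt;

```python
# Read-only MLC bandwidths in MB/s for node 0, copied from the two runs in this thread.
early_snoop = {"local": 44266.2, "remote": 6004.0}
home_snoop = {"local": 43272.7, "remote": 25336.9}

def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0

remote_speedup = home_snoop["remote"] / early_snoop["remote"]
local_change = percent_change(early_snoop["local"], home_snoop["local"])

print(f"remote bandwidth speedup: {remote_speedup:.2f}x")
print(f"local bandwidth change: {local_change:.1f}%")
```

&lt;P&gt;So remote bandwidth improved by a bit more than 4x, while local bandwidth dropped by roughly 2%.&lt;/P&gt;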

&lt;P&gt;Thanks,&lt;BR /&gt;
	Ilya&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 06 Dec 2015 14:28:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bandwidth-on-a-NUMA-system/m-p/1095840#M5729</guid>
      <dc:creator>Ilya_M_1</dc:creator>
      <dc:date>2015-12-06T14:28:48Z</dc:date>
    </item>
    <item>
      <title>Is there a way to check from</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bandwidth-on-a-NUMA-system/m-p/1095841#M5730</link>
      <description>&lt;P&gt;Is there a way to check from within Linux the snooping method configured in the BIOS? I.e. MSR or anything else?&lt;/P&gt;

&lt;P&gt;Thanks,&lt;BR /&gt;
	Ilya&lt;/P&gt;</description>
      <pubDate>Mon, 21 Dec 2015 13:21:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bandwidth-on-a-NUMA-system/m-p/1095841#M5730</guid>
      <dc:creator>Ilya_M_1</dc:creator>
      <dc:date>2015-12-21T13:21:12Z</dc:date>
    </item>
    <item>
      <title>I don't know of any</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bandwidth-on-a-NUMA-system/m-p/1095842#M5731</link>
      <description>&lt;P&gt;I don't know of any programmatic way of checking this definitively.&amp;nbsp;&amp;nbsp; I usually just check the local memory latency and guess which mode it is in....&lt;/P&gt;

&lt;P&gt;I seem to recall that some of the uncore counters behave differently with home snoop vs early snoop.&amp;nbsp; In the QPI Link Layer counters, the RxL_FLITS_G1.SNOOP count is zero when home snoop is enabled (because the chip receives read requests rather than snoop requests in this mode).&amp;nbsp; If it gets non-zero counts when in "early snoop" mode, then it could be used as a differentiator....&amp;nbsp;&amp;nbsp; I don't have a machine that I can test this on right now...&lt;/P&gt;</description>
      <pubDate>Mon, 21 Dec 2015 20:11:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bandwidth-on-a-NUMA-system/m-p/1095842#M5731</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2015-12-21T20:11:28Z</dc:date>
    </item>
  </channel>
</rss>

