<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hello Amir. in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941389#M1970</link>
    <description>Hello Amir.
Below I summarize our results so far.
Looking at table 1, it seems that the latencies of your wsm (85.707ns) and snb (89.065ns) systems are about the same.
The frequency of your wsm system is about 1.148x that of the snb box.
In your memtest main.cpp, it seems like the 2 main components of the time are a) the random number generation and b) the loading of the random memory location.
Given that your memory latencies look about equal, I wonder how much of the difference is due to the higher wsm frequency.
If you want to test this, there are 2 ways:
1) change the frequency of the cpus (see the attached how_to_change_frequency_on_linux_pub.txt file) or
2) move the 'generate the random numbers' out of the timing loop.
For 2), see my win_main.cpp, which is a version of your main.cpp modified for Windows. I put the random numbers into an array.
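A minimal sketch of option 2), written in Python rather than the C++ of main.cpp, and not the actual contents of win_main.cpp: the random indices are generated into an array before the timer starts, so the timed loop contains only the memory updates. The names prepare_indices and timed_updates, and the sizes used, are made up for illustration.

```python
import random
import time


def prepare_indices(vector_size, cycles, seed=0):
    # Generate every random index up front, outside the timed region.
    rng = random.Random(seed)
    return [rng.randrange(vector_size) for _ in range(cycles)]


def timed_updates(cells, indices):
    # Only the random-location updates are inside the timing window.
    start = time.perf_counter()
    for cycle, idx in enumerate(indices):
        cells[idx] = cycle
    return time.perf_counter() - start


if __name__ == "__main__":
    size, cycles = 100_003, 200_000  # illustrative sizes, not the thread's
    cells = [0] * size
    elapsed = timed_updates(cells, prepare_indices(size, cycles))
    print(f"{cycles} updates in {elapsed:.3f} s")
```

In Python the interpreter overhead dominates the timing, so this only shows the structure of the change, not a usable latency measurement.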

I'm sorry that the 2 systems I used were not more similar and that they were not linux.
Pat


Amir's 2 systems:
wsm-ep tix2 - cpu X5680 3.33GHz, mother board - Supermicro X8DTU , memory - 64GB divided 32GB to each bank at 1.33GHz
snb-ep tix8 - cpu E5-2690 2.90GHz, mother board - Intel S2600GZ, memory - 64GB divided 32GB to each bank at 1.60GHz
Frequency ratio wsm/snb = 1.148x
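For reference, the ratios quoted in this summary can be reproduced from the clock speeds above and from Tables 1 and 2 below. This is plain arithmetic on the posted figures; the variable names are only for illustration.

```python
# Clock speeds (GHz) of Amir's two systems, as listed above.
wsm_ghz, snb_ghz = 3.33, 2.90
freq_ratio = wsm_ghz / snb_ghz

# Table 1: lmbench lat_mem_rd latency (ns), snb over wsm.
latency_ratio = 89.065 / 85.707

# Table 2: best memtest times (secs), snb over wsm.
time_ratio = 8.084655008 / 6.617275213

print(f"frequency wsm/snb: {freq_ratio:.3f}x")
print(f"latency   snb/wsm: {latency_ratio:.3f}x")
print(f"memtest   snb/wsm: {time_ratio:.3f}x")
```

On these figures the memtest ratio is somewhat larger than the frequency ratio, so clock speed alone may not account for the whole gap.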

Table 1 below
Amir running lmbench lat_mem_rd -t (random memory accesses)	
system	prefetch	spin	turbo	random	latency(ns)	Best snb/wsm
snb	off	on	on	on	89.065	1.039179997x
wsm	off	?	?	on	85.707			via private msg


Table 2 below
Amir running his memtest microkernel
system	prefetch	spin	turbo	random	time(secs)	Best snb/wsm
snb	off	on	on	on	8.084655008
snb	off	off	on	on	8.992366456
wsm	on	off	?	on	6.617275213	1.221749851x


Pat's systems:
wsm-ep - cpu L5640  @ 2.27GHz, mother board - Intel S5500WB, memory - 12GB total divided 2GB per channel, 3 DIMMs per node  at 1.33GHz
snb-ep - cpu @ 2.70GHz, cpuid signature 0x206d5, mother board - ASUSTek Z9PP-D24, memory - 64GB total divided 8GB per channel, 4 DIMMs per node  at 1.60GHz
Frequency ratio snb/wsm = 1.189x

Table 3 below
Pat running a modified version of Amir's memtest
modified memtest now generates random numbers outside of timing loop	
system	prefetch	spin	turbo	random	time(secs)	Best wsm/snb	
snb	off	on	on	on	6.41873	
wsm	off	on	on	on	7.02422	1.094331745x
	
	
Table 4 below.
Pat running a memory latency test with a random memory access	
system	prefetch	spin	turbo	random	latency(ns)	Best wsm/snb
snb	off	off	on	on	96.714			
snb	off	on	on	on	87.844	
wsm	off	off	on	on	99.976	1.138108465x</description>
    <pubDate>Wed, 21 Nov 2012 04:29:26 GMT</pubDate>
    <dc:creator>Patrick_F_Intel1</dc:creator>
    <dc:date>2012-11-21T04:29:26Z</dc:date>
    <item>
      <title>Sandy bridge performance degradation compare to Westmere</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941352#M1933</link>
      <description>&lt;P&gt;&lt;SPAN&gt;I created a simple memtest that allocates a large vector, then repeatedly picks a random element and updates its data.&lt;/SPAN&gt;&lt;BR /&gt;Pseudo code:&lt;BR /&gt;&lt;BR /&gt;DataCell* dataCells = new DataCell[VECTOR_SIZE];&lt;BR /&gt;for (int cycles = 0; cycles &amp;lt; gCycles; cycles++) {&lt;BR /&gt;&amp;nbsp; &amp;nbsp; u64 randVal = random();&lt;BR /&gt;&amp;nbsp; &amp;nbsp; DataCell* dataCell = &amp;amp;dataCells[randVal % VECTOR_SIZE];&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; dataCell-&amp;gt;m_count = cycles;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; dataCell-&amp;gt;m_random = randVal;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; dataCell-&amp;gt;m_flag = 1;&lt;/P&gt;
&lt;P&gt;}&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;I'm using the perf utility to gather performance counter info.&lt;BR /&gt;The most interesting results are when the vector size is larger than the cache size (tix8: 20MB, tix2: 12MB).&lt;/P&gt;
&lt;P&gt;hardware specification&lt;/P&gt;
&lt;P&gt;tix2 - cpu X5680 3.33GHz, mother board - Supermicro X8DTU , memory - 64GB divided 32GB to each bank at 1.33GHz&lt;/P&gt;
&lt;P&gt;tix8 - cpu E5-2690 2.90GHz, mother board - Intel S2600GZ, memory&amp;nbsp;- 64GB divided 32GB to each bank at 1.60GHz&lt;/P&gt;
&lt;P&gt;compiled with gcc 4.6.1 -O3 -mtune=native -march=native&lt;/P&gt;
&lt;P&gt;amk@tix2:~/amir/memtest$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000&lt;BR /&gt;total Time (rdtsc) 21800971556 nano time 6542908630 vector size 240000456&lt;/P&gt;
&lt;P&gt;Performance counter stats for './memtest -v 10000019 -c 100000000':&lt;/P&gt;
&lt;P&gt;21842742688 cycles # 0.000 M/sec&lt;BR /&gt; 5869556879 instructions # 0.269 IPC &lt;BR /&gt; 1700665337 L1-dcache-loads # 0.000 M/sec&lt;BR /&gt; 221870903 L1-dcache-load-misses # 0.000 M/sec&lt;BR /&gt; 1130278738 L1-dcache-stores # 0.000 M/sec&lt;BR /&gt; 0 L1-dcache-store-misses # 0.000 M/sec&lt;/P&gt;
&lt;P&gt;6.628680493 seconds time elapsed&lt;/P&gt;
&lt;P&gt;amk@tix8:~/amir/memtest$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000&lt;BR /&gt;total Time (rdtsc) 24362574412 nano time 8424126698 vector size 240000456&lt;/P&gt;
&lt;P&gt;Performance counter stats for './memtest -v 10000019 -c 100000000':&lt;/P&gt;
&lt;P&gt;24409499958 cycles # 0.000 M/sec&lt;BR /&gt; 5869656821 instructions # 0.240 IPC &lt;BR /&gt; 1192635035 L1-dcache-loads # 0.000 M/sec&lt;BR /&gt; 94702716 L1-dcache-load-misses # 0.000 M/sec&lt;BR /&gt; 1373779283 L1-dcache-stores # 0.000 M/sec&lt;BR /&gt; 306775598 L1-dcache-store-misses # 0.000 M/sec&lt;/P&gt;
&lt;P&gt;8.525456817 seconds time elapsed&lt;/P&gt;
&lt;P&gt;What am I missing? Is Sandy Bridge slower than Westmere?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Amir.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 01 Nov 2012 11:05:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941352#M1933</guid>
      <dc:creator>amk21</dc:creator>
      <dc:date>2012-11-01T11:05:12Z</dc:date>
    </item>
    <item>
      <title>I am having similar problems</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941353#M1934</link>
      <description>I am having similar problems with our own proprietary application.  I haven't been able to isolate the problematic code to a nice tidy snippet like you have.

Just looking at your program run times, 8.5 seconds versus 6.6 seconds, that's about a 23% difference.

For the purpose of testing and eliminating variables, I suggest trying the following:
- configure the BIOS in both systems to disable any and all power-saving features (C-states, C1E, memory power saving, etc)
- enable turbo boost on both
- use the idle=poll kernel commandline param on your Sandy Bridge server (needed, as the BIOS settings alone won't keep the CPU from leaving C0 state)

In this setup, you can use the "i7z" program to see what speed all your cores are running at.  At least on my systems, taking all the above steps results in all cores constantly running above their "advertised" clock speed, i.e. turbo boost is kicking in.
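As a rough cross-check alongside i7z, here is a small Python sketch that reads the per-core frequency the Linux cpufreq driver reports in sysfs. It assumes the kernel exposes scaling_cur_freq under /sys/devices/system/cpu/cpuN/cpufreq/; with some drivers or BIOS settings those files are absent, and the function then returns an empty dict. The function name and the root parameter are made up for illustration.

```python
import glob
import os
import re


def read_core_freqs_khz(root="/sys/devices/system/cpu"):
    # Map core number to the current frequency in kHz, as reported by cpufreq.
    freqs = {}
    pattern = os.path.join(root, "cpu[0-9]*", "cpufreq", "scaling_cur_freq")
    for path in glob.glob(pattern):
        core_dir = os.path.basename(os.path.dirname(os.path.dirname(path)))
        match = re.fullmatch(r"cpu(\d+)", core_dir)
        if match is None:
            continue
        try:
            with open(path) as handle:
                freqs[int(match.group(1))] = int(handle.read().strip())
        except (OSError, ValueError):
            # Skip cores whose frequency cannot be read or parsed.
            continue
    return freqs


if __name__ == "__main__":
    for core, khz in sorted(read_core_freqs_khz().items()):
        print(f"cpu{core}: {khz / 1000:.1f} MHz")
```

The value is a snapshot, so it is worth sampling it repeatedly while the test is running rather than once on an idle machine.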

Yes, this will make the servers run hot and use lots of power.  :)

These are tunings for a low-latency environment, but I think they might be appropriate for testing/experimenting in your case.  At least, if you do these things, and see the difference between Westmere and Sandy Bridge narrow, then you can attribute it to one of these tweaks.  At least in my low-latency world, the aggressive power-saving features are bad for performance.  Just a random guess here, but: perhaps your application is such that, during execution, it allows the CPU to drop into some kind of a sleep state many times.  There is a latency penalty for coming out of a sleep state.  If you drop in and out of sleep states many times during execution, you might see a cumulative effect in increased overall runtime.</description>
      <pubDate>Thu, 01 Nov 2012 15:42:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941353#M1934</guid>
      <dc:creator>matt_garman</dc:creator>
      <dc:date>2012-11-01T15:42:31Z</dc:date>
    </item>
    <item>
      <title>all power saving is disabled</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941354#M1935</link>
      <description>All power saving is disabled and hyper-threading is disabled. i7z reports a CPU frequency of 3290.1, but the performance is even worse:

amk@tix8:~/amir/memtest$  perf stat -c -e cycles:u   -e instructions:u   -e l1-dcache-loads:u   -e l1-dcache-load-misses:u   -e L1-dcache-stores:u -e L1-dcache-store-misses:u  ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 25724416756 nano time 21437013963 vector size 240000456

 Performance counter stats for './memtest -v 10000019 -c 100000000':

    29300656750  cycles                   #      0.000 M/sec
     5869414958  instructions             #      0.200 IPC  
     1190853811  L1-dcache-loads          #      0.000 M/sec
       94650151  L1-dcache-load-misses    #      0.000 M/sec
     1379446403  L1-dcache-stores         #      0.000 M/sec
      306750238  L1-dcache-store-misses   #      0.000 M/sec

    8.990783606  seconds time elapsed
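
To make the counters above easier to compare between runs, here is a small Python sketch that derives IPC and the L1 miss ratios from the posted numbers. The helper name derive_ratios is made up, and the ratios are only as meaningful as the underlying event definitions, which may differ between Westmere and Sandy Bridge.

```python
def derive_ratios(counters):
    # counters: event name to raw count, as printed by perf stat.
    return {
        "ipc": counters["instructions"] / counters["cycles"],
        "l1d_load_miss_ratio": counters["L1-dcache-load-misses"] / counters["L1-dcache-loads"],
        "l1d_store_miss_ratio": counters["L1-dcache-store-misses"] / counters["L1-dcache-stores"],
    }


# Counts copied from the tix8 run above.
tix8 = {
    "cycles": 29300656750,
    "instructions": 5869414958,
    "L1-dcache-loads": 1190853811,
    "L1-dcache-load-misses": 94650151,
    "L1-dcache-stores": 1379446403,
    "L1-dcache-store-misses": 306750238,
}

for name, value in derive_ratios(tix8).items():
    print(f"{name}: {value:.3f}")
```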


The results below are without turbo boost:

amk@tix8:~/amir/memtest$ perf stat -c -e cycles:u   -e instructions:u   -e l1-dcache-loads:u   -e l1-dcache-load-misses:u   -e L1-dcache-stores:u -e L1-dcache-store-misses:u  ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 24314509968 nano time 8404600749 vector size 240000456

 Performance counter stats for './memtest -v 10000019 -c 100000000':

    24360101110  cycles                   #      0.000 M/sec
     5869421474  instructions             #      0.241 IPC  
     1191790678  L1-dcache-loads          #      0.000 M/sec
       94483286  L1-dcache-load-misses    #      0.000 M/sec
     1374772009  L1-dcache-stores         #      0.000 M/sec
      306899965  L1-dcache-store-misses   #      0.000 M/sec

    8.506839690  seconds time elapsed</description>
      <pubDate>Thu, 01 Nov 2012 19:05:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941354#M1935</guid>
      <dc:creator>amk21</dc:creator>
      <dc:date>2012-11-01T19:05:56Z</dc:date>
    </item>
    <item>
      <title>The web sites appear to</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941355#M1936</link>
      <description>The web sites appear to confirm that those motherboards are the usual full featured ones, with 8 channels on Sandy Bridge and 6 on Westmere.
My E5-2670 has 1 stick in each channel.  I do see lower performance than the 5680 on operations where performance is proportional to clock speed and doesn't need the superior memory system.
I suppose gcc 4.6 doesn't use nontemporal stores directly, and I guess you have excluded use of simd instructions.</description>
      <pubDate>Thu, 01 Nov 2012 19:40:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941355#M1936</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2012-11-01T19:40:51Z</dc:date>
    </item>
    <item>
      <title>i don't fully understand your</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941357#M1938</link>
      <description>I don't fully understand your answer.

First of all, we are using the E5-2690. Is the statement "E5-2670 has 1 stick in each channel" relevant to this CPU?

What I understand from your answer ("I do see lower performance than the 5680 on operations where performance") is that Sandy Bridge (E5-2690) is slower than Westmere (5680) on the pseudo code I wrote previously (I can supply the code for this test), and that there is nothing I can do to solve this issue (change compiler, change compile flags, change BIOS settings ...).</description>
      <pubDate>Thu, 01 Nov 2012 20:22:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941357#M1938</guid>
      <dc:creator>amk21</dc:creator>
      <dc:date>2012-11-01T20:22:45Z</dc:date>
    </item>
    <item>
      <title>Hello amk21,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941358#M1939</link>
      <description>Hello amk21,
For the pseudo-code where you pick an index into the array... I assume that random() returns something in the range of VECTOR_SIZE.

The test that you've generated is sort of a memory latency test.
I say 'sort of' because the usual latency test uses linked list of dependent addresses (so that only one load is outstanding at a time).
Doing a random list can generate more than one load outstanding at a time.
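
To illustrate the difference, here is a minimal Python sketch of the dependent-load ('pointer chasing') pattern. It is not the code that lmbench or any other benchmark uses: it builds one random cycle over all slots with Sattolo's algorithm, so each load address depends on the result of the previous load. In Python the interpreter overhead swamps memory latency, so treat it as a picture of the access pattern rather than a measurement tool; the function names are made up.

```python
import random


def build_chain(n, seed=0):
    # Sattolo's algorithm: a random permutation that forms one n-cycle,
    # so following next_index visits every slot before returning to the start.
    rng = random.Random(seed)
    next_index = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j is strictly below i
        next_index[i], next_index[j] = next_index[j], next_index[i]
    return next_index


def chase(next_index, steps, start=0):
    # Each lookup needs the previous result, so only one load is outstanding.
    pos = start
    for _ in range(steps):
        pos = next_index[pos]
    return pos


if __name__ == "__main__":
    chain = build_chain(100_000)
    print("back at start:", chase(chain, len(chain)) == 0)
```

By contrast, a loop that computes each index from an independent random number has no such dependency between loads, which is why several misses can be in flight at once.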

Do you know if the prefetchers are disabled in the BIOS? 
If one system has the prefetchers enabled and another system has them disabled, things can get confusing.

Do you have 2 processors on the system or just 1 chip?
If you have more than 1 chip, do you know if NUMA is enabled on both systems?

For latency tests, it is better to have the prefetchers disabled (just to make things simpler).

If both systems are configured optimally, I would expect the sandybridge-based system (tix8) to have lower latency than the westmere-based system (tix2). Optimally means 1 DIMM per slot and numa enabled (if there is more than 1 processor).

Are you running on Windows? 
If so, the cpu-z folks have a memory latency tool that you could run to see if their tool gets similar results to what you are seeing.
Try running the latency.exe in &lt;A href="http://www.cpuid.com/medias/files/softwares/misc/latency.zip" target="_blank"&gt;http://www.cpuid.com/medias/files/softwares/misc/latency.zip&lt;/A&gt;
If you can, please send the output.

On linux you can use lmbench to get latency... see &lt;A href="http://sourceforge.net/projects/lmbench/" target="_blank"&gt;http://sourceforge.net/projects/lmbench/&lt;/A&gt;
But I'm not too familiar with lmbench so i can't help too much with running instructions.

Running these industry standard benchmarks will give us more information on the relative performance of the systems.
Pat</description>
      <pubDate>Fri, 02 Nov 2012 01:50:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941358#M1939</guid>
      <dc:creator>Patrick_F_Intel1</dc:creator>
      <dc:date>2012-11-02T01:50:19Z</dc:date>
    </item>
    <item>
      <title>i have 2 processors on the</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941359#M1940</link>
      <description>i have 2 processors on the system and numa is enabled.

I'll verify the tix2 BIOS settings and run lmbench, but I need the simple memtest because it simulates my application: I have a very large map that is actually a 2-dimensional vector, and I found that finding the right line in the map is the most costly operation.

"For latency tests, it is better to have the prefetchers disabled" - which BIOS settings did you have in mind?
disabled data prefetcher
Numa optimized - Enabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Disabled
DCU instruction prefetcher - Enabled

amk@tix8:~/amir/memtest$ perf stat -c -e cycles:u   -e instructions:u   -e l1-dcache-loads:u   -e l1-dcache-load-misses:u   -e L1-dcache-stores:u -e L1-dcache-store-misses:u  ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 24457792172 nano time 8457051235 vector size 240000456

 Performance counter stats for './memtest -v 10000019 -c 100000000':

    24504834353  cycles                   #      0.000 M/sec
     5869424898  instructions             #      0.240 IPC  
     1193553992  L1-dcache-loads          #      0.000 M/sec
       94548506  L1-dcache-load-misses    #      0.000 M/sec
     1370182667  L1-dcache-stores         #      0.000 M/sec
      306627891  L1-dcache-store-misses   #      0.000 M/sec

    8.559050619  seconds time elapsed

disabled data prefetcher and numa optimized
Numa optimized - Disabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Disabled
DCU instruction prefetcher - Enabled

amk@tix8:~/amir/memtest$ perf stat -c -e cycles:u   -e instructions:u   -e l1-dcache-loads:u   -e l1-dcache-load-misses:u   -e L1-dcache-stores:u -e L1-dcache-store-misses:u  ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 33150418300 nano time 11462800242 vector size 240000456

 Performance counter stats for './memtest -v 10000019 -c 100000000':

    33191154216  cycles                   #      0.000 M/sec
     5869420947  instructions             #      0.177 IPC  
     1190593871  L1-dcache-loads          #      0.000 M/sec
       94498148  L1-dcache-load-misses    #      0.000 M/sec
     1382188152  L1-dcache-stores         #      0.000 M/sec
      306662218  L1-dcache-store-misses   #      0.000 M/sec

   11.568955857  seconds time elapsed

disabled numa optimized
Numa optimized - Disabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Enabled
DCU instruction prefetcher - Enabled



amk@tix8:~/amir/memtest$ perf stat -c -e cycles:u   -e instructions:u   -e l1-dcache-loads:u   -e l1-dcache-load-misses:u   -e L1-dcache-stores:u -e L1-dcache-store-misses:u  ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 33150283136 nano time 11462753504 vector size 240000456

 Performance counter stats for './memtest -v 10000019 -c 100000000':

    33190933585  cycles                   #      0.000 M/sec
     5869420768  instructions             #      0.177 IPC  
     1190685322  L1-dcache-loads          #      0.000 M/sec
       94769556  L1-dcache-load-misses    #      0.000 M/sec
     1382058359  L1-dcache-stores         #      0.000 M/sec
      306649458  L1-dcache-store-misses   #      0.000 M/sec

   11.569743183  seconds time elapsed</description>
      <pubDate>Fri, 02 Nov 2012 07:41:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941359#M1940</guid>
      <dc:creator>amk21</dc:creator>
      <dc:date>2012-11-02T07:41:50Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941360#M1941</link>
      <description>Hi,

regarding the latency measurement using lmbench.

Build the utility called "lat_mem_rd" in the package. Then run:

numactl --cpunodebind=0 --membind=1 ./lat_mem_rd -t 1024
to measure the memory access latency between NUMA node 0 and 1. The latency test increases the
working set and converges towards the end to the memory latency.

numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
to measure the local memory latency on NUMA node 0.
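
Since the value of interest is the plateau at the largest working sets, a small Python sketch like the following can pull it out of the lat_mem_rd output. It assumes two-column lines of array size in MB and latency in ns; the function name and the choice of averaging the last four points are arbitrary. The sample lines are copied from a tix2 local-memory run posted in this thread.

```python
def plateau_latency_ns(output, tail=4):
    # Keep only "size latency" lines; the stride header is skipped.
    points = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        try:
            points.append((float(fields[0]), float(fields[1])))
        except ValueError:
            continue
    if not points:
        raise ValueError("no data points found")
    last = points[-tail:]
    return sum(latency for _, latency in last) / len(last)


sample = '''"stride=64
384.00000 87.255
512.00000 86.998
768.00000 87.018
1024.00000 86.786
'''

print(f"plateau: {plateau_latency_ns(sample):.3f} ns")
```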

--
Roman</description>
      <pubDate>Fri, 02 Nov 2012 13:24:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941360#M1941</guid>
      <dc:creator>Roman_D_Intel</dc:creator>
      <dc:date>2012-11-02T13:24:00Z</dc:date>
    </item>
    <item>
      <title>results for: numactl -</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941361#M1942</link>
      <description>results for: numactl --cpunodebind=0 --membind=1 ./lat_mem_rd -t 1024 

tix8 bios setting
Numa optimized - Enabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Enabled
DCU instruction prefetcher - Enabled

 amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=1 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.383
0.00098 1.383
0.00195 1.383
0.00293 1.383
0.00391 1.383
0.00586 1.383
0.00781 1.383
0.01172 1.383
0.01562 1.383
0.02344 1.383
0.03125 1.383
0.04688 4.149
0.06250 4.149
0.09375 4.149
0.12500 4.982
0.18750 5.461
0.25000 5.746
0.37500 15.573
0.50000 15.997
0.75000 16.331
1.00000 16.418
1.50000 18.140
2.00000 20.505
3.00000 23.942
4.00000 25.148
6.00000 26.235
8.00000 26.562
12.00000 28.049
16.00000 30.442
24.00000 101.998
32.00000 129.382
48.00000 139.500
64.00000 139.948
96.00000 141.216
128.00000 141.265
192.00000 140.899
256.00000 140.582
384.00000 140.045
512.00000 139.745
768.00000 139.379
1024.00000 139.220


amk@tix2:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=1 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.200
0.00098 1.200
0.00195 1.200
0.00293 1.200
0.00391 1.200
0.00586 1.200
0.00781 1.200
0.01172 1.200
0.01562 1.200
0.02344 1.200
0.03125 1.200
0.04688 3.000
0.06250 3.000
0.09375 3.000
0.12500 3.005
0.18750 3.290
0.25000 4.042
0.37500 16.192
0.50000 16.536
0.75000 16.592
1.00000 16.844
1.50000 18.993
2.00000 20.285
3.00000 23.431
4.00000 24.892
6.00000 25.694
8.00000 26.324
12.00000 53.074
16.00000 108.794
24.00000 121.599
32.00000 124.198
48.00000 124.514
64.00000 125.408
96.00000 125.025
128.00000 124.773
192.00000 124.447
256.00000 124.205
384.00000 123.776
512.00000 123.546
768.00000 123.323
1024.00000 123.189</description>
      <pubDate>Fri, 02 Nov 2012 21:48:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941361#M1942</guid>
      <dc:creator>amk21</dc:creator>
      <dc:date>2012-11-02T21:48:44Z</dc:date>
    </item>
    <item>
      <title>results for: numactl -</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941362#M1943</link>
      <description>results for: numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024

tix8 bios setting
Numa optimized - Enabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Enabled
DCU instruction prefetcher - Enabled

amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.383
0.00098 1.383
0.00195 1.383
0.00293 1.383
0.00391 1.383
0.00586 1.383
0.00781 1.383
0.01172 1.383
0.01562 1.383
0.02344 1.383
0.03125 1.383
0.04688 4.149
0.06250 4.149
0.09375 4.147
0.12500 4.149
0.18750 4.905
0.25000 5.253
0.37500 15.746
0.50000 15.817
0.75000 16.226
1.00000 16.954
1.50000 18.774
2.00000 20.563
3.00000 23.922
4.00000 25.201
6.00000 26.089
8.00000 26.732
12.00000 28.367
16.00000 30.853
24.00000 75.662
32.00000 89.364
48.00000 94.962
64.00000 96.098
96.00000 96.829
128.00000 96.941
192.00000 96.801
256.00000 96.716
384.00000 96.468
512.00000 96.297
768.00000 96.103
1024.00000 95.989


amk@tix2:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.200
0.00098 1.200
0.00195 1.200
0.00293 1.200
0.00391 1.200
0.00586 1.200
0.00781 1.200
0.01172 1.200
0.01562 1.200
0.02344 1.200
0.03125 1.200
0.04688 3.000
0.06250 3.000
0.09375 3.000
0.12500 3.001
0.18750 3.726
0.25000 4.169
0.37500 16.219
0.50000 16.600
0.75000 16.765
1.00000 16.668
1.50000 18.637
2.00000 20.386
3.00000 23.847
4.00000 25.480
6.00000 29.075
8.00000 31.644
12.00000 53.601
16.00000 74.186
24.00000 83.683
32.00000 85.331
48.00000 86.139
64.00000 86.394
96.00000 87.177
128.00000 87.167
192.00000 87.413
256.00000 87.230
384.00000 87.255
512.00000 86.998
768.00000 87.018
1024.00000 86.786</description>
      <pubDate>Fri, 02 Nov 2012 21:53:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941362#M1943</guid>
      <dc:creator>amk21</dc:creator>
      <dc:date>2012-11-02T21:53:17Z</dc:date>
    </item>
    <item>
      <title>adding graph of lat_mem_rd</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941363#M1944</link>
      <description>adding graph of lat_mem_rd results</description>
      <pubDate>Mon, 05 Nov 2012 12:06:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941363#M1944</guid>
      <dc:creator>amk21</dc:creator>
      <dc:date>2012-11-05T12:06:59Z</dc:date>
    </item>
    <item>
      <title>Hello amk21,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941364#M1945</link>
      <description>Hello amk21,
A coworker made a suggestion...
Sandybridge-EP power management is probably putting the 2nd processor into a low power state.
In this low power state the snoops will take longer since the 2nd processor is running at (probably) a low frequency.
Can you try pinning and running this 'spin loop' program on the 2nd processor when you run the latency program on the 1st processor?
The spin.c program... you'll have to kill it with control-c.

#include &lt;stdio.h&gt;
int main(int argc, char **argv)
{
	/* volatile so the compiler keeps the loop; unsigned so wraparound is well defined */
	volatile unsigned int i=0;
	printf("begin spin loop\n");
	while(1) {i++;}
	/* never reached: the loop only ends when the process is killed */
	printf("i= %u\n", i);
	return 0;
}

In order for us to compare your latency numbers with our numbers, you'll need to disable all the prefetchers and enable numa.
But I'd still like to see the impact of the spinner on your 'prefetchers on, numa on' latency.
Pat</description>
      <pubDate>Mon, 05 Nov 2012 22:26:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941364#M1945</guid>
      <dc:creator>Patrick_F_Intel1</dc:creator>
      <dc:date>2012-11-05T22:26:52Z</dc:date>
    </item>
    <item>
      <title>Hello Pat,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941365#M1946</link>
      <description>Hello Pat,

"disable all the prefetchers " - what setting in the bios are you referring to ? 

These are the settings I found in the BIOS and their state:

Numa optimized - Enabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Enabled
DCU instruction prefetcher - Enabled

Regarding the power-saving features (C1, C3 and C6): all of them are disabled, including turbo boost (see the results with turbo boost above).

can you share the reference latency numbers  ?

we tried to replace the memory with other memories ....
currently we are using  8GB X 8  (part number ACT8GHR72Q4H1600S  CL-11)  
we tried to replace it with the memory tix2 is using  4GB * 8 (part number 25L3205 CL-9) 
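
As a rough way to compare the two DIMM types on CAS latency alone, the CL value can be converted to nanoseconds. The sketch below assumes the CL-11 parts run at DDR3-1600 and the CL-9 parts at DDR3-1333 (matching the memory speeds quoted for tix8 and tix2 earlier in the thread); CAS latency is only one component of the load-to-use latency that lat_mem_rd measures.

```python
def cas_latency_ns(cl_cycles, data_rate_mts):
    # DDR transfers twice per clock, so the clock period in ns is 2000 / (MT/s).
    return cl_cycles * 2000.0 / data_rate_mts


print(f"CL-11 at DDR3-1600: {cas_latency_ns(11, 1600):.2f} ns")
print(f"CL-9  at DDR3-1333: {cas_latency_ns(9, 1333):.2f} ns")
```

Under these assumptions the two parts come out within a fraction of a nanosecond of each other, so the CAS rating by itself would probably not explain a large latency difference.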

What is the lowest-latency memory type and memory setup we can use, assuming we need at least 48GB of memory?

amir</description>
      <pubDate>Tue, 06 Nov 2012 06:27:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941365#M1946</guid>
      <dc:creator>amk21</dc:creator>
      <dc:date>2012-11-06T06:27:40Z</dc:date>
    </item>
    <item>
      <title>lat_mem_rd results with spin</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941366#M1947</link>
      <description>lat_mem_rd results with spin loop running on second cpu

bios settings 

Numa optimized - Enabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Enabled
DCU instruction prefetcher - Enabled

turbo boost - disabled 

numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024

"stride=64
0.00049 1.383
0.00098 1.383
0.00195 1.383
0.00293 1.383
0.00391 1.383
0.00586 1.383
0.00781 1.383
0.01172 1.383
0.01562 1.383
0.02344 1.383
0.03125 1.383
0.04688 4.149
0.06250 4.149
0.09375 4.149
0.12500 4.149
0.18750 4.703
0.25000 5.405
0.37500 15.608
0.50000 15.790
0.75000 16.268
1.00000 17.013
1.50000 18.484
2.00000 20.319
3.00000 23.514
4.00000 25.144
6.00000 26.056
8.00000 26.881
12.00000 28.537
16.00000 32.171
24.00000 75.093
32.00000 89.267
48.00000 94.837
64.00000 95.840
96.00000 96.386
128.00000 95.148
192.00000 95.833
256.00000 95.996
384.00000 95.885
512.00000 95.866
768.00000 95.681
1024.00000 95.692

numactl --cpunodebind=0 --membind=1 ./lat_mem_rd -t 1024

"stride=64
0.00049 1.383
0.00098 1.383
0.00195 1.383
0.00293 1.383
0.00391 1.383
0.00586 1.383
0.00781 1.383
0.01172 1.383
0.01562 1.383
0.02344 1.383
0.03125 1.383
0.04688 4.149
0.06250 4.149
0.09375 4.654
0.12500 4.528
0.18750 4.906
0.25000 5.292
0.37500 15.496
0.50000 15.884
0.75000 16.188
1.00000 16.672
1.50000 18.687
2.00000 20.402
3.00000 23.885
4.00000 25.180
6.00000 26.213
8.00000 26.657
12.00000 28.135
16.00000 30.406
24.00000 100.014
32.00000 129.697
48.00000 139.414
64.00000 140.246
96.00000 141.176
128.00000 141.207
192.00000 140.858
256.00000 140.624
384.00000 140.085
512.00000 139.800
768.00000 139.485
1024.00000 139.227</description>
      <pubDate>Tue, 06 Nov 2012 06:39:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941366#M1947</guid>
      <dc:creator>amk21</dc:creator>
      <dc:date>2012-11-06T06:39:26Z</dc:date>
    </item>
    <item>
      <title>Thanks amk21,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941367#M1948</link>
      <description>Thanks amk21,
When you say "with spin loop running on second cpu" do you mean that you are running the spin loop on one of the cpus on the 2nd processor?
That is, you are running the spin loop with something like "numactl --cpunodebind=1 --membind=1 ./spin" ?
I don't see any difference in the latency with the spin loop versus no spin loop so I'm wondering why.
Pat</description>
      <pubDate>Tue, 06 Nov 2012 18:24:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941367#M1948</guid>
      <dc:creator>Patrick_F_Intel1</dc:creator>
      <dc:date>2012-11-06T18:24:04Z</dc:date>
    </item>
    <item>
      <title>the spin loop is running on</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941368#M1949</link>
      <description>the spin loop is running on the second package with  "numactl --cpunodebind=1 --membind=1 ./spin"</description>
      <pubDate>Tue, 06 Nov 2012 18:30:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941368#M1949</guid>
      <dc:creator>amk21</dc:creator>
      <dc:date>2012-11-06T18:30:30Z</dc:date>
    </item>
    <item>
      <title>Ok... I'm going to have to</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941369#M1950</link>
      <description>Ok... I'm going to have to find a box and run on it myself.
I'll let you know what I find.
Pat</description>
      <pubDate>Tue, 06 Nov 2012 18:32:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941369#M1950</guid>
      <dc:creator>Patrick_F_Intel1</dc:creator>
      <dc:date>2012-11-06T18:32:51Z</dc:date>
    </item>
    <item>
      <title>what is the best memory setup</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941371#M1952</link>
      <description>What is the best memory setup and type (CAS latency) I should use for low latency if I need at least 48GB?</description>
      <pubDate>Tue, 06 Nov 2012 18:38:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-bridge-performance-degradation-compare-to-Westmere/m-p/941371#M1952</guid>
      <dc:creator>amk21</dc:creator>
      <dc:date>2012-11-06T18:38:08Z</dc:date>
    </item>
  </channel>
</rss>

