<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic A follow up if anyone in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/New-Xeon-poorer-performance/m-p/1035480#M4389</link>
    <description>&lt;P&gt;A follow up if anyone encounters the same issue. After much searching, it turned out to be a BIOS setting from the manufacturer, the "snoop mode". It was set to early snoop, and changing it to "cluster on die" sped things up. The finite element comparison went from 15% slower than the previous v2 to 25% faster. From what I can tell, "cluster on die" is the way to go performance-wise; I eventually found some benchmarks that compared these settings and reported similar results.&lt;/P&gt;

&lt;P&gt;A side note, though: I also run a different code that isn't matrix heavy, a Monte Carlo particle transport code, and this change made no difference there. I'm still seeing the v3's run nearly 50% slower than the v2's; the CPUs I'm comparing are Intel(R) Xeon(R) CPU E5-2697 v3 and Intel(R) Xeon(R) CPU E5-2697 v2. I'm trying to track down why, since I'm still better off buying the v2's unless I can find additional settings that at least match the v2 performance there. I'm not sure whether dropping to the 8-core architecture would give different results.&lt;/P&gt;</description>
    <pubDate>Fri, 12 Jun 2015 23:29:39 GMT</pubDate>
    <dc:creator>Jack_G_</dc:creator>
    <dc:date>2015-06-12T23:29:39Z</dc:date>
    <item>
      <title>New Xeon, poorer performance</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/New-Xeon-poorer-performance/m-p/1035479#M4388</link>
      <description>&lt;P&gt;Several scientific computing benchmarks of the E5-2697 v3 versus the E5-2697 v2 gave me the impression that the v3's should perform better, even though they are clocked 0.1 GHz lower. I'm getting odd results on the heterogeneous cluster I run on (CentOS, kernel 2.6.32-504.el6.x86_64).&lt;/P&gt;

&lt;P&gt;Basically, the E5-2697 v2's are clearly outperforming their v3 counterparts (~15% faster). I'm running a finite element code on them, compiled with the Intel compiler products 15.0.2 (ifort, icc, icpc, etc.). The timings, whether parallel within a node or serial on each node, are much slower on the v3's than I expected. I ran a calculation on each of the 4 different types of nodes I have on the cluster, all named "tebowXXX":&lt;/P&gt;

&lt;TABLE border="0" cellpadding="0" cellspacing="0" style="width:495px;" width="494"&gt;
	&lt;COLGROUP&gt;
		&lt;COL /&gt;
		&lt;COL /&gt;&lt;/COLGROUP&gt;
	&lt;TBODY&gt;
		&lt;TR height="20"&gt;
			&lt;TD height="20" style="height:20px;width:71px;"&gt;Tebow135&lt;/TD&gt;
			&lt;TD style="width:424px;"&gt;&amp;nbsp;Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz&amp;nbsp;&lt;/TD&gt;
		&lt;/TR&gt;
		&lt;TR height="21"&gt;
			&lt;TD height="21" style="height:21px;"&gt;Tebow123&lt;/TD&gt;
			&lt;TD&gt;Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz&lt;/TD&gt;
		&lt;/TR&gt;
		&lt;TR height="21"&gt;
			&lt;TD height="21" style="height:21px;"&gt;Tebow117&lt;/TD&gt;
			&lt;TD&gt;Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz&lt;/TD&gt;
		&lt;/TR&gt;
		&lt;TR height="20"&gt;
			&lt;TD height="20" style="height:20px;"&gt;Tebow101&lt;/TD&gt;
			&lt;TD&gt;Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz, overclocked to 4.2 GHz&lt;/TD&gt;
		&lt;/TR&gt;
	&lt;/TBODY&gt;
&lt;/TABLE&gt;

&lt;P&gt;The results are on sheet 1 of the attached Excel file, but I also want to go through what I have verified. There is no thermal throttling going on (I check core_throttle_count to make sure). I also checked that there was no OS throttling through the kondemand process or anything similar; the nodes all report running at stock speeds (the overclocked node still reports 3.4 GHz, but I know I've boosted it). I compared the memory info (see the attachment) between tebow135 and tebow123, since these make up the bulk of our cluster and the speed difference between them is enough to preclude using them effectively together. Besides, these are the newer nodes and I want them running faster if possible. The code was compiled on the head node, which is identical to tebow117 (E5-2687W 0 @ 3.10GHz).&lt;/P&gt;

&lt;P&gt;I wasn't sure of the best way to compare parallel runs that used more or fewer CPUs than the control, so the raw data is there to look at as well. Basically, I tried to do a meaningful percentage comparison of each of the 4 node types; the parallel runs used 28 and 24 CPUs on the v3, 24 CPUs on the v2, and 16 CPUs on the 2687W. I also did serial runs on all of these to take MPI (OpenMPI 1.8.4) out of the equation, and I still saw the v3 running ~15% slower than the v2. The 2687W also scored well in serial, but I believe it was most likely running in turbo boost mode, so the 3.1 GHz figure is not as meaningful (3.8 GHz, I think). I don't know whether the parallel comparison is meaningful the way I crunched it, on a "Seconds/processor/GHz" scale. Use it with caution; the serial comparisons are probably the best.&lt;/P&gt;
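
&lt;P&gt;The thermal throttle check described above could be sketched roughly as follows. This is only an illustration, not necessarily how the check was run on these nodes: it assumes the usual Linux sysfs layout (/sys/devices/system/cpu/cpuN/thermal_throttle/core_throttle_count), and the function names read_throttle_counts and throttled_cpus are made up for the example.&lt;/P&gt;

```python
import glob
import os
import re


def read_throttle_counts(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_index: core_throttle_count} for every CPU exposing the counter.

    Assumption: counters live at cpuN/thermal_throttle/core_throttle_count
    below sysfs_root, as on typical x86 Linux kernels.
    """
    counts = {}
    pattern = os.path.join(
        sysfs_root, "cpu[0-9]*", "thermal_throttle", "core_throttle_count"
    )
    for path in glob.glob(pattern):
        # path looks like .../cpu12/thermal_throttle/core_throttle_count
        cpu_dir = os.path.basename(os.path.dirname(os.path.dirname(path)))
        match = re.fullmatch(r"cpu(\d+)", cpu_dir)
        if match is None:
            continue  # skip anything that is not a plain cpuN directory
        with open(path) as handle:
            counts[int(match.group(1))] = int(handle.read().strip())
    return counts


def throttled_cpus(counts):
    """Return the sorted CPU indices whose throttle counter is non-zero."""
    return sorted(cpu for cpu, count in counts.items() if count != 0)


if __name__ == "__main__":
    counts = read_throttle_counts()
    bad = throttled_cpus(counts)
    print("checked", len(counts), "CPUs; throttled:", bad if bad else "none")
```

&lt;P&gt;The counters are cumulative since boot, so a non-zero value only says that throttling happened at some point; comparing the values before and after a run is more telling than a single reading.&lt;/P&gt;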

&lt;P&gt;So, after all this, can anyone help me troubleshoot why I'm getting such poor performance, and tell me whether it is expected? As I mentioned, the benchmarks I read didn't show this. Could it be compiler issues, Linux configuration issues, InfiniBand issues (I ran serial as well as parallel jobs dedicated to each node, so I thought I had minimized communication differences, although the file system is shared over NFSoRDMA), or something else I can't think of? Any thoughts or troubleshooting help is welcome.&lt;/P&gt;

&lt;P&gt;Thanks, Jack&lt;/P&gt;</description>
      <pubDate>Thu, 11 Jun 2015 00:00:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/New-Xeon-poorer-performance/m-p/1035479#M4388</guid>
      <dc:creator>Jack_G_</dc:creator>
      <dc:date>2015-06-11T00:00:43Z</dc:date>
    </item>
    <item>
      <title>A follow up if anyone</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/New-Xeon-poorer-performance/m-p/1035480#M4389</link>
      <description>&lt;P&gt;A follow up if anyone encounters the same issue. After much searching, it turned out to be a BIOS setting from the manufacturer, the "snoop mode". It was set to early snoop, and changing it to "cluster on die" sped things up. The finite element comparison went from 15% slower than the previous v2 to 25% faster. From what I can tell, "cluster on die" is the way to go performance-wise; I eventually found some benchmarks that compared these settings and reported similar results.&lt;/P&gt;
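
&lt;P&gt;One way to confirm from Linux that the BIOS change took effect is to count the NUMA nodes the kernel sees. The sketch below is an illustration under two assumptions: that "cluster on die" exposes each socket as two NUMA nodes, and that nodes appear as nodeN directories under /sys/devices/system/node. The function names and the sockets argument are hypothetical; the socket count has to be supplied by hand.&lt;/P&gt;

```python
import glob
import os
import re


def count_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Count the nodeN directories the kernel exposes under sysfs_root."""
    names = map(os.path.basename, glob.glob(os.path.join(sysfs_root, "node[0-9]*")))
    return sum(1 for name in names if re.fullmatch(r"node\d+", name))


def looks_like_cluster_on_die(numa_nodes, sockets):
    """Assumption: cluster on die splits each socket into two NUMA nodes."""
    return numa_nodes == 2 * sockets


if __name__ == "__main__":
    sockets = 2  # assumed dual-socket node; adjust to match the hardware
    nodes = count_numa_nodes()
    print("NUMA nodes:", nodes)
    print("consistent with cluster on die:", looks_like_cluster_on_die(nodes, sockets))
```

&lt;P&gt;On a dual-socket node this would report 4 NUMA nodes with "cluster on die" and 2 without, if the assumption above holds for the board in question.&lt;/P&gt;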

&lt;P&gt;A side note, though: I also run a different code that isn't matrix heavy, a Monte Carlo particle transport code, and this change made no difference there. I'm still seeing the v3's run nearly 50% slower than the v2's; the CPUs I'm comparing are Intel(R) Xeon(R) CPU E5-2697 v3 and Intel(R) Xeon(R) CPU E5-2697 v2. I'm trying to track down why, since I'm still better off buying the v2's unless I can find additional settings that at least match the v2 performance there. I'm not sure whether dropping to the 8-core architecture would give different results.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Jun 2015 23:29:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/New-Xeon-poorer-performance/m-p/1035480#M4389</guid>
      <dc:creator>Jack_G_</dc:creator>
      <dc:date>2015-06-12T23:29:39Z</dc:date>
    </item>
  </channel>
</rss>

