<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic poor code generation; store-forward stall in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/poor-code-generation-store-forward-stall/m-p/953689#M20215</link>
    <description>I am investigating some cases where the code generated by CVF 6.5A is particularly slow on the P4 "NetBurst" architecture.  Setting /architecture:p6 is particularly important, and with it performance is excellent in many situations.  One of the worst cases is where fnstcw (a 16-bit store) is always followed by a 32-bit load of the same location, even though only those 16 bits are used.  The wider load cannot be forwarded from the narrower store, so it stalls, and modifying all instances of that load instruction may more than double performance. &lt;BR /&gt; &lt;BR /&gt;Are these obstacles to P4 performance already under review, and would bug reports be appropriate?  I didn't find anything by searching the forum, but the forum response is extremely slow on my home ISP. &lt;BR /&gt; &lt;BR /&gt;Where the NetBurst parallel instructions are needed to achieve the potential of the P4, a new architecture switch would be required.  Is there any interest in this?</description>
    <pubDate>Fri, 08 Jun 2001 00:12:48 GMT</pubDate>
    <dc:creator>TimP</dc:creator>
    <dc:date>2001-06-08T00:12:48Z</dc:date>
    <item>
      <title>poor code generation; store-forward stall</title>
      <link>https://community.intel.com/t5/Software-Archive/poor-code-generation-store-forward-stall/m-p/953689#M20215</link>
      <description>I am investigating some cases where the code generated by CVF 6.5A is particularly slow on the P4 "NetBurst" architecture.  Setting /architecture:p6 is particularly important, and with it performance is excellent in many situations.  One of the worst cases is where fnstcw (a 16-bit store) is always followed by a 32-bit load of the same location, even though only those 16 bits are used.  The wider load cannot be forwarded from the narrower store, so it stalls, and modifying all instances of that load instruction may more than double performance. &lt;BR /&gt; &lt;BR /&gt;Are these obstacles to P4 performance already under review, and would bug reports be appropriate?  I didn't find anything by searching the forum, but the forum response is extremely slow on my home ISP. &lt;BR /&gt; &lt;BR /&gt;Where the NetBurst parallel instructions are needed to achieve the potential of the P4, a new architecture switch would be required.  Is there any interest in this?</description>
      <pubDate>Fri, 08 Jun 2001 00:12:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/poor-code-generation-store-forward-stall/m-p/953689#M20215</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2001-06-08T00:12:48Z</dc:date>
    </item>
    <item>
      <title>Re: poor code generation; store-forward stall</title>
      <link>https://community.intel.com/t5/Software-Archive/poor-code-generation-store-forward-stall/m-p/953690#M20216</link>
      <description>We're always interested in specific examples of places where we can generate better code - though I think the one you describe is one we already know about.  Please send a short example, if you can, and a description of how you think the code should be improved, to us at vf-support@compaq.com.  We've received a number of examples from folks at AMD - we'd welcome them from Intel as well.&lt;BR /&gt;&lt;BR /&gt;In any event, the next update to CVF will include an architecture switch to specify that the processor is a P4, so that we generate appropriate instructions for it.  We have found that Pentium III assumptions don't hold for the P4, which is why you have to say /arch:P6 in CVF 6.5.&lt;BR /&gt;&lt;BR /&gt;I will say, though, that we've found the P4 to be an uneven performer, even using Intel's compiler (which is sometimes better, sometimes worse than CVF on a P4).  We have some benchmark programs where a 1.4GHz P4 performs worse than an 850MHz PIII.  A 1.1GHz AMD Athlon usually outperforms the 1.4GHz P4 across the board.  The P4 is REALLY good at memory-bandwidth-intensive programs, though.&lt;BR /&gt;&lt;BR /&gt;It's not clear to us that generating SSE2 instructions is the key to "achieve the potential of P4" - Intel's own published papers say that SSE2 gained only 5% on the SPEC benchmark tests.  Nevertheless, we are quite interested in seeing what is available to boost performance on each of our supported processors - so feel free to send us specific suggestions.&lt;BR /&gt;&lt;BR /&gt;Steve</description>
      <pubDate>Fri, 08 Jun 2001 02:04:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/poor-code-generation-store-forward-stall/m-p/953690#M20216</guid>
      <dc:creator>Steven_L_Intel1</dc:creator>
      <dc:date>2001-06-08T02:04:48Z</dc:date>
    </item>
    <item>
      <title>Re: poor code generation; store-forward stall</title>
      <link>https://community.intel.com/t5/Software-Archive/poor-code-generation-store-forward-stall/m-p/953691#M20217</link>
      <description>CVF frequently outperforms the P-II-compatible code generated by the Intel compiler, and the Intel compiler, when SSE is enabled, sometimes chooses SSE code even where it is slower than P-II-compatible code.  The few cases where the Intel compiler generates SSE code that is clearly faster than CVF generic code involve vectorization, which may produce a 90% improvement; storing real to integer (which would be helped a great deal by fixing the issue raised above); and math functions, where the internal firmware is not the best choice.</description>
      <pubDate>Fri, 08 Jun 2001 02:34:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/poor-code-generation-store-forward-stall/m-p/953691#M20217</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2001-06-08T02:34:53Z</dc:date>
    </item>
  </channel>
</rss>

