<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Thanks Sergey. in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935730#M3600</link>
    <description>&amp;gt;&amp;gt;&amp;gt;SSE Sqrt - RTfloat
 Calculating the Square Root of 625.000 - 47 ticks
 625.000^0.5 = 25.000&amp;gt;&amp;gt;&amp;gt;

It would be interesting to know which sqrt calculation method the hardware-accelerated SSE instruction uses.</description>
    <pubDate>Tue, 05 Feb 2013 06:15:00 GMT</pubDate>
    <dc:creator>Bernard</dc:creator>
    <dc:date>2013-02-05T06:15:00Z</dc:date>
    <item>
      <title>Performance of sqrt</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935692#M3562</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;I am using the intrinsic for square root. I know from the Optimization manual that I could use the reciprocal square root and an approximation algorithm, but I need the accuracy.&lt;/P&gt;
&lt;P&gt;The thing is that AVX shows no improvement over SSE. The Intrinsics guide gave me some hints. Is it true that the square root operation is not pipelined for either SSE or AVX? At least the latency and throughput figures indicate this. I mean, AVX processes twice the data per operation, but double the latency and half the throughput means the combined performance is the same. Is that so?&lt;/P&gt;
&lt;P&gt;My test system is an i5-2410M. In the Intrinsics guide (I updated to the newest version) I only find latency and throughput for Sandy Bridge. Has the performance of these instructions improved in Ivy Bridge? Could anyone explain the CPUID(s) a little bit? 06_2A means Sandy Bridge or does it not? Does this account for all Sandy Bridge CPUs (regardless of Desktop or Mobile or i3, i5, i7)?&lt;/P&gt;
&lt;P&gt;For CPUID(s) I found: &lt;A href="http://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers" target="_blank"&gt;http://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Does the intrinsics guide refer to a combination of family and model number? What about model numbers not mentioned in the intrinsics guide like Ivy Bridge?&lt;/P&gt;</description>
      <pubDate>Fri, 01 Feb 2013 11:35:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935692#M3562</guid>
      <dc:creator>Christian_M_2</dc:creator>
      <dc:date>2013-02-01T11:35:31Z</dc:date>
    </item>
    <item>
      <title> &gt;&gt;&gt;Could anyone explain the</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935693#M3563</link>
      <description>&lt;P&gt;&amp;nbsp;&amp;gt;&amp;gt;&amp;gt;Could anyone explain the CPUID(s) a little bit? 06_2A means Sandy Bridge or does it not?&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;That means 32 nm&amp;nbsp;Sandy Bridge microarchitecture.&lt;/P&gt;
&lt;P&gt;Please look at this link, which is more related to execution speed (a comparison between SSE sqrt(x) and invsqrt multiplied by x):&lt;/P&gt;
&lt;P&gt;&lt;A href="http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x"&gt;http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 01 Feb 2013 16:19:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935693#M3563</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-02-01T16:19:00Z</dc:date>
    </item>
    <item>
      <title>Quote:iliyapolak wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935694#M3564</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;iliyapolak wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;gt;&amp;gt;&amp;gt;Could anyone explain the CPUID(s) a little bit? 06_2A means Sandy Bridge or does it not?&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;That means 32 nm&amp;nbsp;Sandy Bridge microarchitecture.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;This already brings me closer.&lt;/P&gt;
&lt;P&gt;But what about Ivy Bridge and other unmentioned CPUID(s)? Does anybody have some tips?&lt;/P&gt;</description>
      <pubDate>Fri, 01 Feb 2013 16:25:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935694#M3564</guid>
      <dc:creator>Christian_M_2</dc:creator>
      <dc:date>2013-02-01T16:25:56Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;The thing is that AVX</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935695#M3565</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;The thing is that AVX shows no improvement over SSE&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;Maybe the microcode implementation of the sqrt algorithm is exactly the same for the AVX and SSE instructions.&lt;/P&gt;</description>
      <pubDate>Fri, 01 Feb 2013 16:38:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935695#M3565</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-02-01T16:38:04Z</dc:date>
    </item>
    <item>
      <title>I think AVX sqrt</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935696#M3566</link>
      <description>&lt;P&gt;I think the AVX sqrt implementation simply runs the SSE implementation on the lower and upper halves of the YMM register, since the latency is doubled for twice the amount of data.&lt;/P&gt;
&lt;P&gt;But I am not sure whether this holds for all Sandy Bridge CPUs or only shows up because I am testing on a mid-range mobile Sandy Bridge.&lt;/P&gt;</description>
      <pubDate>Fri, 01 Feb 2013 16:59:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935696#M3566</guid>
      <dc:creator>Christian_M_2</dc:creator>
      <dc:date>2013-02-01T16:59:49Z</dc:date>
    </item>
    <item>
      <title>As Christian hinted, the</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935697#M3567</link>
      <description>&lt;P&gt;As Christian hinted, the hardware implementation of IEEE divide and sqrt on Sandy and Ivy Bridge sequences the operands into AVX-128 pairs, so it's likely there is little performance gain for AVX-256 vs. SSE/SSE2 or AVX-128.&amp;nbsp; Ivy Bridge greatly reduces the latency so there may not be an incentive to employ the reciprocal and iteration option. That is termed a throughput optimization for single/float precision, in that it opens up opportunity for instruction level parallelism in a loop which has significant work in instructions other than divide/sqrt.&lt;/P&gt;</description>
      <pubDate>Fri, 01 Feb 2013 17:34:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935697#M3567</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2013-02-01T17:34:09Z</dc:date>
    </item>
    <item>
      <title>@Tim</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935698#M3568</link>
      <description>&lt;P&gt;@Tim&lt;/P&gt;
&lt;P&gt;Is it possible to obtain information about the exact algorithm used to calculate sqrt values on Intel CPUs?&lt;/P&gt;</description>
      <pubDate>Sat, 02 Feb 2013 06:15:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935698#M3568</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-02-02T06:15:39Z</dc:date>
    </item>
    <item>
      <title>Christian,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935699#M3569</link>
      <description>Christian,

Let me know if you need real performance numbers for different &lt;STRONG&gt;sqrt&lt;/STRONG&gt; functions and floating-point types ( 6 tests in total / 5 different C++ compilers ). I can do it for &lt;STRONG&gt;Intel Core i7-3840QM&lt;/STRONG&gt; ( Ivy Bridge / 4 cores ) and older CPUs, for example &lt;STRONG&gt;Intel Pentium 4&lt;/STRONG&gt;.</description>
      <pubDate>Sat, 02 Feb 2013 06:34:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935699#M3569</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-02-02T06:34:39Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...As latency is doubled</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935700#M3570</link>
      <description>&amp;gt;&amp;gt;...As latency is doubled for double data amount...

With SSE, the performance numbers are almost the same for the following test cases:

&lt;STRONG&gt;[ Test-case 1 ]&lt;/STRONG&gt;

RTfloat fA = 625.0f;
mmValue.m128_f32[0] = ( RTfloat )fA;
mmValue.m128_f32[1] = ( RTfloat )0.0f; // since this is 0.0 sqrt shouldn't be calculated
mmValue.m128_f32[2] = ( RTfloat )0.0f; // since this is 0.0 sqrt shouldn't be calculated
mmValue.m128_f32[3] = ( RTfloat )0.0f; // since this is 0.0 sqrt shouldn't be calculated
mmResult = _mm_sqrt_ps( mmValue );

&lt;STRONG&gt;[ Test-case 2 ]&lt;/STRONG&gt;

RTfloat fA = 625.0f;
mmValue.m128_f32[0] = ( RTfloat )fA;
mmValue.m128_f32[1] = ( RTfloat )fA;
mmValue.m128_f32[2] = ( RTfloat )fA;
mmValue.m128_f32[3] = ( RTfloat )fA;
mmResult = _mm_sqrt_ps( mmValue );</description>
      <pubDate>Sat, 02 Feb 2013 06:44:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935700#M3570</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-02-02T06:44:11Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...06_2A means Sandy Bridge</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935701#M3571</link>
      <description>&amp;gt;&amp;gt;...06_2A means Sandy Bridge or does it not?..

I'll take a look.

In general, you need to get more detailed information like:
...
CPU Brand String:          Intel(R) Atom(TM) CPU N270   @ 1.60GHz
CPU Vendor      : GenuineIntel
	Stepping ID = 2
	Model = 12
	Family = 6
	Extended Model = 1
...
and then to "map" these numbers to codes in the manual.</description>
      <pubDate>Sat, 02 Feb 2013 06:52:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935701#M3571</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-02-02T06:52:17Z</dc:date>
    </item>
    <item>
      <title>Quote:TimP (Intel) wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935702#M3572</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;TimP (Intel) wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;As Christian hinted, the hardware implementation of IEEE divide and sqrt on Sandy and Ivy Bridge sequences the operands into AVX-128 pairs, so it's likely there is little performance gain for AVX-256 vs. SSE/SSE2 or AVX-128.&amp;nbsp; Ivy Bridge greatly reduces the latency so there may not be an incentive to employ the reciprocal and iteration option. That is termed a throughput optimization for single/float precision, in that it opens up opportunity for instruction level parallelism in a loop which has significant work in instructions other than divide/sqrt.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;The thing with Ivy Bridge is really interesting. With add and mul Sandy Bridge already allows quite good instruction level parallelism. If one result is not directly based on the preceding operations, one can fill the pipeline very well and get nearly one result per clock, I suppose.&lt;/P&gt;
&lt;P&gt;Can one also find the Ivy Bridge optimizations in the Intrinsics guide? I do not find the appropriate CPUID. If 06_2A is Sandy Bridge, then according to the table from &lt;A href="http://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers" target="_blank"&gt;http://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers&lt;/A&gt;, Ivy Bridge should have 06_3A. But I can't find it in the Intrinsics guide for any instruction (I have not checked every one, only those that are important for me).&lt;/P&gt;
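<![CDATA[Labels such as 06_2A and 06_3A are DisplayFamily_DisplayModel values. The following is a minimal sketch in C of how such a label can be derived from the raw EAX value returned by CPUID leaf 1, following the DisplayFamily/DisplayModel rule given in the Intel Software Developer's Manual. The struct and function names are hypothetical, and the signature values mentioned in the note below are assumed examples, not readings taken from the systems discussed in this thread.

```c
#include <stdint.h>
#include <stdio.h>

/* Raw fields of the CPUID leaf 1 EAX signature plus the combined
   DisplayFamily / DisplayModel values (names are illustrative). */
typedef struct {
    unsigned stepping, model, family, ext_model, ext_family;
    unsigned display_family, display_model;
} cpu_signature;

cpu_signature decode_signature(uint32_t eax)
{
    cpu_signature s;
    s.stepping   =  eax        & 0xF;   /* bits 3:0   */
    s.model      = (eax >> 4)  & 0xF;   /* bits 7:4   */
    s.family     = (eax >> 8)  & 0xF;   /* bits 11:8  */
    s.ext_model  = (eax >> 16) & 0xF;   /* bits 19:16 */
    s.ext_family = (eax >> 20) & 0xFF;  /* bits 27:20 */

    /* The extended family is only added when Family is 0xF. */
    s.display_family = (s.family == 0xF) ? s.family + s.ext_family
                                         : s.family;

    /* The extended model only counts when Family is 0x6 or 0xF. */
    s.display_model = (s.family == 0x6 || s.family == 0xF)
                          ? (s.ext_model << 4) + s.model
                          : s.model;
    return s;
}

/* Writes a label such as "06_2A" into out (at least 8 bytes). */
void format_signature(const cpu_signature *s, char out[8])
{
    snprintf(out, 8, "%02X_%02X", s->display_family, s->display_model);
}
```

With this rule, an assumed signature of 0x000206A7 decodes to 06_2A and 0x000306A9 to 06_3A. The Atom N270 fields quoted earlier in the thread (Family 6, Model 12, Extended Model 1) combine to 06_1C.]]>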
&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Sergey Kostrov wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Christian,&lt;/P&gt;
&lt;P&gt;Let me know if you need real performance numbers for different &lt;STRONG&gt;sqrt&lt;/STRONG&gt; functions and floating-point types ( 6 tests in total / 5 different C++ compilers ). I can do it for &lt;STRONG&gt;Intel Core i7-3840QM&lt;/STRONG&gt; ( Ivy Bridge / 4 cores ) and older CPUs, for example &lt;STRONG&gt;Intel Pentium 4&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;This would be great! I am especially interested in the performance of the precise square root operation. Different CPUs would be a good indicator. I wonder whether the results also differ within a CPU family.&lt;/P&gt;
&lt;P&gt;You mentioned that I should "map" these numbers to codes in the manual. Which manual are you talking about exactly?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 02 Feb 2013 16:59:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935702#M3572</guid>
      <dc:creator>Christian_M_2</dc:creator>
      <dc:date>2013-02-02T16:59:11Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;... Which manual are you</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935703#M3573</link>
      <description>&amp;gt;&amp;gt;... Which manual are you talking about exactly?

Please take a look at: &lt;A href="http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html" target="_blank"&gt;http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html&lt;/A&gt;</description>
      <pubDate>Sat, 02 Feb 2013 18:49:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935703#M3573</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-02-02T18:49:26Z</dc:date>
    </item>
    <item>
      <title>Here are a couple of more</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935704#M3574</link>
      <description>Here are a couple more links &amp;amp; tips:

- You need to look at &lt;STRONG&gt;Intel 64 and IA-32 Architectures Optimization Reference Manual&lt;/STRONG&gt;, &lt;STRONG&gt;APPENDIX C&lt;/STRONG&gt;, INSTRUCTION LATENCY AND THROUGHPUT

- Try to use &lt;STRONG&gt;msinfo32.exe&lt;/STRONG&gt; utility ( it provides some CPU information )

- http://ark.intel.com -&amp;gt; &lt;A href="http://ark.intel.com/products/52224/Intel-Core-i5-2410M-Processor-3M-Cache-up-to-2_90-GHz?q=i5-2410M" target="_blank"&gt;http://ark.intel.com/products/52224/Intel-Core-i5-2410M-Processor-3M-Cache-up-to-2_90-GHz?q=i5-2410M&lt;/A&gt;

Note: Take a look at a &lt;STRONG&gt;datasheet&lt;/STRONG&gt; for your i5-2410M CPU in a &lt;STRONG&gt;Quick Links&lt;/STRONG&gt; section ( on the right side of the web page )

- &lt;A href="http://software.intel.com/en-us/forums/topic/278742" target="_blank"&gt;http://software.intel.com/en-us/forums/topic/278742&lt;/A&gt;</description>
      <pubDate>Sat, 02 Feb 2013 19:10:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935704#M3574</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-02-02T19:10:03Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;I am especially interested</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935705#M3575</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;I am especially interested in the performance of the precise square root operation&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;Here is a very interesting discussion about hardware-accelerated sqrt calculation:&lt;/P&gt;
&lt;P&gt;&lt;A href="http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x" target="_blank"&gt;http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 03 Feb 2013 06:24:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935705#M3575</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-02-03T06:24:43Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;I am especially interested</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935706#M3576</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;I am especially interested in the performance of the precise square root operation.&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;Follow this link : &lt;A href="http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x" target="_blank"&gt;http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 03 Feb 2013 07:19:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935706#M3576</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-02-03T07:19:33Z</dc:date>
    </item>
    <item>
      <title>  &gt;&gt;&gt;With add and mul Sandy</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935707#M3577</link>
      <description>&lt;P&gt;&amp;nbsp; &amp;gt;&amp;gt;&amp;gt;With add and mul Sandy Bridge already allows quite good instruction level parallelism&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;Sandy Bridge really improved instruction level parallelism by adding one or two new ports to the execution cluster. So, for example, when your code has an fp add (one vector addition) and an fp mul (one vector multiplication) that do not depend on each other, they can be executed simultaneously.&lt;/P&gt;</description>
      <pubDate>Sun, 03 Feb 2013 07:38:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935707#M3577</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-02-03T07:38:42Z</dc:date>
    </item>
    <item>
      <title>"&gt;&gt;&gt;I am especially</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935708#M3578</link>
      <description>&lt;P&gt;"&amp;gt;&amp;gt;&amp;gt;I am especially interested in the performance of the precise square root operation.&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;Follow this link : &lt;A href="http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x"&gt;http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slowe...&lt;/A&gt;"&lt;/P&gt;
&lt;P&gt;These imprecise operations are available via Intel compiler options&lt;/P&gt;
&lt;P&gt;/Qimf-accuracy-bits:bits[:funclist]&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; define the relative error, measured by the number of correct bits,&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; for math library function results&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; bits&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; - a positive, floating-point number&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; funclist - optional comma separated list of one or more math&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; library functions to which the attribute should be&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; applied&lt;/P&gt;
&lt;P&gt;So you can request the 13-bit accuracy implementation of divide and sqrt. Iterative methods with less than full precision can be produced by requesting 20-, 40-, or 49-bit accuracy.&amp;nbsp; 22-bit accuracy is the default for single precision vectorization; -Qprec-div -Qprec-sqrt (implied by /fp:source|precise) changes the default to 24/53-bit accuracy.&amp;nbsp; Beginning with Harpertown, the IEEE instructions, referred to as "native" in your references, have been quite competitive for SSE/SSE2.&amp;nbsp;&amp;nbsp; The original Core 2 Duo with the slower divide and sqrt is no longer in production.&amp;nbsp; I turned mine in after 4.5 years rather than re-install Windows a 4th time.&lt;/P&gt;
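<![CDATA[The trade-off between the IEEE sqrt instruction and an iterative method can be illustrated with a small sketch in C. It assumes an x86 target with SSE. The function names are hypothetical, and this is not a claim about the code the Intel compiler actually generates for the options above. The approximate path starts from the _mm_rsqrt_ps estimate (documented relative error at most 1.5 * 2^-12) and applies one Newton-Raphson step, which typically lands near, but not at, full single precision.

```c
#include <xmmintrin.h>  /* SSE: _mm_sqrt_ps, _mm_rsqrt_ps */

/* IEEE-accurate path: one SQRTPS over four floats. */
void sqrt4_ieee(const float in[4], float out[4])
{
    _mm_storeu_ps(out, _mm_sqrt_ps(_mm_loadu_ps(in)));
}

/* Approximate path: RSQRTPS estimate, one Newton-Raphson step,
   then a multiply by the input.
   Caveat: an input of 0.0f produces NaN here (0 * inf), so zero
   has to be handled separately by the caller. */
void sqrt4_rsqrt_nr(const float in[4], float out[4])
{
    const __m128 half         = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);
    __m128 a   = _mm_loadu_ps(in);
    __m128 y   = _mm_rsqrt_ps(a);                    /* y ~ 1/sqrt(a) */
    __m128 ayy = _mm_mul_ps(a, _mm_mul_ps(y, y));    /* a * y * y     */
    /* Newton-Raphson: y' = y * (1.5 - 0.5 * a * y * y) */
    y = _mm_mul_ps(y, _mm_sub_ps(three_halves, _mm_mul_ps(half, ayy)));
    _mm_storeu_ps(out, _mm_mul_ps(a, y));            /* sqrt(a) = a * y' */
}

/* Largest relative difference between the two paths over four lanes.
   Inputs are assumed to be positive and finite. */
float max_rel_diff4(const float in[4])
{
    float exact[4], approx[4], worst = 0.0f;
    sqrt4_ieee(in, exact);
    sqrt4_rsqrt_nr(in, approx);
    for (int i = 0; i < 4; ++i) {
        float d = (approx[i] - exact[i]) / exact[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

Calling max_rel_diff4 on a few positive inputs gives a quick feel for how many bits the iterative path gives up. Whether it is actually faster than sqrt4_ieee depends on the microarchitecture and on how much other work surrounds it in the loop.]]>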
&lt;P&gt;The x87 divide and sqrt also support a trade-off between speed and precision, by setting 24-, 53- (default for Intel and Microsoft compilers) or 64- (hardware default, /Qpc80) bit precision mode.&lt;/P&gt;
&lt;P&gt;You also have the choice, since SSE, of gradual underflow (/Qftz-) to maintain precision in the presence of partial underflow.&amp;nbsp; Sandy Bridge removes the performance penalty for /Qftz- in most common situations.&amp;nbsp; This was done in part because it's not convenient to set abrupt underflow when using Microsoft or gnu compilers.&lt;/P&gt;
&lt;P&gt;All these options are more than most developers are willing to bargain for (and QA test).&amp;nbsp; That's one of the reasons for availability of IEEE standard compliant instructions and for progress at the hardware level in making them more efficient.&lt;/P&gt;</description>
      <pubDate>Sun, 03 Feb 2013 12:32:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935708#M3578</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2013-02-03T12:32:16Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;Follow this link : http:/</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935709#M3579</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;Follow this link : &lt;A href="http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x"&gt;http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slowe...&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Quite interesting discussion, it provides a lot of information.&lt;/P&gt;
&lt;P&gt;I found the following discussion about square root and AVX: &lt;A href="http://stackoverflow.com/questions/8924729/using-avx-intrinsics-instead-of-sse-does-not-improve-speed-why" target="_blank"&gt;http://stackoverflow.com/questions/8924729/using-avx-intrinsics-instead-of-sse-does-not-improve-speed-why&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;One answer there mentions something about instruction emulation. Is it true that a low-end processor (let's take an i3 Sandy Bridge) has different or fewer execution units than an i7 Sandy Bridge?&lt;/P&gt;</description>
      <pubDate>Sun, 03 Feb 2013 14:04:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935709#M3579</guid>
      <dc:creator>Christian_M_2</dc:creator>
      <dc:date>2013-02-03T14:04:12Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;These imprecise operations</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935710#M3580</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;These imprecise operations are available via Intel compiler options ...&lt;/P&gt;
&lt;P&gt;Wow, this information is quite new to me. I did not know one could control accuracy.&lt;/P&gt;
&lt;P&gt;&amp;gt;&amp;gt;&amp;gt; Beginning with Harpertown, the IEEE instructions, referred to as "native" in your references, have been quite competitive for SSE/SSE2.&lt;/P&gt;
&lt;P&gt;So this was the first time IEEE compliant instructions provided quite good speed compared to other SSE/SSE2 versions?&lt;/P&gt;
&lt;P&gt;And regarding x87: I found that some compilers use only the x87 FPU in 32-bit mode, and when the same code is compiled for 64-bit mode, SSE is used (scalar version only). Is this also something that can be controlled? For some algorithms high accuracy might be useful. The x87 FPU provides the most precision with 80 bits. This cannot be achieved with SSE any more.&lt;/P&gt;</description>
      <pubDate>Sun, 03 Feb 2013 14:25:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935710#M3580</guid>
      <dc:creator>Christian_M_2</dc:creator>
      <dc:date>2013-02-03T14:25:28Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;For some algorithms high</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935711#M3581</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;For some algorithms high accuracy might be useful. The x87 FPU provides the most precision with 80 bits. This cannot be achieved with SSE any more.&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;Yes, because it is the developer's decision and/or a project constraint to favor precision over vectorization of the code.&lt;/P&gt;</description>
      <pubDate>Sun, 03 Feb 2013 15:20:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Performance-of-sqrt/m-p/935711#M3581</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-02-03T15:20:46Z</dc:date>
    </item>
  </channel>
</rss>

