Intel® ISA Extensions

Performance of sqrt

Christian_M_2
Beginner
5,238 Views

Hello,

I am using the intrinsic for square root. I know from the Optimization manual that I could use the reciprocal square root plus an approximation algorithm, but I need the accuracy.

The thing is that AVX shows no improvement over SSE. The Intrinsics Guide gave me some hints. Is it true that the square root operation is not pipelined, for both SSE and AVX? At least the latency and throughput figures indicate this. I mean, AVX handles twice the data per operation, but with double the latency and half the throughput, does all that combine to the same performance? Is it so?

My test system is an i5-2410M. In the Intrinsics Guide (I updated to the newest version) I only find latency and throughput for Sandy Bridge. Has the performance of these instructions improved in Ivy Bridge? Could anyone explain the CPUID(s) a little bit? Does 06_2A mean Sandy Bridge, or does it not? Does it cover all Sandy Bridge CPUs (regardless of desktop or mobile, or i3, i5, i7)?

For CPUID(s) I found: http://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers

Does the Intrinsics Guide refer to a combination of family and model number? And what about model numbers not mentioned in the Intrinsics Guide, like Ivy Bridge?

0 Kudos
101 Replies
Bernard
Valued Contributor I
3,312 Views

 >>>Could anyone explain the CPUID(s) a little bit? 06_2A means Sandy Bridge or does it >>>

That means 32 nm Sandy Bridge microarchitecture.

Please look at this link, which is more related to the speed of execution (a comparison between SSE sqrt(x) and rsqrt(x) multiplied by x):

http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x

Christian_M_2
Beginner
3,312 Views

iliyapolak wrote:

 >>>Could anyone explain the CPUID(s) a little bit? 06_2A means Sandy Bridge or does it >>>

That means 32 nm Sandy Bridge microarchitecture.

This already brings me closer.

But what about Ivy Bridge and other unmentioned CPUID(s)? Does anybody have some tips?

Bernard
Valued Contributor I
3,312 Views

>>>The thing is that AVX shows no improvement over SSE>>>

Maybe the microcode implementation of the sqrt algorithm is exactly the same when the AVX and SSE instructions are compared.

Christian_M_2
Beginner
3,312 Views

I think the AVX sqrt implementation just invokes the SSE implementation for the lower and upper halves of the YMM register, since the latency is doubled for double the amount of data.

But I am not sure whether this holds for all Sandy Bridge CPUs, or only because I am testing on a mid-range mobile Sandy Bridge.

TimP
Honored Contributor III
3,312 Views

As Christian hinted, the hardware implementation of IEEE divide and sqrt on Sandy and Ivy Bridge sequences the operands into AVX-128 pairs, so it's likely there is little performance gain for AVX-256 vs. SSE/SSE2 or AVX-128.  Ivy Bridge greatly reduces the latency so there may not be an incentive to employ the reciprocal and iteration option. That is termed a throughput optimization for single/float precision, in that it opens up opportunity for instruction level parallelism in a loop which has significant work in instructions other than divide/sqrt.

Bernard
Valued Contributor I
3,312 Views

@Tim

Is it possible to obtain information about the exact algorithm used to calculate sqrt values on Intel CPUs?

SergeyKostrov
Valued Contributor II
3,312 Views
Christian, Let me know if you need real performance numbers for different sqrt functions and floating-point types ( 6 tests in total / 5 different C++ compilers ). I can do it for Intel Core i7-3840QM ( Ivy Bridge / 4 cores ) and older CPUs, for example Intel Pentium 4.
SergeyKostrov
Valued Contributor II
3,312 Views
>>...As latency is doubled for double data amount...

In SSE, performance numbers are almost the same for the following test cases:

[ Test-case 1 ]

RTfloat fA = 625.0f;
mmValue.m128_f32[0] = ( RTfloat )fA;
mmValue.m128_f32[1] = ( RTfloat )0.0f; // since this is 0.0 sqrt shouldn't be calculated
mmValue.m128_f32[2] = ( RTfloat )0.0f; // since this is 0.0 sqrt shouldn't be calculated
mmValue.m128_f32[3] = ( RTfloat )0.0f; // since this is 0.0 sqrt shouldn't be calculated
mmResult = _mm_sqrt_ps( mmValue );

[ Test-case 2 ]

RTfloat fA = 625.0f;
mmValue.m128_f32[0] = ( RTfloat )fA;
mmValue.m128_f32[1] = ( RTfloat )fA;
mmValue.m128_f32[2] = ( RTfloat )fA;
mmValue.m128_f32[3] = ( RTfloat )fA;
mmResult = _mm_sqrt_ps( mmValue );
SergeyKostrov
Valued Contributor II
3,312 Views
>>...06_2A means Sandy Bridge or does it not?..

I'll take a look. In general, you need to get more detailed information like:

CPU Brand String: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
CPU Vendor      : GenuineIntel
Stepping ID     = 2
Model           = 12
Family          = 6
Extended Model  = 1

and then "map" these numbers to the codes in the manual.
Christian_M_2
Beginner
3,312 Views

TimP (Intel) wrote:

As Christian hinted, the hardware implementation of IEEE divide and sqrt on Sandy and Ivy Bridge sequences the operands into AVX-128 pairs, so it's likely there is little performance gain for AVX-256 vs. SSE/SSE2 or AVX-128.  Ivy Bridge greatly reduces the latency so there may not be an incentive to employ the reciprocal and iteration option. That is termed a throughput optimization for single/float precision, in that it opens up opportunity for instruction level parallelism in a loop which has significant work in instructions other than divide/sqrt.

The thing about Ivy Bridge is really interesting. With add and mul, Sandy Bridge already allows quite good instruction-level parallelism. If one result is not directly based on the operations before it, one can fill the pipeline very well and get nearly one result per clock, I suppose.

Can one also find the Ivy Bridge optimizations in the Intrinsics Guide? I cannot find the appropriate CPUID. If 06_2A is Sandy Bridge, then according to the table from http://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers, Ivy Bridge should be 06_3A. But I can't find it in the Intrinsics Guide for any instruction (I have not checked every one, but those that are important for me).

Sergey Kostrov wrote:

Christian,

Let me know if you need real performance numbers for different sqrt functions and floating-point types ( 6 tests in total / 5 different C++ compilers ). I can do it for Intel Core i7-3840QM ( Ivy Bridge / 4 cores ) and older CPUs, for example Intel Pentium 4.

This would be great! I am especially interested in the performance of the precise square root operation. Different CPUs would be a good indicator. I wonder whether the results also differ within a CPU family.

You mentioned that I should "map these numbers to codes in the manual". Which manual are you talking about exactly?

SergeyKostrov
Valued Contributor II
3,312 Views
>>... Which manual are you talking about exactly?

Please take a look at: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
SergeyKostrov
Valued Contributor II
3,312 Views
Here are a couple more links & tips:

- Look at the Intel 64 and IA-32 Architectures Optimization Reference Manual, Appendix C, "Instruction Latency and Throughput".
- Try the msinfo32.exe utility ( it provides some CPU information ).
- http://ark.intel.com -> http://ark.intel.com/products/52224/Intel-Core-i5-2410M-Processor-3M-Cache-up-to-2_90-GHz?q=i5-2410M
  Note: Take a look at the datasheet for your i5-2410M CPU in the Quick Links section ( on the right side of the web page ).
- http://software.intel.com/en-us/forums/topic/278742
Bernard
Valued Contributor I
3,312 Views

>>>I am especially interested on the performance of the precise square root operation>>>

Here you have a very interesting discussion about hardware-accelerated sqrt calculation:

http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x

Bernard
Valued Contributor I
3,312 Views

  >>>With add and mul Sandy Bridge already allows quite good instruction level parallelism>>>

Sandy Bridge really improved instruction-level parallelism by adding one or two new ports to the execution cluster. So, for example, when your code has an fp add (one vector addition) and an fp mul (one vector multiplication) that are not interdependent, they can be executed simultaneously.

TimP
Honored Contributor III
3,312 Views

">>>I am especially interested on the performance of the precise square root operation.>>>

Follow this link : http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slowe..."

These reduced-precision operations are available via Intel compiler options:

/Qimf-accuracy-bits:bits[:funclist]
          define the relative error, measured by the number of correct bits,
          for math library function results
            bits     - a positive, floating-point number
            funclist - optional comma separated list of one or more math
                       library functions to which the attribute should be
                       applied

So you can request the 13-bit accuracy implementation of divide and sqrt. Iterative methods with less than full precision can be produced by requesting 20-, 40-, or 49-bit accuracy.  22-bit accuracy is the default for single-precision vectorization; -Qprec-div -Qprec-sqrt (implied by /fp:source|precise) changes the default to 24-/53-bit accuracy.  Beginning with Harpertown, the IEEE instructions, referred to as "native" in your references, have been quite competitive for SSE/SSE2.   The original Core 2 Duo with the slower divide and sqrt is no longer in production.  I turned mine in after 4.5 years rather than re-install Windows a 4th time.

The x87 divide and sqrt also support a trade-off between speed and precision, by setting 24-, 53- (default for Intel and Microsoft compilers) or 64- (hardware default, /Qpc80) bit precision mode.

You also have the choice, since SSE, of gradual underflow (/Qftz-) to maintain precision in the presence of partial underflow.  Sandy Bridge removes the performance penalty for /Qftz- in most common situations.  This was done in part because it's not convenient to set abrupt underflow when using Microsoft or gnu compilers.

All these options are more than most developers are willing to bargain for (and QA test).  That's one of the reasons for availability of IEEE standard compliant instructions and for progress at the hardware level in making them more efficient.

Christian_M_2
Beginner
3,312 Views

>>>Follow this link : http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slowe...

Quite an interesting discussion; it provides a lot of information.

I found the following discussion about square root and AVX: http://stackoverflow.com/questions/8924729/using-avx-intrinsics-instead-of-sse-does-not-improve-speed-why

One answer mentions something about instruction emulation. Is it true that a low-end processor (let's take a Sandy Bridge i3) has different or fewer execution units than a Sandy Bridge i7?

Christian_M_2
Beginner
3,312 Views

>>>These imprecise operations are available via Intel compiler options ...

Wow, this information is quite new to me. I did not know one could control the accuracy.

>>> Beginning with Harpertown, the IEEE instructions, referred to as "native" in your references, have been quite competitive for SSE/SSE2.

So that was the first generation where the IEEE-compliant instructions provided quite good speed compared to the approximate SSE/SSE2 alternatives?

And regarding x87: I found that some compilers use only the x87 FPU in 32-bit mode, while the same code compiled for 64-bit mode uses SSE (only the scalar version). Is this also something that can be controlled? For some algorithms, high accuracy might be useful. The x87 FPU provides the most precision with 80 bits; this can no longer be achieved with SSE.

Bernard
Valued Contributor I
3,312 Views

>>>For some algorithms high accuracy might be useful. x87 fpu provides most precision with 80 bit. This can not be achieved with SSE any more.>>>

Yes, because it is the developer's decision and/or the project's constraints that favor precision over vectorization of the code.

Bernard
Valued Contributor I
2,402 Views

>>>Is it true that low end processor (lets take an i3 Sandy Bridge) has other execution units or less than an i7 Sandy Bridge?>>>

I'm not sure whether a Core i3 has fewer execution units than a Core i7. I think the main differences are in cache size, TDP, the number of physical and logical cores (HT), and more aggressive overclocking.
