Performance of sqrt - Page 2

Christian_M_2 · ‎02-01-2013

Hello,

I am using the intrinsic for square root. I know from the Optimization manual that I could use the reciprocal square root and an approximation algorithm, but I need the accuracy.

The thing is that AVX shows no improvement over SSE. The Intrinsics Guide gave me some hints. Is it true that the square root operation is not pipelined for either SSE or AVX? At least the latency and throughput figures indicate this. AVX processes twice the data per operation, but with double the latency and half the throughput, the combined performance comes out the same. Is that so?
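The arithmetic behind that question can be sketched as follows. The cycle counts are placeholders chosen only to show the ratio, not figures taken from the Intrinsics Guide, and `cycles_per_element` is a hypothetical helper name.

```cpp
#include <cassert>

// Average cost of one element when instructions are issued back to back:
// reciprocal throughput (cycles per instruction) divided by lanes per instruction.
double cycles_per_element(double reciprocal_throughput_cycles, int lanes) {
    assert(lanes > 0);
    return reciprocal_throughput_cycles / lanes;
}

// Placeholder numbers (assumptions, not measurements): a 128-bit sqrt that can
// start every 10 cycles versus a 256-bit sqrt that can start every 20 cycles.
const double kSse4Floats = cycles_per_element(10.0, 4);  // 4 floats per instruction
const double kAvx8Floats = cycles_per_element(20.0, 8);  // 8 floats per instruction
```

Both constants come out to 2.5 cycles per element. If the wider instruction occupies the unit twice as long, doubling the vector width alone gives no speedup.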

My test system is an i5-2410M. In the Intrinsics Guide (I updated to the newest version) I only find latency and throughput for Sandy Bridge. Has the performance of these instructions improved in Ivy Bridge? Could anyone explain the CPUID(s) a little bit? Does 06_2A mean Sandy Bridge? Does it cover all Sandy Bridge CPUs (regardless of desktop or mobile, or i3, i5, i7)?

For CPUID(s) I found: http://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers

Does the intrinsics guide refer to a combination of family and model number? What about model numbers not mentioned in the intrinsics guide like Ivy Bridge?

SergeyKostrov · ‎02-03-2013

>>>>These imprecise operations are available via Intel compiler options...

That is correct. However, from my point of view and experience, a more flexible way to control precision is to set it at run time.

>>
>>Wow, this information is quite new. I did not know one could control accuracy...

Please take a look at the _control87 CRT function.

Note: We recently had a very good discussion regarding precision issues. If interested, please take a look at:

Forum topic: Mathimf and windows
Web-link: http://software.intel.com/en-us/forums/topic/357759

SergeyKostrov · ‎02-03-2013

Sorry, I forgot to specify the forum's name...

>>Note: We recently had a very good discussion regarding precision issues and, if interested, please take a look at:
>>
>>Forum topic: Mathimf and windows

It is in the Intel C++ compiler forum.

SergeyKostrov · ‎02-03-2013

>>...Ivy Bridge should have 06_3A. But I can't find it in the Intrinsics guide for any instructions (I have not checked
>>every but those that are important for me)...

Christian,

Please take a look at Table 3-18, Highest CPUID Source Operand for Intel 64 and IA-32 Processors (page 212), in:

Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z
Order Number: 325383-044US
August 2012

SergeyKostrov · ‎02-04-2013

>>...Ivy Bridge should have 06_3A. But I can't find it in the Intrinsics guide for any instructions (I have not checked
>>every but those that are important for me)...

This is what my CPUID test case displays:
...
CPU Brand String: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz
CPU Vendor: GenuineIntel
Stepping ID = 9
Model = 10
Family = 6
Extended Model = 3
...
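The fields printed above combine into the 06_3A notation: family 6, and a model whose high nibble is the extended model (3) and whose low nibble is the model (10, i.e. 0xA). A minimal sketch of that rule as described in Intel's CPUID documentation; `CpuSignature` and `decode_signature` are hypothetical names, and the EAX value in the usage note is assembled from the listed fields rather than read from hardware.

```cpp
#include <cstdint>

struct CpuSignature {
    unsigned family;
    unsigned model;
    unsigned stepping;
};

// Decode the processor signature that CPUID leaf 1 returns in EAX.
CpuSignature decode_signature(std::uint32_t eax) {
    const unsigned stepping    = eax & 0xFu;
    const unsigned base_model  = (eax >> 4) & 0xFu;
    const unsigned base_family = (eax >> 8) & 0xFu;
    const unsigned ext_model   = (eax >> 16) & 0xFu;
    const unsigned ext_family  = (eax >> 20) & 0xFFu;

    CpuSignature sig;
    // The extended family is added only when the base family is 0xF.
    sig.family = (base_family == 0xFu) ? base_family + ext_family : base_family;
    // The extended model supplies the high nibble for families 0x6 and 0xF.
    sig.model = (base_family == 0x6u || base_family == 0xFu)
                    ? ((ext_model << 4) | base_model)
                    : base_model;
    sig.stepping = stepping;
    return sig;
}
```

With the fields listed above packed as 0x000306A9, `decode_signature` yields family 0x6, model 0x3A, stepping 9. The same packing with extended model 2 gives model 0x2A, the Sandy Bridge value asked about earlier in the thread.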

SergeyKostrov · ‎02-04-2013

>>>>...Let me know if you need real performance numbers for different sqrt functions and floating-point types...
>>>>
>>This would be great! I am especially interested in the performance of the precise square root operation. Different CPUs would be
>>a good indicator...

In general, all tests are based on the following for loop:
...
int iNumberOfIterations = 16777216; // 2^24

g_uiTicksStart = ::GetTickCount();
for( int t = 0; t < iNumberOfIterations; t++ )
{
    ...
}
g_uiTicksEnd = ::GetTickCount();
printf( RTU(" - %d ticks\n"), ( int )( g_uiTicksEnd - g_uiTicksStart ) );
...
built with the Microsoft C++ compiler, in Debug and Release configurations, and without any optimizations.
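GetTickCount typically advances in steps of roughly 10 to 16 ms, so short runs span only a few timer steps. A portable variant of the timing loop can be sketched as follows. It swaps GetTickCount for std::chrono::steady_clock and uses std::sqrt as a stand-in for the elided loop body. `time_sqrt_loop_ms` and its parameters are hypothetical names, and the volatile read is there so an optimizing compiler cannot hoist the sqrt out of the loop.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

// Times `iterations` calls of std::sqrt and returns the elapsed milliseconds.
// The sum of the results is written to *sum_out so the work has a visible effect.
double time_sqrt_loop_ms(int iterations, double input, double* sum_out) {
    assert(iterations > 0 && sum_out != nullptr);
    volatile double in = input;  // re-read on every iteration; blocks hoisting
    double sum = 0.0;

    const auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < iterations; ++t) {
        sum += std::sqrt(in);
    }
    const auto end = std::chrono::steady_clock::now();

    *sum_out = sum;
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

For the setup in this thread it would be called as `time_sqrt_loop_ms(1 << 24, 625.0, &sum)`.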

SergeyKostrov · ‎02-04-2013

[ Microsoft C++ compiler / Debug configurations ]

CPU: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 296 ticks
625.000^0.5 = 25.000
User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 281 ticks
625.000^0.5 = 25.000
CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 577 ticks
625.000^0.5 = 25.000
CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 593 ticks
625.000^0.5 = 25.000
HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 593 ticks
625.000^0.5 = 25.000
SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 343 ticks
625.000^0.5 = 25.000
F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 3011 ticks
625.000^0.5 = 25.000

CPU: Intel(R) Pentium(R) 4 CPU 1.60GHz

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 984 ticks
625.000^0.5 = 25.000
User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 969 ticks
625.000^0.5 = 25.000
CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 2422 ticks
625.000^0.5 = 25.000
CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 2500 ticks
625.000^0.5 = 25.000
HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 2672 ticks
625.000^0.5 = 25.000
SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 1406 ticks
625.000^0.5 = 25.000
F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 11187 ticks
625.000^0.5 = 25.000

SergeyKostrov · ‎02-04-2013

[ Microsoft C++ compiler / Release configurations ]

CPU: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 281 ticks
625.000^0.5 = 25.000
User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 297 ticks
625.000^0.5 = 25.000
CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000
CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000
HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000
SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 318 ticks
625.000^0.5 = 25.000
F32vec4 class
Calculating the Square Root of 625.000 - 406 ticks
625.000^0.5 = 25.000

CPU: Intel(R) Pentium(R) 4 CPU 1.60GHz

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 985 ticks
625.000^0.5 = 25.000
User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 969 ticks
625.000^0.5 = 25.000
CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 406 ticks
625.000^0.5 = 25.000
CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 406 ticks
625.000^0.5 = 25.000
HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 406 ticks
625.000^0.5 = 25.000
SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 1422 ticks
625.000^0.5 = 25.000
F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 1953 ticks
625.000^0.5 = 25.000

SergeyKostrov · ‎02-04-2013

>>...I wonder whether the results also differ within a CPU family...

I can't verify it. However, when it comes to precision, if 53-bit precision is set then the results must be the same for all CPUs.

TimP · ‎02-04-2013

In the particular case where your operands can be expressed exactly in 12 bits of precision, it seems that your accuracy doesn't vary among these methods. Accuracy of the reciprocal sqrt approximation varies between AMD CPU families, but I think Intel tried to keep it the same.

If you wished to test accuracy of sqrt without going through an exhaustive list of cases, you could try something like the Paranoia benchmark.

The earliest AMD families had a 14-bit approximation which would be sufficient to obtain 52 bits after 2 iterations; this has been considered at Intel but I don't know of it ever being adopted.
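The bit counts follow from Newton-Raphson roughly doubling the number of correct bits per step: 14, then about 27, then about 54. A minimal sketch of that refinement is given below. The 14-bit hardware estimate is simulated by perturbing the library result, which is an assumption made for illustration and not how any CPU produces its estimate. `rsqrt_estimate_14bit`, `refine_rsqrt` and `rsqrt_relative_error` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Stand-in for a hardware estimate: 1/sqrt(x) with a relative error of 2^-14.
double rsqrt_estimate_14bit(double x) {
    assert(x > 0.0);
    return (1.0 / std::sqrt(x)) * (1.0 + std::ldexp(1.0, -14));
}

// One Newton-Raphson step for y ~ 1/sqrt(x): y' = y * (1.5 - 0.5 * x * y * y).
// A relative error e in y becomes about 1.5 * e * e in y'.
double refine_rsqrt(double x, double y) {
    return y * (1.5 - 0.5 * x * y * y);
}

// Relative error of an estimate y, measured against the library result.
double rsqrt_relative_error(double x, double y) {
    const double exact = 1.0 / std::sqrt(x);
    return std::fabs(y - exact) / exact;
}
```

After two calls to `refine_rsqrt` the estimate is at the limit of double precision, and sqrt(x) is then obtained as x * y. Whether the final result is correctly rounded in every case is a separate question that this sketch does not address.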

Bernard · ‎02-04-2013

Thanks for posting sqrt(x) test case.

What is this sqrt(x) implementation "User Sqrt - RTfloat"?

Do you have results for SSE sqrt(x) where x = double primitive type?

SergeyKostrov · ‎02-04-2013

>>What is this sqrt(x) implementation "User Sqrt - RTfloat"?..

It is based on a classic iterative method and I'll provide more details later.

>>Do you have results for SSE sqrt(x) where x = double primitive type?..

No. If you decide to test it you will need to use:

__m128d _mm_sqrt_pd( __m128d )

Note: It is the same as the SQRTPD instruction.
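A minimal sketch of what the body of such a double-precision test could look like, assuming an x86 compiler with SSE2 available. `sqrt_pairs` is a hypothetical helper, and the scalar fallback is there only so the sketch also builds where SSE2 is not enabled.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2: __m128d, _mm_sqrt_pd
#define SKETCH_HAVE_SSE2 1
#endif

// Computes out[i] = sqrt(in[i]) two doubles at a time; count must be even.
void sqrt_pairs(const double* in, double* out, std::size_t count) {
    assert(count % 2 == 0);
#ifdef SKETCH_HAVE_SSE2
    for (std::size_t i = 0; i < count; i += 2) {
        const __m128d v = _mm_loadu_pd(in + i);  // unaligned load of 2 doubles
        _mm_storeu_pd(out + i, _mm_sqrt_pd(v));  // SQRTPD, then unaligned store
    }
#else
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = std::sqrt(in[i]);
    }
#endif
}
```

The unaligned load and store keep the sketch free of alignment requirements; a benchmark would more likely use aligned buffers with `_mm_load_pd` and `_mm_store_pd`.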

SergeyKostrov · ‎02-04-2013

Hi everybody, the next three test results demonstrate what the latest version of the Intel C++ compiler can do...

SergeyKostrov · ‎02-04-2013

[ Intel C++ compiler 13.0.089 / 32-bit / Release Configuration ]

CPU: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz
Optimization: Maximize Speed (/O2)
Code Generation: Add Processor-Optimized Code Path: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QaxAVX)
Intel Processor-Specific Optimization: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QxAVX)
Floating Point Model: Precise (/fp:precise)

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 265 ticks
625.000^0.5 = 25.000
User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 203 ticks
625.000^0.5 = 25.000
CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000
CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000
HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000
SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 47 ticks
625.000^0.5 = 25.000
F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 47 ticks
625.000^0.5 = 25.000

Note 1: 47 ticks for 2^24 iterations!
Note 2: 1 sec is 1000 ticks.

SergeyKostrov · ‎02-04-2013

[ Intel C++ compiler 13.0.089 / 32-bit / Release Configuration ]

CPU: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz
Optimization: Maximize Speed (/O2)
Code Generation: Add Processor-Optimized Code Path: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QaxAVX)
Intel Processor-Specific Optimization: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QxAVX)
Floating Point Model: Fast (/fp:fast)

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 140 ticks
625.000^0.5 = 25.000
User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 188 ticks
625.000^0.5 = 25.000
CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000
CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000
HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000
SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 47 ticks
625.000^0.5 = 25.000
F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 47 ticks
625.000^0.5 = 25.000

SergeyKostrov · ‎02-04-2013

[ Intel C++ compiler 13.0.089 / 32-bit / Release Configuration ]

CPU: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz
Optimization: Maximize Speed (/O2)
Code Generation: Add Processor-Optimized Code Path: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QaxAVX)
Intel Processor-Specific Optimization: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QxAVX)
Floating Point Model: Fast=2 (/fp:fast=2) [Intel C++]

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 140 ticks
625.000^0.5 = 25.000
User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 187 ticks
625.000^0.5 = 25.000
CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000
CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000
HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000
SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 47 ticks
625.000^0.5 = 25.000
F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 46 ticks
625.000^0.5 = 25.000

SergeyKostrov · ‎02-04-2013

>>>>What is this sqrt(x) implementation "User Sqrt - RTfloat"?..
>>
>>It is based on a classic iterative method and I'll provide more details later.

Here it is:
...
RTint iNumberOfIterations = _RTNUMBER_OF_TESTS_0016777216; // 2^24

// Sub-Test 1 - User Sqrt - RTfloat
{
    CrtPrintf( RTU("User Sqrt - RTfloat\n") );

    RTfloat fA = 625.00f; // value whose square root is computed
    RTfloat fG = 625.00f; // current guess; not reset between outer iterations
    RTfloat fQ = 0.0f;    // quotient fA / fG

    CrtPrintf( RTU("Calculating the Square Root of %.3f"), fA );

    g_uiTicksStart = SysGetTickCount();
    for( RTint t = 0; t < iNumberOfIterations; t++ )
    {
        fQ = 0.0f;
        while( RTtrue )
        {
            if( ( fQ - fG ) > -0.00001f ) // quotient and guess agree closely enough
                break;
            fQ = fA / fG;
            fG = ( 0.5f * fG + 0.5f * fQ ); // average of guess and quotient
        }
    }
    CrtPrintf( RTU(" - %ld ticks\n"), ( RTint )( SysGetTickCount() - g_uiTicksStart ) );
    CrtPrintf( RTU("%.3f^0.5 = %.3f\n"), fA, fG );
}
...
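Stripped of the timing harness, the iteration above is the Newton (Babylonian) update g = (g + a/g) / 2. A minimal standalone sketch with the guess reset on every call follows. `newton_sqrt`, the relative convergence test and the iteration cap are choices made here for illustration and are not part of the posted test.

```cpp
#include <cassert>
#include <cmath>

// Newton's method for sqrt(a), a > 0, starting from the guess g = a.
float newton_sqrt(float a) {
    assert(a > 0.0f);
    float g = a;
    for (int i = 0; i < 64; ++i) {          // cap guards against non-termination
        const float q = a / g;              // q and g lie on opposite sides of sqrt(a)
        const float next = 0.5f * (g + q);  // average of guess and quotient
        if (std::fabs(next - g) <= 1e-6f * next) {
            return next;                    // guess stopped moving
        }
        g = next;
    }
    return g;
}
```

Because the posted loop keeps fG from one outer iteration to the next, after the first pass it starts from an already converged guess and exits after roughly one step. Resetting the guess per call, as here, measures the full iteration instead.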

Bernard · ‎02-04-2013

>>>SSE Sqrt - RTfloat Calculating the Square Root of 625.000 - 47 ticks 625.000^0.5 = 25.000>>>

I am curious which of the sqrt calculation methods the hardware-accelerated SSE instruction uses.

SergeyKostrov · ‎02-05-2013

>>...which of the sqrt calculation methods does hardware accelerated SSE instruction use?

The last two, SSE Sqrt and F32vec4 class.

Some time later I will evaluate the AVX sqrt intrinsic functions:
...
/*
 * Square Root of Double-Precision Floating-Point Values
 * **** VSQRTPD ymm1, ymm2/m256
 * Performs an SIMD computation of the square roots of the two or four packed
 * double-precision floating-point values in the source operand and stores
 * the packed double-precision floating-point results in the destination
 */
extern __m256d __cdecl _mm256_sqrt_pd(__m256d a);

/*
 * Square Root of Single-Precision Floating-Point Values
 * **** VSQRTPS ymm1, ymm2/m256
 * Performs an SIMD computation of the square roots of the eight packed
 * single-precision floating-point values in the source operand and stores
 * the packed single-precision floating-point results in the destination
 */
extern __m256 __cdecl _mm256_sqrt_ps(__m256 a);
...
I'd also like to verify Christian's statement '...The thing is that AVX shows no improvement over SSE...'
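A minimal sketch of how the 256-bit double-precision intrinsic could be exercised, assuming the compiler is invoked with AVX enabled (for example /arch:AVX or -mavx) so that `__AVX__` is defined; otherwise the scalar loop handles every element. `sqrt_array` is a hypothetical helper and not part of the test suite discussed in this thread.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

#if defined(__AVX__)
#include <immintrin.h>  // AVX: __m256d, _mm256_sqrt_pd
#endif

// Computes out[i] = sqrt(in[i]); four doubles per step where AVX is enabled.
void sqrt_array(const double* in, double* out, std::size_t count) {
    assert(in != nullptr && out != nullptr);
    std::size_t i = 0;
#if defined(__AVX__)
    for (; i + 4 <= count; i += 4) {
        const __m256d v = _mm256_loadu_pd(in + i);     // unaligned load of 4 doubles
        _mm256_storeu_pd(out + i, _mm256_sqrt_pd(v));  // VSQRTPD ymm, then store
    }
#endif
    for (; i < count; ++i) {  // remaining elements, or all of them without AVX
        out[i] = std::sqrt(in[i]);
    }
}
```

Timing this against a 128-bit version over the same buffer is one way to check the "no improvement over SSE" observation on a given CPU.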

SergeyKostrov · ‎02-05-2013

Christian,

Have you seen the roadmap picture on the Wiki: en.wikipedia.org/wiki/Ivy_Bridge_(microarchitecture)#Roadmap ?

>>...Has performance of this commands improved in Ivy Bridge?

As I promised, I'll do a verification and the results will be posted (unfortunately, only for Ivy Bridge).

Bernard · ‎02-05-2013

>>>The last two, SSE Sqrt and F32vec4 class. I will do an evaluation of AVX sqrt-intrinsic functions:>>>

It seems that I formulated my question wrongly. I wanted to ask which of the mathematical algorithms used to calculate sqrt is implemented in hardware/microcode by the SSE sqrt instructions. I have found this paper: "Fast Floating Point Square Root".