Intel® ISA Extensions

Performance of sqrt

Christian_M_2
Beginner
5,701 Views

Hello,

I am using the intrinsic for square root. I know from the Optimization Manual that I could use the reciprocal square root plus an approximation algorithm, but I need the accuracy.

The thing is that AVX shows no improvement over SSE. The Intrinsics Guide gave me some hints. Is it true that the square root operation is not pipelined for either SSE or AVX? At least the latency and throughput figures indicate this. I mean, AVX processes twice the data per operation, but with double the latency and half the throughput, does everything combined come out to the same performance? Is that so?

My test system is an i5-2410M. In the Intrinsics Guide (I updated to the newest version) I only find latency and throughput for Sandy Bridge. Has the performance of these instructions improved in Ivy Bridge? Could anyone explain the CPUID(s) a little bit? Does 06_2A mean Sandy Bridge? And does this apply to all Sandy Bridge CPUs (regardless of Desktop or Mobile, or i3, i5, i7)?

For CPUID(s) I found: http://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers

Does the intrinsics guide refer to a combination of family and model number? What about model numbers not mentioned in the intrinsics guide like Ivy Bridge?

0 Kudos
101 Replies
SergeyKostrov
Valued Contributor II
542 Views
>>...which of the mathematical algorithms used to calculate sqrt is implemented in hardware/microcode by
>>the SSE sqrt instructions...

I won't be surprised if it is a highly optimized version of the Newton-Raphson square root algorithm, and it would be nice to hear from Intel software engineers.

>>...As I promised I'll do a verification and results will be posted ( unfortunately, only for Ivy Bridge )...

Iliya, do you have a computer with a CPU that supports AVX? I need an independent verification of my test results, and I really have lots of questions for Christian with regard to his statement:

>>...The thing is that AVX shows no improvement over SSE...
0 Kudos
SergeyKostrov
Valued Contributor II
542 Views
>>...The thing is that AVX shows no improvement over SSE...

Christian, how did you come to that conclusion? Could you follow up, please? My test results show that AVX-based sqrt is ~6x faster than SSE-based sqrt.
0 Kudos
Bernard
Valued Contributor I
542 Views

>>>Iliya, do you have a computer with a CPU that support AVX? I need an independent verification of my test results and I really have lots of questions to Christian with regard to his statement:>>>

Sorry Sergey, but I still have only a Core i3. I can run your tests for SSE verification only.

0 Kudos
Bernard
Valued Contributor I
542 Views

>>...which of the mathematical algorithms used to calculated sqrt is implemented in hardware/microcode by >> the SSE sqrt instructions...

>>>I won't be surprised if it is a highly optimized version of Newton-Raphson Square Root algorithm and it would be nice to hear from Intel software engineers.>>>

Yes, I thought the same. By looking at the algorithm one can see that it implements a costly (for the hardware) division on every iteration, so I think that Intel engineers probably optimized this part of the algorithm.

>>>CrtSqrt - RTdouble Calculating the Square Root of 625.000 - 94 ticks 625.000^0.5 = 25.000>>>

An interesting case is the CRT sqrt function, which is slower than its SSE and AVX counterparts. I suppose that, when disassembled, it calls the x87 FSQRT instruction, which itself has a latency of 10-24 core clock cycles (as reported by Agner's tables). It would be nice to test the FSQRT accuracy against the AVX VSQRTPD result. FSQRT can use long double precision for the intermediate calculation stage in order to diminish rounding errors and preserve the accuracy of the result. The longer execution time of the library sqrt function is probably due to additional C code that wraps the FSQRT instruction and performs input checking.

@Sergey can you force compiler to inline calls to CRT sqrt function?

0 Kudos
SergeyKostrov
Valued Contributor II
542 Views
>>...CRT sqrt function which is slower than SSE and AVX counterparts. I suppose when disassembled it calls fsqrt x87 instruction
>>which itself has the latency of 10-24 core clock cycles (as reported by Agner tables).

There are two issues: a call overhead ( parameters verifications, etc ) and it could be dependent ( possibly ) on a setting of the _set_SSE2_enable function ( I didn't verify it ).

>>It would be nice to test the FSQRT accuracy against the AVX VSQRTPD result.

Yes, but this is another set of tests and I won't have time for it.

>>FSQRT can use long double precision types for intermediate calculation stage in order to diminish rounding errors and
>>to preserve accuracy of the result.

HrtSqrt is actually based on it.

>>Longer execution time of Library sqrt function is probably due to additional C code which wraps FSQRT instruction and
>>performs an input checking.

Yes, and this is what I called '...a call overhead...' before.
0 Kudos
SergeyKostrov
Valued Contributor II
542 Views
>>...can you force compiler to inline calls to CRT sqrt function?..

Yes, but it won't improve performance significantly (!) since '...parameters verifications, etc...' must be done anyway inside the testing for loop.
0 Kudos
SergeyKostrov
Valued Contributor II
542 Views
>>...My test results show that AVX-based sqrt is ~6x faster than SSE-based sqrt.

Here are the results:

[ Intel C++ compiler 13.0.089 / 32-bit / Release Configuration ]

CPU: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz
Optimization: Maximize Speed (/O2)
Code Generation: Add Processor-Optimized Code Path: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QaxAVX)
Intel Processor-Specific Optimization: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QxAVX)
Floating Point Model: Fast=2 (/fp:fast=2) [Intel C++]

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 140 ticks
625.000^0.5 = 25.000

User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 188 ticks
625.000^0.5 = 25.000

CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000

CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000

HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000

SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 47 ticks ( to calculate square roots for 4 floats )
625.000^0.5 = 25.000

F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 47 ticks ( to calculate square roots for 4 floats )
625.000^0.5 = 25.000

AVX Sqrt - RTfloat
Calculating the Square Root of 625.000 - 15 ticks ( to calculate square roots for 8 floats )
625.000^0.5 = 25.000
0 Kudos
SergeyKostrov
Valued Contributor II
542 Views
>>...My test results show that AVX-based sqrt is ~6x faster than SSE-based sqrt.
>>...
>>SSE Sqrt - RTfloat
>>Calculating the Square Root of 625.000 - 47 ticks ( to calculate square roots for 4 floats )
>>625.000^0.5 = 25.000
>>...
>>AVX Sqrt - RTfloat
>>Calculating the Square Root of 625.000 - 15 ticks ( to calculate square roots for 8 floats )
>>625.000^0.5 = 25.000

This is how I've done the assessment:
- The normalization factor is 2 = 8 ( floats ) / 4 ( floats ).
- Then, ( 47 ( ticks ) / 15 ( ticks ) ) * 2 ~= 6
0 Kudos
Bernard
Valued Contributor I
542 Views
>>>- Normalization factor is 2 = 8 ( floats ) / 4 ( floats ). - Then, ( 47 ( ticks ) / 15 ( ticks ) ) * 2 ~= 6>>>

Thanks for clarifying this. I was wondering how you got a 6x improvement in speed of execution.
0 Kudos
Bernard
Valued Contributor I
542 Views
Btw, the hardware-accelerated AVX and SSE sqrt implementations also perform input checking and validation, but it is done at the microcode/hardware level, and the latency is lower (as expected) when compared to the library sqrt (instruction decoding and dispatch to the execution units takes some time).
0 Kudos
Bernard
Valued Contributor I
542 Views
>>>There are two issues: a call overhead ( parameters verifications, etc )>>>

@Sergey It seems that my last message was improperly formatted. The post above is related to the quoted sentence.
0 Kudos
SergeyKostrov
Valued Contributor II
542 Views
>>...I was wondering how did you get a 6x improvement in speed of execution... Do you want me to post a test case for SSE and AVX sqrt intrinsics?
0 Kudos
Bernard
Valued Contributor I
542 Views

Sergey Kostrov wrote:

>>...I was wondering how did you get a 6x improvement in speed of execution...

Do you want me to post a test case for SSE and AVX sqrt intrinsics?

No thanks. I looked at your explanation and I understood how it was calculated.

0 Kudos
Christian_M_2
Beginner
542 Views

Sorry for the late answer,

thanks for all the input. As to the Wikipedia link: does this mean Ivy Bridge is essentially Sandy Bridge with a die shrink (and maybe some minor improvements)?

To the tests: I read the answers in the thread once again in more detail. It seems you get AVX to be a lot faster than SSE. Should I post my test code? Or could I run your code on my machine? Maybe my test system with the Sandy Bridge i5-2410M is just slower?

In addition, my test results are quite stable: I tested VS2010, VS2012 and the Intel Compiler (a week ago, so not the newest version; I think there was an update in the last few days). In all of them I get a speedup of 2 for SSE and AVX when double is used, and 4 for floats. I always used the standard Release configuration, 32-bit and 64-bit, and two other configurations with /arch:AVX for the MS compiler and /QxAVX and /QaxAVX for the Intel compiler.

Or is it because of my time measurement? I use the function clock() and calculate the difference of two variables with static_cast<double>(End - Start) / static_cast<double>(CLOCKS_PER_SEC).

My normal version:

    for (size_t k = 0; k < mInput.size(); k++)
    {
        mResult[k] = std::sqrt(mInput[k]);
    }

My SSE version:

    size_t const incr = 128 / (8 * sizeof(double));

    for (size_t k = 0; k < mInput.size(); k += incr)
    {
        __m128d val = _mm_loadu_pd(&mInput[k]);

        val = _mm_sqrt_pd(val);

        _mm_storeu_pd(&mResult[k], val);
    }

My AVX version:

    size_t const incr = 256 / (8 * sizeof(double));
    
    for (size_t k = 0; k < mInput.size(); k += incr)
    {
        __m256d val = _mm256_loadu_pd(&mInput[k]);

        val = _mm256_sqrt_pd(val);

        _mm256_storeu_pd(&mResult[k], val);
    }

0 Kudos
Christian_M_2
Beginner
542 Views

Additional information:

size of input vector: 5000000

repeating test: 5000

I summed up the times of all test repetitions. Some additional code is used to avoid "wrong" compiler optimization; in the beginning, the code was nearly optimized away because I did not use the result data.

0 Kudos
Christian_M_2
Beginner
542 Views

So one post more from me in a row:

I also checked the disassembly of my code, but there was nothing unexpected.

And as to the CRT sqrt function: simply call std::sqrt, debug it, and go to the disassembly window. Now do single steps. This way you can get into the CRT assembler code. There is a lot of input checking done before sqrt is called.

The exact code generated differs depending on the settings. VS2010, for example, generates FPU instructions for 32-bit and scalar SSE for 64-bit. Intel generates scalar SSE for both 32-bit and 64-bit.

0 Kudos
TimP
Honored Contributor III
542 Views

It looks like you may have made the test big enough to exceed cache, so it might be expected to be bandwidth limited.

Did you take care to make at least the store aligned? If the loads are unaligned and you don't have at least a Core i7-3xxx, splitting them explicitly into 128-bit loads is expected to be faster (although maybe not worth the effort if you want readable intrinsics).

As you commented earlier, and now that you have shown code excerpts, there is nothing here that would produce better performance with AVX on current platforms.

0 Kudos
SergeyKostrov
Valued Contributor II
542 Views
Thanks for the feedback and the test case!

>>...Should I post my test code? Or could I run your code on my machine? Maybe my test system with Sandy Bridge
>>i5-2410M is slower?

I am not trying to dispute your results; I simply would like to see that Intel's AVX provides advantages over SSE. I think the best way to proceed is a consolidated set of tests ( SSE vs. AVX ) in a Visual Studio project, and I'll take care of it. I'll upload the project as soon as it is ready.
0 Kudos
SergeyKostrov
Valued Contributor II
539 Views
>>...Some additional code is used to avoid "wrong" compiler optimization. In the beginning code was optimized nearly away as
>>I did not use result data...

I had the same problem and I'll create a new thread on the Intel C++ compiler forum some time later.
0 Kudos
Bernard
Valued Contributor I
539 Views

>>>And to the CRT sqrt function. Simply call std::sqrt, debug it and go to the disassembly window. Now do single steps. This way you can get into CRT assembler code. There is a lot of input checking done before sqrt is called>>>

Thanks for your advice, but I prefer to work with the IDA Pro disassembler and WinDbg.

0 Kudos
Bernard
Valued Contributor I
539 Views

>>>What code exactly differs from some settings. VS2010 for example generates FPU instructions for 32bit and skalar SSE for 64bit. Intel generate skalar SSE for both 32 and 64 bit>>>

That's true.

0 Kudos