Hello,
I am using the intrinsic for square root. I know from the Optimization manual that I could use the reciprocal square root plus an approximation algorithm, but I need the accuracy.
The thing is that AVX shows no improvement over SSE. The Intrinsics Guide gave me some hints. Is it true that the square root operation is not pipelined, for both SSE and AVX? At least the latency and throughput figures indicate this. AVX handles twice the data per operation, but with double the latency and half the throughput; combined, doesn't that mean the same performance?
My test system is an i5-2410M. In the Intrinsics Guide (I updated to the newest version) I only find latency and throughput for Sandy Bridge. Has the performance of these instructions improved on Ivy Bridge? Could anyone explain the CPUID(s) a little? Does 06_2A mean Sandy Bridge, and does it cover all Sandy Bridge CPUs (regardless of desktop or mobile, i3, i5, or i7)?
For CPUID(s) I found: http://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers
Does the Intrinsics Guide refer to a combination of family and model number? What about model numbers not listed in the Intrinsics Guide, such as Ivy Bridge?
>>>Iliya, do you have a computer with a CPU that supports AVX? I need an independent verification of my test results, and I really have a lot of questions for Christian with regard to his statement:>>>
Sorry Sergey, but I still have only a Core i3. I can run your tests for SSE verification only.
>>...which of the mathematical algorithms used to calculate sqrt is implemented in hardware/microcode by >> the SSE sqrt instructions...
>>>I won't be surprised if it is a highly optimized version of the Newton-Raphson square root algorithm, and it would be nice to hear from Intel software engineers.>>>
Yes, I thought the same. Looking at the algorithm, one can see that it performs a costly (for the hardware) division in every iteration, so I think Intel engineers probably optimized this part of the algorithm.
>>>CrtSqrt - RTdouble Calculating the Square Root of 625.000 - 94 ticks 625.000^0.5 = 25.000>>>
An interesting case is the CRT sqrt function, which is slower than its SSE and AVX counterparts. I suppose that, when disassembled, it calls the x87 fsqrt instruction, which itself has a latency of 10-24 core clock cycles (as reported by Agner Fog's tables). It would be nice to test the FSQRT accuracy against the AVX VSQRTPD result. FSQRT can use long double precision for the intermediate calculation stage in order to diminish rounding errors and preserve the accuracy of the result. The longer execution time of the library sqrt function is probably due to the additional C code that wraps the FSQRT instruction and performs input checking.
@Sergey, can you force the compiler to inline calls to the CRT sqrt function?
Sergey Kostrov wrote:
>>...I was wondering how did you get a 6x improvement in speed of execution...
Do you want me to post a test case for SSE and AVX sqrt intrinsics?
No thanks. I looked at your explanation and understood how it was calculated.
Sorry for the late answer,
and thanks for all the input. As to the Wikipedia link: does this mean Ivy Bridge is just Sandy Bridge with a shrunk process (and maybe some minor improvements)?
On the tests: I read the answers in the thread again, in more detail. But I gather you got AVX to be a lot faster than SSE. Should I post my test code? Or should I run your code on my machine? Maybe my test system with the Sandy Bridge i5-2410M is slower?
In addition, my test results are quite stable: I tested VS2010, VS2012, and the Intel compiler (a week ago, so not the newest version; I think there was an update in the last few days). In all of them I get a speedup of 2 for SSE and AVX when double is used, and 4 for floats. I always used the standard Release configuration, 32-bit and 64-bit, plus two other configurations with /arch:AVX for the MS compiler and /QxAVX and /QaxAVX for the Intel compiler.
Or is it because of my time measurement? I use the function clock() and compute the difference of two variables with static_cast<double>(End - Start) / static_cast<double>(CLOCKS_PER_SEC).
My normal version:

    for (size_t k = 0; k < mInput.size(); k++)
    {
        mResult[k] = std::sqrt(mInput[k]);
    }

My SSE version:

    size_t const incr = 128 / (8 * sizeof(double));
    for (size_t k = 0; k < mInput.size(); k += incr)
    {
        __m128d val = _mm_loadu_pd(&mInput[k]);
        val = _mm_sqrt_pd(val);
        _mm_storeu_pd(&mResult[k], val);
    }

My AVX version:

    size_t const incr = 256 / (8 * sizeof(double));
    for (size_t k = 0; k < mInput.size(); k += incr)
    {
        __m256d val = _mm256_loadu_pd(&mInput[k]);
        val = _mm256_sqrt_pd(val);
        _mm256_storeu_pd(&mResult[k], val);
    }
Additional information:
size of input vector: 5000000
repeating test: 5000
I summed up the times of all test repetitions. Some additional code is used to avoid "wrong" compiler optimization: in the beginning, the code was nearly optimized away because I did not use the result data.
So, one more post from me in a row:
I also checked the disassembly of my code, but there was nothing unexpected.
Regarding the CRT sqrt function: simply call std::sqrt, debug it, and go to the disassembly window. Now single-step; this way you can get into the CRT assembler code. There is a lot of input checking done before sqrt is called.
Exactly what code is generated depends on the settings. VS2010, for example, generates FPU instructions for 32-bit and scalar SSE for 64-bit. The Intel compiler generates scalar SSE for both 32-bit and 64-bit.
It looks like you may have made the test big enough to exceed cache, so it might be expected to be bandwidth limited.
Did you take care to make at least the store aligned? If the loads are unaligned, and you don't have at least corei7-3, splitting them explicitly into 128-bit loads is expected to be faster (although maybe not worth the effort if you want readable intrinsics).
As you commented earlier, now that you have shown code excerpts, there is nothing here to produce better performance with AVX on current platforms.
>>>Regarding the CRT sqrt function: simply call std::sqrt, debug it, and go to the disassembly window. Now single-step; this way you can get into the CRT assembler code. There is a lot of input checking done before sqrt is called>>>
Thanks for your advice, but I prefer to work with the IDA Pro disassembler and WinDbg.
>>>Exactly what code is generated depends on the settings. VS2010, for example, generates FPU instructions for 32-bit and scalar SSE for 64-bit. The Intel compiler generates scalar SSE for both 32-bit and 64-bit>>>
That's true.