Hello,
I am using the square-root intrinsic. I know from the Optimization Manual that I could use the reciprocal square root together with an approximation algorithm, but I need the full accuracy.
The thing is that AVX shows no improvement over SSE. The Intrinsics Guide gave me some hints. Is it true that the square-root operation is not pipelined, for both SSE and AVX? At least the latency and throughput figures indicate this. AVX handles twice the data per operation, but with double the latency and half the throughput, doesn't it all combine to the same performance? Is it so?
My test system is an i5-2410M. In the Intrinsics Guide (I updated to the newest version) I only find latency and throughput for Sandy Bridge. Has the performance of these instructions improved in Ivy Bridge? Could anyone explain the CPUID(s) a little? Does 06_2A mean Sandy Bridge or not? And does this cover all Sandy Bridge CPUs, regardless of Desktop or Mobile, or i3, i5, i7?
For the CPUID(s) I found: http://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers
Does the Intrinsics Guide refer to a combination of family and model number? And what about model numbers not mentioned in the Intrinsics Guide, such as Ivy Bridge?
>>>What code is generated differs with some settings. VS2010, for example, generates FPU instructions for 32-bit and scalar SSE for 64-bit; the Intel compiler generates scalar SSE for both 32-bit and 64-bit.>>>
At least since VS2005 there has been the /arch:SSE2 option. VS2012 adds the /arch:AVX option and limited automatic use of parallel SSE2 or AVX instructions.
Thanks Tim
TimP (Intel) wrote:
It looks like you may have made the test big enough to exceed cache, so it might be expected to be bandwidth limited.
Did you take care to make at least the store aligned? If the loads are unaligned, and you don't have at least corei7-3, splitting them explicitly into 128-bit loads is expected to be faster (although maybe not worth the effort if you want readable intrinsics).
As you commented earlier, now that you have shown code excerpts, there is nothing here to produce better performance with AVX on current platforms.
I tried other data sizes, too; this did not have much influence. I reduced the data amount to 1 MB (a third of the 3 MB L3 cache). There AVX got 3% more performance than SSE.
Sorry, I forgot to mention: all data in all tests was aligned to 32 bytes. I used a user-defined allocator to ensure this for the STL vector container.
Would using prefetch increase performance a little? I think I saw an example of that here in the forum some time ago.
iliyapolak wrote:
>>>And to the CRT sqrt function. Simply call std::sqrt, debug it and go to the disassembly window. Now do single steps. This way you can get into CRT assembler code. There is a lot of input checking done before sqrt is called>>>
Thanks for your advice, but I prefer to work with the IDA Pro disassembler and WinDbg.
I will check that tool, but I think it is not free.
Sergey Kostrov wrote:
Thanks for the feedback and the test case!
>>...Should I test my test code? Or let me run your code on my machine? Maybe my test system with Sandy Bridge
>>i5-2410M is slower? I don't try to compromise your results and I simply would like to see that Intel's AVX provides advantages over SSE.
I think the best way to proceed is a consolidated set of tests ( SSE vs. AVX ) in a Visual Studio project and I'll take care of it. I'll upload the project as soon as it is ready.
I do not want to compromise your tests either. It is just interesting that our results differ that much. At first I was surprised about my results; later I found the link I already provided. I am doing research for a student research project, so I wanted to make sure my results agree with other people's results.
With an FIR filter, for example, I get quite a good speedup from AVX: twice as fast for double and nearly twice as fast for float. But as soon as sqrt is involved in the algorithm, AVX does not gain more than about 10%.
Please let me know if you opened another thread for the test you mentioned, or post the link here.
Maybe the CPU documentation gives some hints. One processor might have lower latency than another for the same instruction.
// EDIT:
I could not find instruction latency information when I go to my processor's page and then to the datasheets. There are Volume 1 and 2, but neither of them mentions anything like instruction latency. I only found some information at http://www.agner.org/optimize/instruction_tables.pdf, but Ivy Bridge is not listed there.
// EDIT: Google is your friend; I think this is quite helpful: http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf
Here I get latency and throughput for different CPU families. And, importantly: for AVX, Ivy Bridge has nearly twice the throughput of Sandy Bridge for division and square root.
Sergey Kostrov wrote:
Here are a couple of more links & tips:
- You need to look at Intel 64 and IA-32 Architectures Optimization Reference Manual, APPENDIX C, INSTRUCTION LATENCY AND THROUGHPUT
- Try to use msinfo32.exe utility ( it provides some CPU information )
- http://ark.intel.com -> http://ark.intel.com/products/52224/Intel-Core-i5-2410M-Processor-3M-Cac...
Note: Take a look at a datasheet for your i5-2410M CPU in a Quick Links section ( on the right side of the web page )
I tried to find it in the datasheets. There are Volume 1 and 2, but I could not find latency information or anything related to instructions.
If you have your project ready, please let me know. I will help you integrate my code, and we can run the test on different platforms.
The prefetch idea sounds interesting. As soon as I have time, I will try to integrate that, too.
It is still strange that you get a speedup of 6, meaning AVX is six times faster; that is far away from my results. But we tested on different architectures: Ivy Bridge doubles the throughput of this instruction and decreases its latency.
Hello Sergey,
it's ok. I am really interested in your test strategy. I hope you see this message; lately my posts have often got stuck in the spam filter. I think you missed my last post, too.
Kind regards,
Christian
Hi
Thanks for posting the SQRT test-case source code. IIRC, a few months ago one of the forum users advised not to call a tested function with a constant value; the value should be pseudo-random within the proper input range. It would be interesting to see how such an approach affects execution speed.
It might take one or two more days to add this; the days have too few hours.
But to give a short overview: I create a vector and fill it with random data. This part is not included in the time measurement. Then I operate on the vector. After taking the time, I pick a random item and store it in a volatile variable. This way the compiler does not optimize anything away, even in the release configuration.
Would it be ok to move to a Visual Studio 2010 project? I work with VS2010 and VS2012 and have no 2008 installed.
So here we go with the tests.
Unfortunately I see a problem. You use very high iteration counts, and currently I create one vector containing all elements and operate on it, and the same for the result. Maybe we should change this to a combination of a vector of a certain size and iterating over it, but then it is hard to get comparable results. Do you have an idea? We could create a vector with a size of 1 MB so it fits in L3 for sure, then operate on all elements and repeat this to reach the overall iteration count.
Feel free to make comments on the code.