Intel® ISA Extensions

Performance of sqrt

Christian_M_2
Beginner
6,043 Views

Hello,

I am using the intrinsic for square root. I know from the Optimization Manual that I could use the reciprocal square root with an approximation algorithm, but I need the accuracy.

The thing is that AVX shows no improvement over SSE. The Intrinsics Guide gave me some hints. Is it true that the square root operation is not pipelined, for both SSE and AVX? At least the latency and throughput figures indicate this. AVX processes twice the data per operation, but with double the latency and half the throughput, doesn't that all combine to the same performance? Is it so?
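For reference, here is a minimal sketch of the two variants being compared (assuming 32-byte-aligned float buffers whose length is a multiple of 8 and a compiler targeting AVX; the function names are only illustrative):

#include <cstddef>
#include <immintrin.h>

// SSE: sqrtps handles 4 floats per instruction.
void sqrt_sse(const float* in, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 4)
        _mm_store_ps(out + i, _mm_sqrt_ps(_mm_load_ps(in + i)));
}

// AVX: vsqrtps handles 8 floats per instruction.
void sqrt_avx(const float* in, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 8)
        _mm256_store_ps(out + i, _mm256_sqrt_ps(_mm256_load_ps(in + i)));
}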

My test system is an i5-2410M. In the Intrinsics Guide (I updated to the newest version) I only find latency and throughput for Sandy Bridge. Has the performance of these instructions improved in Ivy Bridge? Could anyone explain the CPUID(s) a little? Does 06_2A mean Sandy Bridge or not? And does it cover all Sandy Bridge CPUs (regardless of desktop or mobile, or i3, i5, i7)?

For CPUID(s) I found: http://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers

Does the Intrinsics Guide refer to a combination of family and model number? What about model numbers not mentioned in the Intrinsics Guide, like Ivy Bridge?

0 Kudos
101 Replies
TimP
Honored Contributor III
583 Views

>>>What code exactly differs with some settings? VS2010, for example, generates FPU instructions for 32-bit and scalar SSE for 64-bit. Intel generates scalar SSE for both 32- and 64-bit.>>>

At least since VS2005, there has been an /arch:SSE2 option.  VS2012 adds an /arch:AVX option and limited use of parallel SSE2 or AVX instructions.
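As a quick sanity check of which code path a given build actually targets, one can inspect the compilers' predefined macros. A sketch (the __AVX__, _M_IX86_FP and _M_X64 macros are documented for MSVC and the Intel compiler; the message strings are only illustrative):

#include <cstdio>

// Reports the instruction-set target selected via /arch.
int main()
{
#if defined(__AVX__)
    std::printf("Compiled with /arch:AVX or higher\n");
#elif defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    std::printf("Compiled with SSE2 code generation\n");
#else
    std::printf("x87 FPU code generation\n");
#endif
    return 0;
}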

0 Kudos
Bernard
Valued Contributor I
583 Views

Thanks Tim

0 Kudos
Christian_M_2
Beginner
583 Views

TimP (Intel) wrote:

It looks like you may have made the test big enough to exceed cache, so it might be expected to be bandwidth limited.

Did you take care to make at least the store aligned?  If the loads are unaligned, and you don't have at least corei7-3, splitting them explicitly into 128-bit loads is expected to be faster (although maybe not worth the effort if you want readable intrinsics).

As you commented earlier, now that you have shown code excerpts, there is nothing here to produce better performance with AVX on current platforms.
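For illustration, a sketch of the split-load idea Tim describes (assuming unaligned float input, 32-byte-aligned output, and n a multiple of 8; the function name is only an example):

#include <cstddef>
#include <immintrin.h>

// On Sandy Bridge, two 128-bit loads merged into one 256-bit register can be
// faster than a single unaligned 256-bit load.
void sqrt_avx_split_loads(const float* in, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 8)
    {
        __m256 v = _mm256_castps128_ps256(_mm_loadu_ps(in + i));
        v = _mm256_insertf128_ps(v, _mm_loadu_ps(in + i + 4), 1);
        _mm256_store_ps(out + i, _mm256_sqrt_ps(v));
    }
}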

I tried other data sizes, too; this did not have much influence. I reduced the data amount to 1 MB (which is a third of the 3 MB L3 cache). There, AVX got 3% more performance than SSE.

Sorry, I forgot to mention: all data in all tests has been aligned to 32 bytes. I used a user-defined allocator to ensure this for the STL vector container.
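For illustration, a minimal sketch of such an allocator (assuming _mm_malloc/_mm_free; the class name is only an example, and a fully C++03-conforming allocator would also need construct/destroy, max_size and the pointer/reference typedefs):

#include <cstddef>
#include <new>
#include <vector>
#include <xmmintrin.h>   // _mm_malloc / _mm_free

// 32-byte-aligned allocator so the std::vector storage is AVX-friendly.
template <typename T, std::size_t Alignment = 32>
struct aligned_allocator
{
    typedef T value_type;
    template <typename U> struct rebind { typedef aligned_allocator<U, Alignment> other; };

    aligned_allocator() {}
    template <typename U> aligned_allocator(const aligned_allocator<U, Alignment>&) {}

    T* allocate(std::size_t n)
    {
        void* p = _mm_malloc(n * sizeof(T), Alignment);
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { _mm_free(p); }
};

// Usage: std::vector<float, aligned_allocator<float> > data(count);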

Would using prefetch increase performance a little? I think I saw an example of that here in the forum some time ago.

0 Kudos
Christian_M_2
Beginner
583 Views

iliyapolak wrote:

>>>And regarding the CRT sqrt function: simply call std::sqrt, debug it, and go to the disassembly window. Now single-step. This way you can get into the CRT assembler code. There is a lot of input checking done before sqrt is called.>>>

Thanks for your advice, but I prefer to work with the IDA Pro disassembler and WinDbg.

I will check that tool, but I think it is not free.

0 Kudos
Christian_M_2
Beginner
583 Views

Sergey Kostrov wrote:

Thanks for the feedback and the test case!

>>...Should I test my test code? Or let me run your code in my machine? Maybe my test system with Sandy Bridge
>>i5-2410M is slower?

I am not trying to compromise your results; I simply would like to see that Intel's AVX provides advantages over SSE.

I think the best way to proceed is a consolidated set of tests ( SSE vs. AVX ) in a Visual Studio project and I'll take care of it. I'll upload the project as soon as it is ready.

I do not want to compromise your tests either. It is just interesting that our results differ that much. At first I was surprised by my results; later I found the link I already provided. I am doing this as part of a student research project, so I just wanted to make sure my results agree with other people's results.

With an FIR filter, for example, I get quite a good speedup from AVX: twice as fast for double and nearly twice as fast for float. But as soon as sqrt is involved in the algorithm, AVX might not gain more than 10% in speed.
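To illustrate why the FIR case vectorizes so well, here is a rough sketch of an AVX inner loop (float version; the names and the padding/alignment assumptions are mine, not the actual project code):

#include <immintrin.h>

// Straightforward AVX FIR: 8 output samples per iteration.
// Assumes 32-byte-aligned output and enough input padding past n + taps.
void fir_avx(const float* x, const float* coeff, int taps, float* y, int n)
{
    for (int i = 0; i + 8 <= n; i += 8)
    {
        __m256 acc = _mm256_setzero_ps();
        for (int k = 0; k < taps; ++k)
        {
            __m256 xi = _mm256_loadu_ps(x + i + k);          // unaligned: sliding window
            __m256 ck = _mm256_broadcast_ss(coeff + k);      // splat one coefficient
            acc = _mm256_add_ps(acc, _mm256_mul_ps(xi, ck)); // multiply-accumulate
        }
        _mm256_store_ps(y + i, acc);
    }
}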

Please let me know if you open another thread for the test you mentioned, or post the link here.

Maybe the CPU documentation gives some hints. One processor might have lower latency than another for the same instruction.

// EDIT:
I could not find instruction latency information when I go to my processor page and then to the datasheets. There are Volume 1 and 2, but neither of them mentions anything like instruction latency. I found some information at http://www.agner.org/optimize/instruction_tables.pdf, but Ivy Bridge is not listed there.

// EDIT: Google is your friend, I think this is quite helpful: http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf

Here I get latency and throughput for different CPU families. And what is important: for AVX, Ivy Bridge has nearly twice the throughput for division and square root compared to Sandy Bridge.

0 Kudos
Christian_M_2
Beginner
583 Views

Sergey Kostrov wrote:

Here are a couple of more links & tips:

- You need to look at Intel 64 and IA-32 Architectures Optimization Reference Manual, APPENDIX C, INSTRUCTION LATENCY AND THROUGHPUT

- Try to use msinfo32.exe utility ( it provides some CPU information )

- http://ark.intel.com -> http://ark.intel.com/products/52224/Intel-Core-i5-2410M-Processor-3M-Cac...

Note: Take a look at a datasheet for your i5-2410M CPU in a Quick Links section ( on the right side of the web page )

- http://software.intel.com/en-us/forums/topic/278742

I tried to find it in the datasheets. There are Volume 1 and 2, but I could not find latency information or anything related to instructions.

0 Kudos
Bernard
Valued Contributor I
583 Views
>>>I will check that tool. But I think it is not free.>>>

The full version of IDA is not free, but you can download a stripped-down version which is free. WinDbg is free.
0 Kudos
SergeyKostrov
Valued Contributor II
583 Views
>>I tried other data sizes, too; this did not have much influence. I reduced the data amount to 1 MB (which is a third of the 3 MB L3 cache). There, AVX got 3% more performance than SSE.

Thanks for the note, Christian. It matches my results. Since you have a system with a Sandy Bridge CPU and I have a system with an Ivy Bridge CPU, the throughput ratio is 1 to 2. Your 3% result has to be multiplied by 2 and we get 6, that is, a performance improvement of 6%. Is that wrong?

>>...Would using prefetch increase performance a little? I think I saw an example of that here in the forum some time ago.

Yes, especially for data sets greater than 64 KB, and you will need to re-implement your main for loop. There are lots of posts on the IDZ forums related to that subject; just enter _mm_prefetch in the search control. Please take a look at one of the recent threads related to prefetching (there are some codes and test data): software.intel.com/en-us/forums/topic/352880.
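A sketch of what such a prefetching loop might look like (the 512-byte prefetch distance is only a starting point and would need tuning on the target CPU; the function name is illustrative):

#include <cstddef>
#include <immintrin.h>

// sqrt over a large, 32-byte-aligned float array with a software prefetch
// a few cache lines ahead of the current position.
void sqrt_avx_prefetch(const float* in, float* out, std::size_t n)
{
    const std::size_t ahead = 128;   // elements: 128 * 4 bytes = 512 bytes ahead
    for (std::size_t i = 0; i + 8 <= n; i += 8)
    {
        if (i + ahead < n)
            _mm_prefetch(reinterpret_cast<const char*>(in + i + ahead), _MM_HINT_T0);
        _mm256_store_ps(out + i, _mm256_sqrt_ps(_mm256_load_ps(in + i)));
    }
}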
0 Kudos
SergeyKostrov
Valued Contributor II
583 Views
>>...I think the best way to proceed is a consolidated set of tests ( SSE vs. AVX ) in a Visual Studio project...

Christian, I hope to upload a project with my tests on Monday. Then you will need to add your tests with STL containers, of course as soon as you have time. When everything is ready (sources tuned, etc.), a set of new tests on our systems can be done and the new results posted.
0 Kudos
Bernard
Valued Contributor I
583 Views
>>>With an FIR filter, for example, I get quite a good speedup from AVX: twice as fast for double and nearly twice as fast for float>>>

This could be due to easily vectorized code and the wider registers.
0 Kudos
Christian_M_2
Beginner
583 Views

Sergey,

if you have your project ready, please let me know. I will help you integrate my code and we can run the tests on different platforms.

The prefetch idea sounds interesting. As soon as I have time, I will try to integrate that, too.

It is still strange that you get a speedup of 6, which would mean AVX is 6 times faster. This is far from my results, but we tested on different architectures: Ivy Bridge doubles the throughput for this instruction and decreases the latency.

0 Kudos
SergeyKostrov
Valued Contributor II
583 Views
>>...It is still strange that you get a speedup of 6, which would mean AVX is 6 times faster. This is far from my results, but we tested on different architectures: Ivy Bridge doubles the throughput for this instruction and decreases the latency.

Hi Christian,

Our "test strategies" are different (and that's OK!) and you will see that in the test project soon. I'm not going to over-complicate it and will keep it as simple as possible.

Best regards,
Sergey

PS: Sorry for the delay (too many different things for Monday...)
0 Kudos
Christian_M_2
Beginner
583 Views

Hello Sergey,

it's OK. I am really interested in your test strategy. I hope you see this message; lately my posts have often got stuck in the spam filter. I think you missed my last post, too.

Kind regards,

Christian

0 Kudos
SergeyKostrov
Valued Contributor II
583 Views
Hi everybody,

Please find attached a Visual Studio 2008 project (Professional Edition) with the Intel C++ compiler set by default. Tests for three sqrt functions are currently implemented:

- CRT sqrt
- SSE sqrt (intrinsic)
- AVX sqrt (intrinsic)

Christian, please add your STL code (as soon as you have time) and upload the project for tests. Thanks in advance, and let me know if you have any issues or questions. You're free to modify and improve the code.

Best regards,
Sergey
0 Kudos
Bernard
Valued Contributor I
583 Views

Hi

Thanks for posting the SQRT test case source code. IIRC, a few months ago one of the forum users advised not to call a tested function with a constant value; the value should be pseudo-random within the proper input range. It would be interesting to see how such an approach affects the speed of execution.

0 Kudos
SergeyKostrov
Valued Contributor II
583 Views
>>...a few months ago one of the forum users advised not to call a tested function with a constant value...

The iteration counter t could also be used. A call to the rand CRT function would create additional overhead unless it is done before the main for loop. Anyway, it is not a problem to test it and see whether the results are different.
0 Kudos
Bernard
Valued Contributor I
583 Views
>>>A call to the rand CRT function would create additional overhead unless it is done before the main for loop>>>

Calling rand and srand from within the for loop is not advisable because of the function-call overhead. Also, casting the loop counter will incur some overhead.
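As a sketch, the usual way around this is to generate the pseudo-random inputs into a buffer before the timed loop and only read from it during the measurement (the seed and value range here are arbitrary):

#include <cstddef>
#include <cstdlib>
#include <vector>

// Fill the inputs up front so rand() is never called inside the measured loop.
std::vector<float> make_inputs(std::size_t count)
{
    std::srand(12345u);                                       // fixed seed for repeatable runs
    std::vector<float> v(count);
    for (std::size_t i = 0; i < count; ++i)
        v[i] = static_cast<float>(std::rand()) / RAND_MAX;    // values in [0, 1]
    return v;
}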
0 Kudos
Christian_M_2
Beginner
589 Views

It might take one or two more days to add this; the days have too few hours.

But to give a short overview: I create a vector and fill it with random data; this part is not included in the time measurement. Then I operate on it. After taking the time, I pick a random item and store it in a volatile variable. This way the compiler does not optimize anything away, even in the release configuration.
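Roughly, the pattern looks like this (a sketch only; the timer calls are placeholders and the AVX loop stands in for whichever kernel is being measured):

#include <cstddef>
#include <cstdlib>
#include <vector>
#include <immintrin.h>

// Fill outside the timed region, process, then read one random element into a
// volatile sink so the optimizer cannot remove the work.
void run_sqrt_benchmark(std::size_t count)
{
    std::vector<float> in(count), out(count);
    for (std::size_t i = 0; i < count; ++i)
        in[i] = static_cast<float>(std::rand()) / RAND_MAX;    // not timed

    // start timer here (e.g. QueryPerformanceCounter)
    for (std::size_t i = 0; i + 8 <= count; i += 8)
        _mm256_storeu_ps(&out[i], _mm256_sqrt_ps(_mm256_loadu_ps(&in[i])));
    // stop timer here

    volatile float sink = out[std::rand() % count];            // defeat dead-code elimination
    (void)sink;
}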

Would it be OK to move to a Visual Studio 2010 project? I work with VS2010 and VS2012 and have no VS2008 installed.

0 Kudos
SergeyKostrov
Valued Contributor II
589 Views
>>...It might take one or two more days to add this; the days have too few hours.

Thanks for the update.

>>...Would it be OK to move to a Visual Studio 2010 project? I work with VS2010 and VS2012 and have no VS2008 installed?..

Yes. I will "port" the project back to Visual Studio 2008 (the opposite case...).
0 Kudos
SergeyKostrov
Valued Contributor II
589 Views
>>...But I think you will get AVX to be a lot faster than SSE...

As of today I have two test cases, and you have the second one in the VS2008 project attached a couple of days ago. I see a 3x difference on the Ivy Bridge system (the application was compiled with Intel C++ Compiler XE 2013), and I do not confirm a 6x difference between SSE2 and AVX sqrt calculations. Also, I consider the second case better implemented. As Tim noted, you could have some negative impact related to cache lines, and I think you need to use VTune to analyze your processing on Sandy Bridge.
0 Kudos
Christian_M_2
Beginner
589 Views

So here we go with the tests.

Unfortunately I see a problem. You use very high iteration counts, whereas I create a vector containing all elements and operate on it (the same for the result). Maybe we should change this to a combination of a vector of a certain size and iterating over it, but then it is hard to get comparable results. Do you have an idea? We could create a vector with a size of 1 MB so it fits in L3 for sure, then operate on all elements and repeat this to reach the overall iteration count.

Feel free to make comments on the code.

0 Kudos