Intel® ISA Extensions

_mm_rsqrt_ps and intel architecture code analyzer

Ravi_Managuli
Beginner
2,630 Views
I have written code to compute a square root using _mm_rsqrt_ps. When I analyze it with iaca -arch nehalem, it shows _mm_rsqrt_ps executing on port 1, while most places I have looked (including the Intel 64 and IA-32 Architectures Optimization Reference Manual) say sqrt is on port 0. Is this right? Is there a document that explicitly lists which port each SIMD instruction is assigned to?
Thanks
0 Kudos
19 Replies
Brijender_B_Intel
2,630 Views
I think you are computing the reciprocal of the square root, not the square root. sqrt and rsqrt are two different instructions. The iaca tool is right about the port binding: I believe rsqrt goes on port 1, and, as you read in the documentation, sqrt goes on port 0.
I don't think there is a complete list of instruction-to-port mappings anywhere.
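To make the sqrt/rsqrt distinction concrete, here is a minimal sketch in C. The poster's code is not shown at this point in the thread, so the function names below are hypothetical and this is only one common way to build a square root from _mm_rsqrt_ps: multiply x by the reciprocal-square-root estimate, optionally after one Newton-Raphson refinement. It assumes strictly positive inputs; for x = 0 the product 0 * inf yields NaN.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Full-precision square root of four floats (the sqrtps instruction). */
static __m128 sqrt_exact(__m128 x)
{
    return _mm_sqrt_ps(x);
}

/* Hypothetical helper: sqrt(x) built from the approximate reciprocal
   square root (the rsqrtps instruction), refined by one Newton-Raphson
   step: y1 = y0 * (1.5 - 0.5 * x * y0 * y0), then sqrt(x) ~= x * y1.
   Assumes x > 0 in every lane. */
static __m128 sqrt_via_rsqrt(__m128 x)
{
    __m128 y0     = _mm_rsqrt_ps(x);                     /* rough 1/sqrt(x) */
    __m128 half_x = _mm_mul_ps(_mm_set1_ps(0.5f), x);
    __m128 y0_sq  = _mm_mul_ps(y0, y0);
    __m128 corr   = _mm_sub_ps(_mm_set1_ps(1.5f),
                               _mm_mul_ps(half_x, y0_sq));
    __m128 y1     = _mm_mul_ps(y0, corr);                /* refined 1/sqrt(x) */
    return _mm_mul_ps(x, y1);                            /* x * 1/sqrt(x) */
}
```

The second function replaces one sqrtps with an rsqrtps plus several multiplies and a subtract, which is why its port usage differs from what the manual lists for SQRT.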
0 Kudos
Ravi_Managuli
Beginner
2,630 Views
Thanks.
Yes, I am doing rsqrt. In the Intel optimization reference manual (Table 2-6, page 2-26), I only see FP_ADD on port 1. All other FP operations are listed under port 0 (including DIV and SQRT). Should we assume rsqrt is on port 1 because it is implemented internally using FP additions?
-ravi
0 Kudos
Brijender_B_Intel
2,630 Views
No. I wouldn't assume that.
0 Kudos
Ravi_Managuli
Beginner
2,630 Views
So how did you know that it belongs to port 1? I could not locate it in the document.
Thanks
0 Kudos
Brijender_B_Intel
2,630 Views
"iaca" told you, didn't it? Or do you not believe the tool?
0 Kudos
Ravi_Managuli
Beginner
2,630 Views
Yes, iaca told me. But it conflicted with what was in the document. That is what I am trying to clarify and confirm. The real performance I get does not match the performance iaca predicts. Maybe it is a compiler issue (the compiler may not be optimizing the code very well).
If iaca is accurate, then I have no problem following it faithfully.
0 Kudos
Brijender_B_Intel
2,630 Views

Hi Ravi,
As I said in my previous post, the document is incomplete: it does not list all the instructions, and it was not written to list all this information because it can change from processor to processor. The tool gives the complete listing.
Regarding your performance gap, you may need to look into it in more detail. If you are comfortable sharing the code, you can post it here and someone can tell you why there is a performance gap.
Try a different compiler, maybe the Intel compiler if you are not already using it. That will show whether the gap is due to the compiler.

0 Kudos
Ravi_Managuli
Beginner
2,630 Views
Thanks. I am clear now regarding iaca. I am using the Intel compiler (10.2). I will try the new one.
0 Kudos
Ravi_Managuli
Beginner
2,630 Views
Here is the code doing the square root. It is just a simple function that computes square roots. Thanks for the help.
0 Kudos
Brijender_B_Intel
2,630 Views
Hi Ravi,
I just checked, and your code is getting very good performance. I liked the way you unrolled it. What was your expectation? What are you getting?
0 Kudos
Ravi_Managuli
Beginner
2,630 Views
Thanks.
My estimate, based on instruction count (using IACA), was a throughput of 22 cycles for 16 sqrts (the loop).
The performance I get is a throughput of 42 cycles.
That is almost 2x slower. I am trying to understand why, and how I can improve it further. How much do you get?
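For readers without the attachment, a loop that handles 16 square roots per iteration might look roughly like the sketch below. This is not the poster's code: the function name, the use of unaligned loads, and the choice of x * rsqrt(x) without refinement are all assumptions made for illustration. The point is only the shape of the unrolling, with four independent 4-wide chains per iteration.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical sketch: approximate sqrt of n floats, 16 per iteration.
   Assumes n is a multiple of 16 and all inputs are > 0; any trailing
   elements beyond the last full group of 16 are left untouched. */
void approx_sqrt_unrolled(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i + 16 <= n; i += 16) {
        /* Four independent vectors, so the chains can overlap. */
        __m128 x0 = _mm_loadu_ps(in + i);
        __m128 x1 = _mm_loadu_ps(in + i + 4);
        __m128 x2 = _mm_loadu_ps(in + i + 8);
        __m128 x3 = _mm_loadu_ps(in + i + 12);

        /* sqrt(x) ~= x * (1/sqrt(x)), using the rsqrtps estimate. */
        __m128 r0 = _mm_mul_ps(x0, _mm_rsqrt_ps(x0));
        __m128 r1 = _mm_mul_ps(x1, _mm_rsqrt_ps(x1));
        __m128 r2 = _mm_mul_ps(x2, _mm_rsqrt_ps(x2));
        __m128 r3 = _mm_mul_ps(x3, _mm_rsqrt_ps(x3));

        _mm_storeu_ps(out + i,      r0);
        _mm_storeu_ps(out + i + 4,  r1);
        _mm_storeu_ps(out + i + 8,  r2);
        _mm_storeu_ps(out + i + 12, r3);
    }
}
```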
0 Kudos
Brijender_B_Intel
2,630 Views
I understand the IACA number, but how did you measure your 42 cycles? Can you please elaborate on that?
You are limited by port 0 and port 1.
0 Kudos
Ravi_Managuli
Beginner
2,630 Views
I measured time using #include "tbb/tick_count.h". This gives me the elapsed time in seconds.
Then I divided it by the number of square roots. I converted the time for 16 square roots into cycles (since I know the clock frequency of my processor, 2.67 GHz).
IACA is giving me 22 cycles for every 16 sqrts. I compared that figure to the cycle count obtained using the above method.
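The conversion described above can be written out as a small helper. This is a sketch of the arithmetic only; the function and parameter names are hypothetical, and it assumes the core actually runs at the nominal clock for the whole measurement, which frequency scaling can violate.

```c
/* Hypothetical helper: convert a wall-clock measurement into cycles
   per loop iteration.
     elapsed_seconds : measured time for the whole run
     clock_hz        : assumed core frequency (e.g. 2.67e9)
     total_sqrts     : number of square roots computed in the run
     sqrts_per_loop  : square roots per loop iteration (16 here) */
double cycles_per_loop(double elapsed_seconds, double clock_hz,
                       double total_sqrts, double sqrts_per_loop)
{
    double total_cycles = elapsed_seconds * clock_hz;
    double loops        = total_sqrts / sqrts_per_loop;
    return total_cycles / loops;
}
```

As a sanity check on the numbers in this thread, 42 cycles at 2.67 GHz is about 15.7 ns per group of 16 square roots, so the run has to be long enough for timer resolution and startup overhead to be negligible.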
0 Kudos
Brijender_B_Intel
2,630 Views
First, check that IACA is analyzing for the same architecture as your machine. It prints the architecture when it runs. If IACA is targeting a newer architecture than the one in your machine, its prediction will differ from the performance you measure.
Assuming those match, IACA is reporting the optimal throughput, but you can collect more from IACA with -analysis PERFORMANCE or -analysis DATA_DEPENDENCY.
Look for the instructions marked "CP"; these are the instructions on the critical path.
These analyses print the latency for each port at the beginning of the report, which gives you a rough idea of how many cycles one loop iteration takes.
Second, looking at your code, your performance is limited by port 1 and also port 0. If you can break that dependency or choose different instructions, you may get better performance. You may also need to look at the generated assembly: since you are using a lot of registers, there is a chance of register spills to the stack, which add delay. You can avoid that by reusing registers that are already defined. Compilers usually take care of this, but sometimes the compiler is not clear about the scope and keeps a register alive a little longer than necessary.
0 Kudos
Ravi_Managuli
Beginner
2,630 Views
Thanks. I am using Visual Studio 10 with the Intel compiler. How can I view the generated assembly in this setup?
0 Kudos
Brijender_B_Intel
2,630 Views
You need to compile the individual file with the /FA or /FAcs setting in the output section of the project properties. It will put the assembly files or .cod file in your release/debug folder. Open those files and you can see whether the instructions start with "V" or not.
The other option is SDE, if you have it installed. It has a tool called xed which dumps the disassembly. I believe it also shows whether an instruction is AVX or SSE.
0 Kudos
Ravi_Managuli
Beginner
2,630 Views
I downloaded the Intel 11.1 compiler to run AVX as you suggested. I am not able to figure out how to compile in Visual Studio 2008 with AVX instructions. Where should I set the flag to use the AVX instruction set?

If I set /arch:AVX it says AVX architecture not found.

Should I set /QxAVX compiler option? How should I set it?

Thanks
0 Kudos
TimP
Honored Contributor III
2,630 Views
In Intel specific optimization properties you should have available /QxAVX. /arch doesn't include Intel specific optimizations; only /arch:SSE2 and SSE3 are available. /arch:AVX is planned but may not be fully implemented yet.
0 Kudos
themidimann
Beginner
2,630 Views
Thanks for covering this guys. I just got home and sat down trying to figure this out, and found this thread. Great stuff. Thanks again.

0 Kudos