I have written code that computes a square root using _mm_rsqrt_ps. When I analyze this code with iaca -arch nehalem, it shows _mm_rsqrt_ps executing on port 1, while most places I have seen (including the Intel 64 and IA-32 Architectures Optimization Reference Manual) say sqrt is on port 0. Is this right? Is there a document that explicitly lists which port each SIMD instruction is assigned to?
Thanks
19 Replies
I think you are computing the reciprocal square root, not the square root. sqrt and rsqrt are two different instructions. The iaca tool is right about the port binding: I believe rsqrt goes to port 1 and, as you read in the documentation, sqrt goes to port 0.
I don't think there is a complete list of instruction-to-port mappings anywhere.
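To make the distinction concrete, here is a minimal sketch in C, assuming an x86 target with SSE. The helper names (`sqrt_exact4`, `sqrt_via_rsqrt4`, `max_rel_diff4`) are hypothetical and not from the original code. It contrasts `_mm_sqrt_ps` (SQRTPS) with a square root built from `_mm_rsqrt_ps` (RSQRTPS) followed by a multiply, which is one common way to get sqrt out of rsqrt and may or may not match what the poster's loop does.

```c
#include <math.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_sqrt_ps, _mm_rsqrt_ps, _mm_mul_ps */

/* Packed square root of 4 floats using the SQRTPS instruction. */
static void sqrt_exact4(const float *in, float *out)
{
    _mm_storeu_ps(out, _mm_sqrt_ps(_mm_loadu_ps(in)));
}

/* Approximate square root of 4 floats via RSQRTPS: sqrt(x) ~= x * rsqrt(x).
 * RSQRTPS is only an approximation (documented relative error up to
 * 1.5 * 2^-12), and x == 0 yields 0 * inf = NaN, so inputs must be > 0. */
static void sqrt_via_rsqrt4(const float *in, float *out)
{
    __m128 x = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_mul_ps(x, _mm_rsqrt_ps(x)));
}

/* Largest relative difference between a reference and an approximation
 * over 4 elements. Reference values must be nonzero. */
static float max_rel_diff4(const float *ref, const float *approx)
{
    float worst = 0.0f;
    for (int i = 0; i < 4; ++i) {
        float d = fabsf(approx[i] - ref[i]) / ref[i];
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

Calling both helpers on the same positive inputs and comparing with `max_rel_diff4` shows that the rsqrt-based version is close but not bit-identical. It is a different instruction sequence from SQRTPS, so a different port assignment is not surprising.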
Thanks.
Yes, I am doing rsqrt. In the Intel optimization reference manual (Table 2-6, page 2-26), I only see FP_ADD on port 1; every other FP operation is listed under port 0 (including DIV and SQRT). Is it because rsqrt is implemented internally using FP additions that we should assume rsqrt is on port 1? -ravi
No. I wouldn't assume that.
So how did you know that it belongs to port 1? I could not find it in the document.
Thanks
"iaca" told you, didn't it? Or don't you believe the tool?
Yes, iaca told me. But it conflicted with what was in the document. That is what I am trying to clarify and confirm. The real performance I get does not match the performance iaca predicts. Maybe it is a compiler issue (the compiler cannot optimize the code very well).
If iaca is accurate, then I have no problem faithfully following it.
Hi Ravi,
As I said in my previous post, the document is incomplete: it does not list all the instructions, and it was not written to list all this information, since it can change from processor to processor. The tool gives the complete listing.
Regarding your performance gap, you may need to look into it in more detail. If you are comfortable sharing code, you can post it here and someone can tell you why there is a performance gap.
Try a different compiler, maybe the Intel compiler if you are not already using it. That will show whether the gap is due to the compiler.
Thanks. I am clear now regarding iaca. I am using the Intel compiler (10.2). I will try the new one.
Hi Ravi,
I just checked, and your code is getting very good performance. I liked the way you unrolled it. What was your expectation? What are you getting?
Thanks.
My estimate, based on instruction count (using IACA), was a throughput of 22 cycles for 16 sqrts (the loop).
The performance I get is a throughput of 42 cycles.
That is almost 2x slower. I am trying to understand how I can improve it further. How much do you get?
I understand the IACA number, but how did you measure the 42 cycles? Can you please elaborate on that?
You are port 0 and port 1 limited.
I measured time using #include "tbb/tick_count.h". This gives me the elapsed time in seconds.
Then I divided it by the number of square roots and converted the time for 16 square roots into cycles (since I know the clock frequency of my processor, 2.67 GHz).
IACA is giving me 22 cycles for every 16 sqrts. I compared that figure to the cycles obtained using the above method.
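The conversion described above can be sketched as follows, in C. The function name `cycles_per_group` and its parameters are hypothetical. The sketch assumes the core actually runs at the nominal clock frequency for the whole measurement; frequency scaling or turbo would skew the result.

```c
/* Convert an elapsed wall-clock time into cycles per group of square roots.
 *   seconds     : measured elapsed time for the whole run
 *   clock_hz    : assumed core frequency (e.g. 2.67e9 for 2.67 GHz)
 *   total_sqrts : number of square roots computed in the run
 *   group_size  : square roots per loop iteration (16 in this thread)
 * Returns the average number of cycles spent per group. */
static double cycles_per_group(double seconds, double clock_hz,
                               double total_sqrts, double group_size)
{
    double groups = total_sqrts / group_size;   /* loop iterations executed */
    return seconds * clock_hz / groups;         /* total cycles / iterations */
}
```

For example, a run of 16,000 square roots (1,000 groups) that takes 42,000 / 2.67e9 seconds works out to about 42 cycles per group, the figure being compared against IACA's 22.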
First, check that IACA is analyzing for the same architecture as your machine. It prints the architecture when you run it. If IACA is analyzing for a newer architecture, its prediction will differ from what you measure on your machine.
Assuming those match, IACA is telling you the optimal throughput, but you will want to collect more from IACA with -analysis PERFORMANCE or -analysis DATA_DEPENDENCY.
Look for the instructions marked "CP": these are the instructions on the critical path.
These analyses print the latency at the beginning for each port, which gives you a rough idea of how many cycles one loop iteration takes.
Secondly, looking at your code, your performance is limited by port 1 and also port 0. If you can somehow break that dependency, or choose different instructions, you may get better performance. You may also need to look at the generated assembly: as you are using a lot of registers, there may be register spills to the stack, which add more delay. You can avoid them by reusing registers that are already defined. Compilers usually take care of this, but sometimes the compiler is not clear about the scope and keeps a register alive a little longer than needed.
Thanks. I am using Visual Studio 10 with the Intel compiler. How can I check the generated assembly code in this setup?
You need to compile the individual file with the /FA or /FAcs setting in the output section of the project properties. It will put assembly files or a .cod file in your release/debug folder. Open those files and you can see whether the instructions start with "V" or not.
Another option is SDE, if you have it installed. It has a tool called xed which dumps disassembly. I believe it also shows whether an instruction is AVX or SSE.
I downloaded the Intel 11.1 compiler to run AVX as you suggested. I am not able to figure out how to compile with AVX instructions in Visual Studio 2008. Where should I set the flag to use the AVX instruction set?
If I set /arch:AVX, it says the AVX architecture is not found.
Should I set the /QxAVX compiler option? How should I set it?
Thanks
In the Intel-specific optimization properties you should have /QxAVX available. /arch doesn't include Intel-specific optimizations; only /arch:SSE2 and SSE3 are available. /arch:AVX is planned but may not be fully implemented yet.