I am maintaining some tests and building others to measure instruction throughput. Starting with Core2Duo/Merom and continuing with Nehalem, I have noticed that obtaining more than 1 instruction per cycle of throughput for instructions like:
is not trivial on Core2 and iCore platforms. Today I built a simple loop that repeatedly executes xorps to see whether I can get a throughput of 3, as documented in the Nehalem optimization guide, on a processor with model 0x1a and family 0x06. That code is below, but I have only gotten a throughput of 1 xorps per cycle.
The Nehalem optimization guide documents using xorps to break dependencies. Can someone explain why I observe this behavior? Is the register file not fully ported for reads, so that throughput is limited to 1 when a source must be read from the register file, but reaches 3 when the source is caught within the reservation station?
Any explanation would be very helpful. Thanks ...
[bash]xorpsthr( &nloops, &uint_rdtsc_eax_edx, &intarray)
        .text
        .global xorpsthr_
xorpsthr_:
        push %rbp
        push %rbx
        push %r12
        push %r13
        push %r14
        push %r15
        sub $56,%rsp
        mov %rdx,%rbp
        rdtsc
        movl %eax,(%rsi)
        movl %edx,4(%rsi)
        movl (%rdi),%ecx
        movl %ecx,(%rsp)
        movl %ecx,48(%rsp)
        xorps %xmm4,%xmm4
        xorps %xmm5,%xmm5
        xorps %xmm6,%xmm6
        xorps %xmm7,%xmm7
        xorps %xmm8,%xmm8
        xorps %xmm9,%xmm9
loop:
        xorps %xmm4,%xmm4
        xorps %xmm5,%xmm5
        xorps %xmm6,%xmm6
        xorps %xmm7,%xmm7
        xorps %xmm8,%xmm8
        xorps %xmm9,%xmm9
        xorps %xmm4,%xmm4
        xorps %xmm5,%xmm5
        xorps %xmm6,%xmm6
        xorps %xmm7,%xmm7
        xorps %xmm8,%xmm8
        xorps %xmm9,%xmm9
        movl $12,4(%rsp)
        decl 48(%rsp)
        jnz loop
        movl (%rsp),%eax
        imul 4(%rsp),%eax
        movl %eax,16(%rsi)
        rdtsc
        movl %eax,8(%rsi)
        movl %edx,12(%rsi)
        add $56,%rsp
        pop %r15
        pop %r14
        pop %r13
        pop %r12
        pop %rbx
        pop %rbp
        ret
[/bash]
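For anyone who wants to reproduce the number, here is a rough sketch of how the five 32-bit values the routine stores through its second argument could be turned into a throughput figure. The helper names `tsc_join` and `xorps_per_tick` are made up for illustration, and the sample values are invented, not measurements. The result is in xorps per TSC tick, which only equals xorps per core cycle if the core happens to run at the TSC rate.

```python
def tsc_join(lo, hi):
    """Combine the EAX (low) and EDX (high) halves of an rdtsc reading."""
    return ((hi & 0xFFFFFFFF) << 32) | (lo & 0xFFFFFFFF)


def xorps_per_tick(results):
    """Hypothetical helper: xorps executed per TSC tick.

    results mirrors what xorpsthr_ stores via its second argument:
    [0], [1] = start TSC low/high, [2], [3] = end TSC low/high,
    [4] = nloops * 12, the number of xorps executed inside the loop.
    """
    start = tsc_join(results[0], results[1])
    end = tsc_join(results[2], results[3])
    ticks = end - start
    if ticks <= 0:
        raise ValueError("end TSC must be later than start TSC")
    return results[4] / ticks


# Made-up sample values for illustration only, not measurements.
sample = [1000, 0, 13000, 0, 12000]
print(f"{xorps_per_tick(sample):.2f} xorps per TSC tick")
```

The elapsed ticks also cover the loop overhead (the store, the decrement and the branch) and the six xorps before the loop, so for small loop counts the figure will come out somewhat below the true steady-state rate.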
I am not sure how you measured performance. Did you try the IACA tool (you can download it from software.intel.com/en-us/avx)? It prints throughput and performance information. One thing I will point out: each instruction has two performance characteristics, latency and throughput. If you schedule the same instruction many times without any other instruction in between, you put a lot of pressure on one port (wherever that instruction gets executed). In this case I believe latency will also play its part.
Even though the processor has multiple ports, that does not mean any instruction can execute on any of them. Instructions are bound to specific ports. The numbers given in the performance manual apply more to realistic applications, where these instructions execute in parallel with other instructions. It is hard to find a use case where the same instruction executes 100 times in a row.
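To make the latency/throughput/port point concrete, here is a deliberately simplified model, not a description of how any particular core actually schedules. The function name `steady_state_throughput` and the port counts passed to it are assumptions for illustration. It ignores front-end limits, register read ports and retirement width.

```python
def steady_state_throughput(latency, ports, independent_chains):
    """Rough upper bound on instructions per cycle for a run of one instruction.

    latency: cycles before a result can feed a dependent instruction
    ports: number of execution ports that accept this instruction (assumed)
    independent_chains: number of independent dependency chains in flight
    """
    if latency <= 0 or ports <= 0 or independent_chains <= 0:
        raise ValueError("all arguments must be positive")
    # Each chain can complete at most one instruction per `latency` cycles,
    # and the ports cap how many instructions can start in a single cycle.
    return min(ports, independent_chains / latency)


# Six independent registers, latency 1, bound to a single port.
print(steady_state_throughput(latency=1, ports=1, independent_chains=6))
# Same code if the instruction could issue on three ports.
print(steady_state_throughput(latency=1, ports=3, independent_chains=6))
# One dependent chain of a 2-cycle instruction: latency is the limit.
print(steady_state_throughput(latency=2, ports=3, independent_chains=1))
```

Under this model, six independent chains of a 1-cycle instruction are limited by the number of ports rather than by latency, so a measured rate of 1 per cycle would be consistent with the instruction being bound to a single port.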
[bash]add,r-r,8b , 1,2.91
, ,16b, 1,1.45
, ,32b, 1,2.92
, ,64b, 1,2.91
adc,r-r,8b , 2,0.50
, ,16b, 2,0.50
, ,32b, 2,0.50
, ,64b, 2,0.50
and,r-r,8b , 1,1.76
, ,16b, 1,1.45
, ,32b, 1,2.91
, ,64b, 1,2.92
or,r-r,8b , 1,1.76
, ,16b, 1,1.45
, ,32b, 1,2.92
, ,64b, 1,2.92
xor,r-r,8b , 1,1.76
, ,16b, 1,1.45
, ,32b, 1,2.92
, ,64b, 1,2.92[/bash]
[bash]LOGICAL,FP ,orps , 1,0.98
, ,orpd , 1,0.98
, ,andps , 1,0.98
, ,andpd , 1,0.98
, ,xorps , 1,0.98
, ,xorpd , 1,0.98
, ,andnps, 1,0.98
, ,andnpd, 1,0.98
,INT,pand , 1,1.22
, ,pandn , 1,1.24
, ,por , 1,1.24
, ,pxor , 1,1.23
[/bash]
Now on an AMD Phenom I get for ALU latency/throughput:
[bash]add,r-r,8b , 1,2.90
, ,16b, 1,2.91
, ,32b, 1,2.89
, ,64b, 1,2.91
and,r-r,8b, 1,2.91
, ,16b, 1,2.91
, ,32b, 1,2.90
, ,64b, 1,2.89
or,r-r,8b, 1,2.85
, ,16b, 1,2.91
, ,32b, 1,2.91
, ,64b, 1,2.87
xor,r-r,8b, 1,2.89
, ,16b, 1,2.91
, ,32b, 1,2.91
, ,64b, 1,2.89
[/bash]
[bash]LOGICAL,FP ,orps , 2,1.99
, ,orpd , 2,1.98
, ,andps , 2,1.98
, ,andpd , 2,1.96
, ,xorps , 2,1.99
, ,xorpd , 2,1.98
, ,andnps, 2,1.98
, ,andnpd, 2,1.99
[/bash]