I am maintaining some tests and building others to measure instruction throughput. Starting with Core2Duo/Merom and continuing with Nehalem, I have noticed that obtaining a throughput of more than 1 instruction per cycle for instructions like:
is not trivial on Core2 and iCore platforms. Today I built a simple loop that repeatedly executes xorps to see whether I can get a throughput of 3, as documented in the NH opt guide, on a processor with model 0x1a and family 0x06. That code is below, but I've only gotten a throughput of 1 xorps per cycle.
There's documentation in the NH opt guide about breaking dependencies using xorps. Can someone explain why I observe this behavior? Is the register file not fully ported for reads, so that throughput is limited to 1 when an operand must be read from the register file, but is 3 when the operand is caught within the reservation station?
Any explanations are very helpful and insightful. Thanks ...
[bash]# Called as: xorpsthr( &nloops, &uint_rdtsc_eax_edx, &intarray)
        .text
        .global xorpsthr_
xorpsthr_:
        push %rbp
        push %rbx
        push %r12
        push %r13
        push %r14
        push %r15
        sub $56,%rsp
        mov %rdx,%rbp
        rdtsc                   # start timestamp
        movl %eax,(%rsi)
        movl %edx,4(%rsi)
        movl (%rdi),%ecx        # nloops
        movl %ecx,(%rsp)
        movl %ecx,48(%rsp)      # loop counter, kept on the stack
        xorps %xmm4,%xmm4
        xorps %xmm5,%xmm5
        xorps %xmm6,%xmm6
        xorps %xmm7,%xmm7
        xorps %xmm8,%xmm8
        xorps %xmm9,%xmm9
loop:
        xorps %xmm4,%xmm4
        xorps %xmm5,%xmm5
        xorps %xmm6,%xmm6
        xorps %xmm7,%xmm7
        xorps %xmm8,%xmm8
        xorps %xmm9,%xmm9
        xorps %xmm4,%xmm4
        xorps %xmm5,%xmm5
        xorps %xmm6,%xmm6
        xorps %xmm7,%xmm7
        xorps %xmm8,%xmm8
        xorps %xmm9,%xmm9
        movl $12,4(%rsp)        # xorps count per iteration
        decl 48(%rsp)
        jnz loop
        movl (%rsp),%eax
        imul 4(%rsp),%eax       # nloops * 12
        movl %eax,16(%rsi)
        rdtsc                   # end timestamp
        movl %eax,8(%rsi)
        movl %edx,12(%rsi)
        add $56,%rsp
        pop %r15
        pop %r14
        pop %r13
        pop %r12
        pop %rbx
        pop %rbp
        ret[/bash]
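For reference, here is a minimal C sketch of how the five 32-bit words written through the second argument can be turned into a throughput figure. The function names are hypothetical, and the buffer layout is read off the stores in the routine above (start TSC at offsets 0 and 4, end TSC at 8 and 12, nloops * 12 at 16). It also assumes the TSC ticks at the core clock, which may not hold if the core runs at a different frequency than the TSC.

```c
#include <assert.h>
#include <stdint.h>

/* Combine the EAX (low) and EDX (high) halves returned by rdtsc. */
uint64_t tsc_join(uint32_t lo, uint32_t hi) {
    return ((uint64_t)hi << 32) | lo;
}

/* Assumed layout of the result buffer filled by xorpsthr_:
 *   r[0], r[1] = start TSC low/high
 *   r[2], r[3] = end TSC low/high
 *   r[4]       = nloops * 12 (xorps executed inside the timed loop)
 * Returns xorps instructions per TSC tick. */
double xorps_per_tsc_tick(const uint32_t r[5]) {
    uint64_t start = tsc_join(r[0], r[1]);
    uint64_t end = tsc_join(r[2], r[3]);
    assert(end > start); /* guard against a zero or negative interval */
    return (double)r[4] / (double)(end - start);
}
```

Note that the six xorps before the loop and the loop overhead (movl, decl, jnz) sit inside the timed region but are not included in the count, so the result slightly understates the true rate for small nloops.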
I observe this for all logical instructions and for various unrollings of the loop above. On Core2 platforms I was able to observe increased throughput by removing dependencies using:
but on iCore this is no longer effective. Because the behavior differs between Core2 and iCore, I'd like to understand the mechanism responsible for this.
Thanks to anyone for some insight..
I am not sure how you measured performance. Did you try the IACA tool (you can download it from software.intel.com/en-us/avx)? It prints throughput and performance information. One thing I will point out: each instruction has two performance characteristics, latency and throughput. If you schedule the same instruction many times without any other instruction in between, you put a lot of pressure on one port (wherever that instruction gets executed). In this case I believe latency will also play its part.
Even though the processor has multiple ports, that does not mean any instruction can execute on any of those ports; instructions are bound to particular ports. The numbers in the performance manual are more representative of realistic applications, where these instructions execute in parallel with other instructions. It is hard to find a use case where the same instruction executes 100 times in a row.
1) Each register file entry has R read ports and W write ports.
2) Scheduling in the chip schedules based upon pipes and each pipe services some unit, as discussed in the Intel/AMD optguides.
3) Of course each instruction has a latency and throughput associated with it, but you would also assume the tests behave deterministically rather than randomly. Randomness in results implies some misunderstood underlying mechanism, which is what I'm describing.
4) There are other tools written by Agner Fog, Everest and others. These are useful for people who optimize in assembly and try to get the most from their investment in CPU/ISV tools.
So there's no problem executing 5 ANDPS instructions in a row, so long as no instruction depends on the previous ones. There could be an issue if only 1 pipe serviced that instruction/uop, which would restrict performance and would match the observation. However, that is not the case, since the NH opt guide states that 3 logical operations can be performed per cycle on NH. So there is no issue with pipe assignment.
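The reasoning above can be written as a simple bound. This is only a sketch with a hypothetical function name: it considers dependency chains and the number of capable pipes, and ignores everything else (front end, register reads, loop overhead).

```c
#include <assert.h>

/* Upper bound on instructions per cycle for a loop body built from
 * `chains` independent dependency chains of a single instruction type.
 *   chains  : number of independent chains (e.g. distinct registers)
 *   latency : cycles before a result can feed the next instruction
 *   pipes   : number of pipes able to execute the instruction
 * All inputs are illustrative parameters, not measured values. */
double expected_ipc(int chains, int latency, int pipes) {
    assert(chains > 0 && latency > 0 && pipes > 0);
    double latency_bound = (double)chains / (double)latency;
    double pipe_bound = (double)pipes;
    return latency_bound < pipe_bound ? latency_bound : pipe_bound;
}
```

With the 6 registers used in the xorps loop, a latency of 1 and the 3 pipes described in the guide, `expected_ipc(6, 1, 3)` gives 3, so neither dependencies nor pipe assignment would account for a measured throughput of 1 under this model.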
Those instructions have to snarf data from the register file from read ports or get data bypassed to a reservation station. Data will always be written to the register file. Scheduling of operations is based upon arguments being available. What I'm asking is.. is there an issue or mechanism responsible for XORPS, ANDPS, etc. uops not executing because they can not get their argument data?
The only thing I can think of is that the register file is not fully ported for reads, so it cannot support reads from every pipe's unit at once. It could be that if you have 3 pipes and each pipe needs 2 read ports, you need 6 R ports to each register file entry. Maybe this isn't supported, and some scheduling heuristic prevents oversubscription of the read ports. Maybe this is why I've seen this behavior since Core2 and now on iCore chips. Expanding ports into the register file is very difficult, so the restriction would make sense from a timing perspective, and possibly from a power one as well.
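As a rough sketch of that arithmetic (the function names are hypothetical and the port counts are illustrative inputs for the hypothesis, not documented values for any chip):

```c
#include <assert.h>

/* Read ports required for every pipe to fetch all of its source
 * operands from the register file in the same cycle. */
int read_ports_needed(int pipes, int sources_per_uop) {
    assert(pipes > 0 && sources_per_uop > 0);
    return pipes * sources_per_uop;
}

/* Uops per cycle that can start when all sources must come from the
 * register file and only `read_ports` reads are available per cycle. */
int uops_per_cycle_limit(int pipes, int sources_per_uop, int read_ports) {
    assert(pipes > 0 && sources_per_uop > 0 && read_ports >= 0);
    int by_ports = read_ports / sources_per_uop;
    return by_ports < pipes ? by_ports : pipes;
}
```

Under this model, 3 pipes with 2 sources each need 6 read ports, and a register file that could supply only 2 reads per cycle would cap such uops at 1 per cycle. Operands delivered over the bypass network would not count against the limit.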
If you are at high IPC, you presume instructions are being queued up quickly in the FP/INT scheduler and that you are able to snarf data from the bypass to the execution unit. If you're at a low IPC you may be hindered by not having fully ported R ports to the register file, but hey, you are likely limited elsewhere earlier, possibly by cache misses (think prefetch) or memory bandwidth (prefetch + Uncore capacity).
I hope what I'm stating makes sense; please correct me if I'm wrong. My tests clearly illustrate the latency/throughput of AMD platforms, which have a fully ported register file for R/W to each pipe's units. On Intel, I don't observe the prescribed throughput. Yes, my tests are contrived, but compiler writers/optimizers would like to understand the mechanisms that will speed up or slow down their code. That's the purpose of this question.
Here's an example of the ALU latency/throughput on NH:
[bash]add,r-r,8b , 1,2.91
, ,16b, 1,1.45
, ,32b, 1,2.92
, ,64b, 1,2.91
adc,r-r,8b , 2,0.50
, ,16b, 2,0.50
, ,32b, 2,0.50
, ,64b, 2,0.50
and,r-r,8b , 1,1.76
, ,16b, 1,1.45
, ,32b, 1,2.91
, ,64b, 1,2.92
or,r-r,8b , 1,1.76
, ,16b, 1,1.45
, ,32b, 1,2.92
, ,64b, 1,2.92
xor,r-r,8b , 1,1.76
, ,16b, 1,1.45
, ,32b, 1,2.92
, ,64b, 1,2.92[/bash]
So no problems here: we can do 3 ALU logical operations. Now NH SIMD LOGICAL operations:
[bash]LOGICAL,FP ,orps , 1,0.98
, ,orpd , 1,0.98
, ,andps , 1,0.98
, ,andpd , 1,0.98
, ,xorps , 1,0.98
, ,xorpd , 1,0.98
, ,andnps, 1,0.98
, ,andnpd, 1,0.98
,INT,pand , 1,1.22
, ,pandn , 1,1.24
, ,por , 1,1.24
, ,pxor , 1,1.23[/bash]
Now on an AMD Phenom I get for ALU latency/throughput:
[bash]add,r-r,8b , 1,2.90
, ,16b, 1,2.91
, ,32b, 1,2.89
, ,64b, 1,2.91
and,r-r,8b, 1,2.91
, ,16b, 1,2.91
, ,32b, 1,2.90
, ,64b, 1,2.89
or,r-r,8b, 1,2.85
, ,16b, 1,2.91
, ,32b, 1,2.91
, ,64b, 1,2.87
xor,r-r,8b, 1,2.89
, ,16b, 1,2.91
, ,32b, 1,2.91
, ,64b, 1,2.89[/bash]
and for FP SIMD LOGICAL:
[bash]LOGICAL,FP ,orps , 2,1.99
, ,orpd , 2,1.98
, ,andps , 2,1.98
, ,andpd , 2,1.96
, ,xorps , 2,1.99
, ,xorpd , 2,1.98
, ,andnps, 2,1.98
, ,andnpd, 2,1.99[/bash]
This is the crux of my question. These tests have been very useful to me in the past. As stated previously, there are other applications that measure latency/throughput, but my tests are more precise and I understand *precisely* what they measure, instead of relying on some third-party vendor tool that makes who knows what assumptions. My tool makes no assumptions, which is especially valuable for determining different INT/FP load latencies and FP<->GPR latencies.