SIMD instruction throughput observations

perfwise · 08-19-2010

I am maintaining some tests and building others to measure instruction throughput. Starting with Core2Duo/Merom and continuing with Nehalem, I have noticed that obtaining a throughput of more than 1 instruction per cycle for instructions like:

xorps
andps
pand
por
etc...

is not trivial on Core2 and iCore platforms. Today I built a simple loop that repeatedly executes xorps to see whether I can get a throughput of 3, as documented in the NH opt guide, on a processor with family 0x06 and model 0x1a. That code is below, but I've only gotten a throughput of 1 xorps per cycle.

There's documentation in the NH opt guide about breaking dependencies using xorps. Can someone explain why I observe this behavior? Is the register file not fully ported for reads, so that the throughput is limited to 1 when an operand must be read from the register file, but is 3 when the operand is caught within the reservation station?

Any explanations would be very helpful. Thanks ...

perfwise

[bash]# Called as: xorpsthr(&nloops, &uint_rdtsc_eax_edx, &intarray)
#   %rdi = &nloops, %rsi = &uint_rdtsc_eax_edx, %rdx = &intarray

.text
.global xorpsthr_
xorpsthr_:
    push    %rbp
    push    %rbx
    push    %r12
    push    %r13
    push    %r14
    push    %r15
    sub     $56,%rsp

    mov     %rdx,%rbp           # save &intarray; rdtsc overwrites %edx
    rdtsc                       # start timestamp
    movl    %eax,(%rsi)         #   low 32 bits
    movl    %edx,4(%rsi)        #   high 32 bits
    movl    (%rdi),%ecx         # nloops
    movl    %ecx,(%rsp)         # copy kept for the op count
    movl    %ecx,48(%rsp)       # loop counter
    xorps   %xmm4,%xmm4
    xorps   %xmm5,%xmm5
    xorps   %xmm6,%xmm6
    xorps   %xmm7,%xmm7
    xorps   %xmm8,%xmm8
    xorps   %xmm9,%xmm9
loop:
    xorps   %xmm4,%xmm4
    xorps   %xmm5,%xmm5
    xorps   %xmm6,%xmm6
    xorps   %xmm7,%xmm7
    xorps   %xmm8,%xmm8
    xorps   %xmm9,%xmm9
    xorps   %xmm4,%xmm4
    xorps   %xmm5,%xmm5
    xorps   %xmm6,%xmm6
    xorps   %xmm7,%xmm7
    xorps   %xmm8,%xmm8
    xorps   %xmm9,%xmm9
    movl    $12,4(%rsp)         # xorps per iteration
    decl    48(%rsp)
    jnz     loop
    movl    (%rsp),%eax
    imul    4(%rsp),%eax        # total xorps = nloops * 12
    movl    %eax,16(%rsi)
    rdtsc                       # stop timestamp
    movl    %eax,8(%rsi)        #   low 32 bits
    movl    %edx,12(%rsi)       #   high 32 bits
    add     $56,%rsp
    pop     %r15
    pop     %r14
    pop     %r13
    pop     %r12
    pop     %rbx
    pop     %rbp
    ret
[/bash]
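
For readers who want to turn the raw output of the listing above into a throughput number, here is a minimal C sketch. It assumes the buffer layout read off the stores in the listing (start TSC at offsets 0 and 4, stop TSC at offsets 8 and 12, op count at offset 16). The names `tsc_join`, `elapsed_ticks` and `ops_per_tick` are made up for illustration and are not part of the original test harness. Note that the result is in instructions per TSC tick, which equals instructions per core cycle only if the core runs at the TSC frequency during the test, and that the op count is a 32-bit product, so very large loop counts would wrap.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the buffer passed in %rsi, read off the stores in the
 * listing: [0]/[1] = start TSC low/high, [2]/[3] = stop TSC low/high,
 * [4] = nloops * 12 (number of xorps executed in the loop). */
enum { TSC_START_LO, TSC_START_HI, TSC_STOP_LO, TSC_STOP_HI, OP_COUNT };

/* Join the EDX:EAX halves returned by rdtsc into one 64-bit count. */
uint64_t tsc_join(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Elapsed TSC ticks between the two rdtsc reads. */
uint64_t elapsed_ticks(const uint32_t buf[5])
{
    uint64_t start = tsc_join(buf[TSC_START_LO], buf[TSC_START_HI]);
    uint64_t stop  = tsc_join(buf[TSC_STOP_LO],  buf[TSC_STOP_HI]);
    assert(stop > start);   /* an empty or negative interval means a bad read */
    return stop - start;
}

/* Instructions per TSC tick (hypothetical helper, see caveats above). */
double ops_per_tick(const uint32_t buf[5])
{
    return (double)buf[OP_COUNT] / (double)elapsed_ticks(buf);
}
```

The interval also includes the six setup xorps and the final imul/store, so the number is only meaningful for loop counts large enough to make that overhead negligible.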

perfwise · 08-20-2010

The behavior I note above, where the throughput of repeated sequences of instructions is much less than documented, surely has an explanation. Someone on the performance team at Intel must be aware of this if they have tried to directly measure the throughput of SIMD INT instructions. I'm simply asking why it occurs. If nobody is aware of this, then I'm surprised that nobody has written tests that illustrate these anomalies in throughput on Core2 and iCore processors.

I observe this for all logical instructions and for various unrollings of the loop above. On Core2 platforms I was able to observe increased throughput by removing dependencies using:

MOVAPS %xmm0,%xmm4
MOVAPS %xmm0,%xmm5
MOVAPS %xmm0,%xmm6
MOVAPS %xmm0,%xmm7
MOVAPS %xmm0,%xmm8
MOVAPS %xmm0,%xmm9

but on iCore this is no longer effective. Because of this difference in behavior between Core2 and iCore, I'd like to understand the mechanism responsible.

Thanks to anyone for some insight..

perfwise

Brijender_B_Intel · 08-23-2010

I am not sure how you measured performance. Did you try the IACA tool (you can download it from software.intel.com/en-us/avx)? It prints throughput and performance information. One thing I will point out: each instruction has two performance characteristics, latency and throughput. If you schedule the same instruction multiple times without any other instruction in between, you put a lot of pressure on one port (wherever that instruction gets executed). In this case I believe latency will also play its part.
Even though the processor has multiple ports, that does not mean any instruction can execute on any of those ports; instructions are bound to ports. The numbers in the performance manual are more for realistic applications where these instructions execute in parallel with other instructions. It is hard to find a use case where the same instruction executes 100 times.
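
To make the latency versus throughput distinction above concrete, here is a minimal C sketch of an idealized lower bound on execution time. It is a toy model, not a description of any real scheduler: it assumes identical instructions, a fixed number of ports that can each start one such instruction per cycle, and a set of equally long independent dependency chains. The function name `min_cycles` and all parameters are hypothetical.

```c
#include <stdint.h>

/* Lower bound on cycles for n_ops identical instructions, assuming:
 *   - `ports` execution ports, each starting one instruction per cycle,
 *   - every instruction has the same `latency` in cycles,
 *   - the instructions form `chains` independent dependency chains of
 *     (nearly) equal length.
 * Returns 0 for degenerate inputs (no work, no ports, no chains). */
uint64_t min_cycles(uint64_t n_ops, uint64_t latency,
                    uint64_t ports, uint64_t chains)
{
    if (n_ops == 0 || ports == 0 || chains == 0)
        return 0;

    /* Throughput limit: the ports can only start so many ops per cycle. */
    uint64_t port_bound = (n_ops + ports - 1) / ports;

    /* Latency limit: the longest chain must execute serially. */
    uint64_t chain_len = (n_ops + chains - 1) / chains;
    uint64_t latency_bound = chain_len * latency;

    return port_bound > latency_bound ? port_bound : latency_bound;
}
```

Under this model, 12 fully independent 1-cycle instructions need at least 4 cycles on 3 ports and 12 cycles on 1 port, and a single dependent chain of the same 12 instructions needs 12 cycles regardless of the port count.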

perfwise · 08-23-2010

Well.. I think we are confusing a couple of things. Excuse me if what I state below is incorrect.

1) Each register file entry has R read ports and W write ports.

2) Scheduling in the chip schedules based upon pipes and each pipe services some unit, as discussed in the Intel/AMD optguides.

3) Of course each instruction has lat/thr associated with it, but you also assume that the tests behave deterministically rather than randomly. Randomness in results implies some misunderstood underlying mechanism, which is what I'm describing.

4) There are other tools, written by Agner Fog, Everest and others. These are useful for people who optimize in assembly and try to get the most from their investment in CPU/ISV tools.

So.. there's no problem executing 5 ANDPS instructions in a row, so long as no instruction depends on the previous ones. There could be an issue if only 1 pipe serviced that instruction/uop, which would restrict performance and would show up in the measurements. However, this is not the case, since the NH opt guide states that 3 logical operations can be performed per cycle on NH. So there is no issue with pipe assignment.

Those instructions have to snarf data from the register file through read ports or get data bypassed to a reservation station. Data will always be written to the register file. Scheduling of operations is based upon arguments being available. What I'm asking is: is there an issue or mechanism that keeps XORPS, ANDPS, etc. uops from executing because they cannot get their argument data?

The only thing I can think of is that the register file is not fully ported for reads, so it cannot support reads from every pipe's unit at once. It could be that if you have 3 pipes and each pipe has 2 read ports, you need 6 R ports to each register file entry. Maybe this isn't supported and some scheduling heuristic prevents oversubscription of the read ports. Maybe this is why I've seen this behavior since Core2 and now on iCore chips. Expanding ports into the register file is very difficult, so limiting them would make sense, I believe, from a timing perspective and possibly from a power one as well.
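
The arithmetic behind this hypothesis can be sketched in a few lines of C. This is only a restatement of the guess above, not a documented property of any processor: the port counts are placeholders, and the helper names `read_ports_needed` and `regfile_limited_uops_per_cycle` are made up for illustration. It also assumes every source operand comes from the register file, with nothing supplied by the bypass network.

```c
/* Read ports needed for every pipe to fetch all of its sources from the
 * register file in the same cycle. */
unsigned read_ports_needed(unsigned pipes, unsigned srcs_per_uop)
{
    return pipes * srcs_per_uop;
}

/* Upper bound on uops started per cycle when all sources must pass through
 * `read_ports` register file read ports (hypothetical model; bypassed
 * operands are ignored). The result is capped at the number of pipes. */
double regfile_limited_uops_per_cycle(unsigned read_ports,
                                      unsigned srcs_per_uop,
                                      unsigned pipes)
{
    if (srcs_per_uop == 0)
        return (double)pipes;          /* no register reads, no limit */

    double by_ports = (double)read_ports / (double)srcs_per_uop;
    return by_ports < (double)pipes ? by_ports : (double)pipes;
}
```

With 3 pipes and 2 sources per uop, 6 read ports would allow the full 3 uops per cycle, while a hypothetical 2 read ports would cap the rate at 1 per cycle, which is the kind of limit the measurements would be consistent with if this guess is right.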

If you are at high IPC, presumably instructions are queued up quickly in the FP/INT scheduler and you are able to snarf data from the bypass network to the execution unit. If you're at low IPC you may be hindered by not having fully ported R ports to the register file, but then you are likely limited elsewhere earlier, possibly by cache misses (think prefetch) or memory bandwidth (prefetch + Uncore capacity).

I hope what I'm stating makes sense; please correct me if I'm wrong. My tests clearly illustrate the latency/throughput of AMD platforms, which have a register file fully ported for R/W to each pipe's units. On Intel I don't observe the documented throughput. Yes, my tests are contrived, but compiler writers/optimizers would like to understand the mechanisms that speed up or slow down their code. That's the purpose of this question.

Here's an example of the ALU latency/throughput on NH:

[bash]add,r-r,8b , 1,2.91
   ,   ,16b, 1,1.45
   ,   ,32b, 1,2.92
   ,   ,64b, 1,2.91
adc,r-r,8b , 2,0.50
   ,   ,16b, 2,0.50
   ,   ,32b, 2,0.50
   ,   ,64b, 2,0.50
and,r-r,8b , 1,1.76
   ,   ,16b, 1,1.45
   ,   ,32b, 1,2.91
   ,   ,64b, 1,2.92
or ,r-r,8b , 1,1.76
   ,   ,16b, 1,1.45
   ,   ,32b, 1,2.92
   ,   ,64b, 1,2.92
xor,r-r,8b , 1,1.76
   ,   ,16b, 1,1.45
   ,   ,32b, 1,2.92
 , ,64b, 1,2.92[/bash]

So no problems here, we can do 3 ALU logical operations.. no problem. Now NH SIMD LOGICAL operations:

[bash]LOGICAL,FP ,orps  , 1,0.98
       ,   ,orpd  , 1,0.98
       ,   ,andps , 1,0.98
       ,   ,andpd , 1,0.98
       ,   ,xorps , 1,0.98
       ,   ,xorpd , 1,0.98
       ,   ,andnps, 1,0.98
       ,   ,andnpd, 1,0.98
       ,INT,pand  , 1,1.22
       ,   ,pandn , 1,1.24
       ,   ,por   , 1,1.24
       ,   ,pxor  , 1,1.23
[/bash]

Now on an AMD Phenom I get for ALU latency/throughput:

[bash]add,r-r,8b , 1,2.90
   ,   ,16b, 1,2.91
   ,   ,32b, 1,2.89
   ,   ,64b, 1,2.91
and,r-r,8b , 1,2.91
   ,   ,16b, 1,2.91
   ,   ,32b, 1,2.90
   ,   ,64b, 1,2.89
or ,r-r,8b , 1,2.85
   ,   ,16b, 1,2.91
   ,   ,32b, 1,2.91
   ,   ,64b, 1,2.87
xor,r-r,8b , 1,2.89
   ,   ,16b, 1,2.91
   ,   ,32b, 1,2.91
   ,   ,64b, 1,2.89
[/bash]

and for FP SIMD LOGICAL:

[bash]LOGICAL,FP ,orps  , 2,1.99
       ,   ,orpd  , 2,1.98
       ,   ,andps , 2,1.98
       ,   ,andpd , 2,1.96
       ,   ,xorps , 2,1.99
       ,   ,xorpd , 2,1.98
       ,   ,andnps, 2,1.98
       ,   ,andnpd, 2,1.99
[/bash]

This is the crux of my question. These tests have been very useful to me in the past. As stated previously, there are other applications that measure latency/throughput, but my tests are more precise and I understand *precisely* what they measure, instead of relying on some 3rd party vendor tool to tell me, with who knows what assumptions built in. My tool makes no assumptions, which is especially valuable for determining the different INT/FP load latencies and FP<->GPR latencies.

perfwise