Hi

perfwise · ‎10-10-2012

I was trying to measure the latency/throughput of vpxor and found that if I repeatedly execute:

vpxor %xmm0,%xmm4,%xmm4
vpxor %xmm2,%xmm4,%xmm4
vpxor %xmm0,%xmm4,%xmm4
vpxor %xmm2,%xmm4,%xmm4

I get a throughput of 0.5 which implies a latency of 2. However if I run the following:

vpxor %xmm0,%xmm4,%xmm4
vpxor %xmm4,%xmm0,%xmm0
vpxor %xmm0,%xmm4,%xmm4
vpxor %xmm4,%xmm0,%xmm0

I measure a throughput of 1 and a latency of 1.

Why am I observing this difference? SC scheduling of the operations? I observe also that the operations are flowing evenly down all 3 pipes which can execute vpxor (pipes 0, 1 and 5).

Lastly, I see no SC stall when measuring RESOURCE STALLS in PMC 0xA2. Your PMC stats for 0xA2 imply that you're stalled by your ROB. If your scheduler is 54 entries as documented, wouldn't you stall on the SC before you stall in the ROB, as denoted in your optimization manual?

Thanks for any valuable pointers to the insightful questions I raise above.

Perfwise

SHIH_K_Intel · ‎11-12-2012

Hi,

This is just my personal observation for your consideration.

In general, when it comes to a methodology for writing directed test code to measure latency or throughput, I believe in the following:

1. To measure latency, you must introduce a dependency chain, preferably long enough to overwhelm the in-flight window. You can do this with intra-loop plus inter-loop dependencies.
2. To measure throughput, you want to remove all possible forms of dependency.

So each given piece of code may give you different mileage in terms of what it represents vs. what you intended to produce. The first snippet looks like an attempt to measure the throughput of vpxor, not a latency test of vpxor. The 2nd snippet looks like a reasonable latency test of vpxor. You can change the context of the snippet to measure the throughput of a "4-instruction sequence", but that is not the latency of vpxor.

As to why your observation produces 0.5 cycle, here's an explanation based on plausibility. I think the RAT will do the renaming, so in the dispatching and execution inside the OOO engine, ports 0, 1 and 5 can handle these uops. But retirement must follow "program" order. So even if the result of the 3rd instruction is ready to commit architecturally in the same cycle, the "program" order probably requires it to wait out the conflict in the architectural register. In other words, the first snippet has removed enough dependencies to produce a result that is less than one per cycle, but not enough effort went into removing all dependencies.
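To make the latency vs. throughput distinction in points 1 and 2 concrete, here is a minimal sketch of an idealized dataflow model. It is not a simulator of any real core: the function name `schedule_cycles`, the 1-cycle latency, the 3 execution ports, and the `independent` sequence are all assumptions chosen for illustration, and the model ignores retirement, bypass delays, and scheduler/ROB limits. Instructions are written destination-first, so `vpxor %xmm0,%xmm4,%xmm4` becomes `("xmm4", "xmm0", "xmm4")`.

```python
def schedule_cycles(instrs, latency=1, ports=3):
    """Cycles to finish `instrs` under an idealized model (an assumption).

    Each instruction is (dest, src1, src2). Registers are treated as
    renamed, so only true read-after-write dependencies delay issue.
    """
    ready = {}    # register -> cycle at which its newest value is available
    issued = {}   # cycle -> number of instructions already issued that cycle
    finish = 0
    for dest, *srcs in instrs:
        # Earliest cycle at which every source operand is available.
        start = max((ready.get(s, 0) for s in srcs), default=0)
        # At most `ports` instructions may issue per cycle.
        while issued.get(start, 0) >= ports:
            start += 1
        issued[start] = issued.get(start, 0) + 1
        ready[dest] = start + latency
        finish = max(finish, ready[dest])
    return finish


# The 2nd snippet from the question: every vpxor reads the previous result.
chain = [("xmm4", "xmm0", "xmm4"), ("xmm0", "xmm4", "xmm0")] * 2

# Hypothetical throughput test: no instruction reads another's result.
independent = [(f"xmm{d}", "xmm0", "xmm1") for d in (4, 5, 6, 7)]

print(schedule_cycles(chain))        # 4 cycles: one instruction per cycle
print(schedule_cycles(independent))  # 2 cycles: limited only by the 3 ports
```

In this model the dependent chain is bound by latency (4 instructions take 4 cycles), while the independent sequence is bound only by the number of ports. Real hardware can deviate from both numbers, which is what the measurements in this thread are probing.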

SB/IV throughput difference of 2x for same test of vpxor