Similar questions arise with instructions that are retried until their operands are available. The Performance Optimization Guide says that on Sandy Bridge: "The Scheduler component queues micro-ops until all source operands are ready. Schedules and dispatches ready micro-ops to the available execution units (...)". But we know that some counters (e.g., Events 10h and 11h) that are labelled "uops executed" are actually incremented multiple times when the corresponding instructions are waiting on cache misses. So clearly the scheduler is *not* queuing the micro-ops until all the source operands are ready. Instead it appears to be issuing the instruction to the execution unit, letting the execution unit reject the instruction if the arguments are not ready, then repeating the process until the required source operands are ready and the instruction can actually execute. For the STREAM benchmark I see the Event 10h and 11h floating-point counters over-counting by factors of 4 to 6 depending on the system load, while workloads with few cache misses see over-counting as low as a few percent.
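To make the "over-counting factor" concrete, here is a minimal sketch of the arithmetic, assuming the STREAM Triad kernel (a[i] = b[i] + q*c[i]) compiled to packed floating-point instructions, with one multiply and one add per vector. The function names, array size, iteration count, and counter reading below are all hypothetical values chosen for illustration, not measurements.

```python
def expected_triad_fp_uops(n_elements: int, iterations: int,
                           elements_per_vector: int) -> int:
    """Expected FP arithmetic uops for STREAM Triad, a[i] = b[i] + q*c[i].

    Assumes one packed multiply and one packed add per vector, and that
    n_elements is a multiple of elements_per_vector (no remainder loop).
    """
    vectors = n_elements // elements_per_vector
    return 2 * vectors * iterations


def overcount_factor(counted_uops: int, expected_uops: int) -> float:
    """Ratio of what the counter reported to what the code should execute."""
    if expected_uops <= 0:
        raise ValueError("expected_uops must be positive")
    return counted_uops / expected_uops


# Hypothetical run: 80M doubles, 10 iterations, 2 doubles per 128-bit vector.
expected = expected_triad_fp_uops(80_000_000, 10, 2)

# Hypothetical sum of the Event 10h/11h counts for the same region.
counted = 4_000_000_000

print(f"expected uops: {expected}")
print(f"over-count factor: {overcount_factor(counted, expected):.2f}")
```

A factor near 1.0 would mean the counter tracks retired work; anything well above it would be consistent with the retry behaviour described above.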
I have not checked to see if these redundant floating-point instructions are being counted by Event A1 as cycles in which uops are dispatched to various ports, or if cycles in which all uops issued end up being rejected are counted in the CYCLES_NO_DISPATCH event. (I would guess that they are counted as dispatched uops and therefore not counted as NO_DISPATCH cycles, but it is hard to tell without careful testing.)
It would be a lot easier if the documentation were more precise in its use of the words "issue", "dispatch", and "execute". I noticed that Intel does not use the word "reject" in reference to the core operation in either the Performance Optimization Guide or Volume 3 of the SDM, so I am guessing that they are not interested in discussing the microarchitecture at that level of detail. Unfortunately that level of detail has a strong impact on values that show up in the performance counters, leading to a fair amount of confusion....
I'm looking at the event definitions in the SDM vol 3 (Feb 2014), table 19-7, (including link here for others... you probably have this already).
I know the event descriptions can be skimpy. Believe it or not, a lot of time and discussion goes into these sometimes not-very-descriptive descriptions. But muddling onwards...
The 'UOPS_DISPATCHED_PORT.*' events (event 0xa1) talk about UOPs getting dispatched to ports. So apparently 'dispatching' refers to UOPs getting sent to ports. The first two 'CYCLE_ACTIVITY' events talk about counting cycles when cache misses are pending. I'm guessing that 'CYCLE_ACTIVITY.CYCLES_NO_DISPATCH' refers to cycles during which no UOPs are sent to the ports due to cache misses (and maybe other reasons). This is probably a measure of starvation of the front-end due to the backend (caches) not delivering UOPs or required operands.
I will check with the powers-that-be if this is accurate.
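For anyone who wants to test this directly, here is a rough sketch of how an event number and unit mask are packed into an IA32_PERFEVTSELx value, and into the raw-event string that Linux perf accepts. The bit positions follow the architectural performance-monitoring layout; the umask value 0x01 used with event 0xa1 is only an illustrative assumption, so take the real umask (and any required CMASK) from Table 19-7. The helper names are made up for this example.

```python
def perfevtsel(event: int, umask: int, cmask: int = 0, inv: bool = False,
               edge: bool = False, usr: bool = True, os: bool = True,
               enable: bool = True) -> int:
    """Pack fields into an IA32_PERFEVTSELx value (architectural layout)."""
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF and 0 <= cmask <= 0xFF):
        raise ValueError("event, umask and cmask must each fit in 8 bits")
    value = event                 # bits 0-7: event select
    value |= umask << 8           # bits 8-15: unit mask
    value |= int(usr) << 16       # bit 16: count in user mode
    value |= int(os) << 17        # bit 17: count in kernel mode
    value |= int(edge) << 18      # bit 18: edge detect
    value |= int(enable) << 22    # bit 22: enable counter
    value |= int(inv) << 23       # bit 23: invert the cmask comparison
    value |= cmask << 24          # bits 24-31: counter mask
    return value


def perf_raw(event: int, umask: int) -> str:
    """Raw event string for `perf stat -e`, in the form r<umask><event>."""
    return f"r{umask:02x}{event:02x}"


# Event 0xa1 comes from the discussion above; umask 0x01 is an assumption.
print(hex(perfevtsel(0xA1, 0x01)))
print(perf_raw(0xA1, 0x01))
```

Counting the 0xa1 port events alongside CYCLES_NO_DISPATCH on a cache-missing loop would show whether rejected uops are counted as dispatched.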