
What does CYCLES_NO_DISPATCH do on Sandy Bridge?

Karl_W_
Beginner
Hi,

I was looking through the "Performance Tuning Techniques For Intel® Microarchitecture Code Name Sandy Bridge" section of the "Optimization Reference Manual" (July 2013) when I got a bit puzzled by the CYCLE_ACTIVITY.CYCLES_NO_EXECUTE monitoring event. I could not find this event for Sandy Bridge (my platform is a Xeon E5-1650 (06_2DH)) in the SDM. However, CYCLE_ACTIVITY.CYCLES_NO_DISPATCH on Sandy Bridge seems to be the same as CYCLE_ACTIVITY.CYCLES_NO_EXECUTE on Ivy Bridge, which has the same event number (0xA3) and umask (0x04). Is this correct?

The next thing I was wondering about with respect to this event is: what is it actually counting? I would assume that it counts all cycles in which no execution port is busy. However, I have some measurements that resulted in CYCLES_NO_DISPATCH > CPU_CLK_UNHALTED.CORE. This would suggest that either my assumption is wrong or the core is actually doing less than no work.

As an alternative I tried UOPS_DISPATCHED with cmask 0x01 and the INV bit set. This gives a more realistic count for the number of cycles in which no uop was dispatched. Is there a way to use this instead of CYCLES_NO_EXECUTE? Note in particular the distinction I am making between dispatching and being busy: would a never-ending stream of DIVs be counted on every tick, or only on the cycle in which the uop is dispatched (so roughly 1/10 of all clock cycles)?

Thanks,
Karl
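For readers who want to reproduce the two counter configurations mentioned in this post, here is a minimal sketch of how the event number, umask, cmask and INV bit pack into an IA32_PERFEVTSELx value. The helper name `encode_perfevtsel` is made up for illustration. The 0xA3/0x04 encoding is the one quoted in the post; the 0xB1/0x01 encoding for UOPS_DISPATCHED is an assumption that should be checked against the SDM event table for your processor. The USR, OS and EN bits are deliberately left out, since tools normally set those themselves.

```python
def encode_perfevtsel(event, umask, cmask=0, inv=False, edge=False):
    """Pack event-selection fields into the IA32_PERFEVTSELx bit layout.

    Bits 0-7: event select, bits 8-15: unit mask, bit 18: edge detect,
    bit 23: invert, bits 24-31: counter mask.
    """
    for name, field in (("event", event), ("umask", umask), ("cmask", cmask)):
        if not 0 <= field <= 0xFF:
            raise ValueError(f"{name} must fit in 8 bits")
    value = event
    value |= umask << 8
    value |= int(edge) << 18
    value |= int(inv) << 23
    value |= cmask << 24
    return value


# Event number and umask as quoted in the post (no cmask, no INV).
cycles_no_dispatch = encode_perfevtsel(0xA3, 0x04)

# ASSUMPTION: 0xB1/0x01 is taken to be UOPS_DISPATCHED on this platform.
# cmask=1 with INV set counts cycles in which fewer than one uop was dispatched.
idle_dispatch_cycles = encode_perfevtsel(0xB1, 0x01, cmask=1, inv=True)

print(f"CYCLES_NO_DISPATCH selector:       0x{cycles_no_dispatch:x}")
print(f"UOPS_DISPATCHED cmask=1,inv=1:     0x{idle_dispatch_cycles:x}")
```

The cmask field turns an event that can count several occurrences per cycle into a cycle counter: the counter increments only in cycles where the event count is at least cmask, and INV flips that comparison to "less than cmask".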
McCalpinJohn
Honored Contributor III

Similar questions arise with instructions that are retried until their operands are available. The Performance Optimization guide says that on Sandy Bridge: "The Scheduler component queues micro-ops until all source operands are ready. Schedules and dispatches ready micro-ops to the available execution units (...)". But we know that some counters (e.g., events 10h and 11h) that are labelled "uops executed" are actually incremented multiple times when the corresponding instructions are waiting on cache misses. So clearly the scheduler is *not* queuing the micro-ops until all the source operands are ready. Instead it appears to be issuing the instruction to the execution unit, letting the execution unit reject the instruction if the arguments are not ready, then repeating the process until the required source operands are actually ready and the instruction can execute. For the STREAM benchmark I see the Event 10h and 11h floating-point counters over-counting by factors of 4 to 6 depending on the system load, while workloads with few cache misses see over-counting of as little as a few percent.
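As a rough illustration of what "over-counting by a factor of N" means here, the sketch below compares a counter reading with the number of operations a kernel is known to perform. The function name and all numbers are hypothetical placeholders, not measurements from this thread, and the expected count assumes scalar code with one counted uop per floating-point operation (vectorized code would need a different expected value).

```python
def overcount_factor(counted, expected):
    """Ratio of the counter reading to the analytically expected count."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return counted / expected


# HYPOTHETICAL example: a triad-style loop a[i] = b[i] + q * c[i]
# performs one multiply and one add per element.
n_elements = 10_000_000           # hypothetical array length
expected_fp_uops = 2 * n_elements  # assumes scalar code, one uop per FP op
counter_reading = 100_000_000      # hypothetical counter value, not real data

factor = overcount_factor(counter_reading, expected_fp_uops)
print(f"over-count factor: {factor:.2f}")
```

A factor close to 1 would suggest few retries, while a larger factor is consistent with uops being counted more than once before they finally execute.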

I have not checked to see if these redundant floating-point instructions are being counted by Event A1 as cycles in which uops are dispatched to various ports, or if cycles in which all uops issued end up being rejected are counted in the CYCLES_NO_DISPATCH event.  (I would guess that they are counted as dispatched uops and therefore not counted as NO_DISPATCH cycles, but it is hard to tell without careful testing.)

It would be a lot easier if the documentation were more precise in its use of the words "issue", "dispatch", and "execute".   I noticed that Intel does not use the word "reject" in reference to the core operation in either the Performance Optimization Guide or Volume 3 of the SDM, so I am guessing that they are not interested in discussing the microarchitecture at that level of detail.  Unfortunately that level of detail has a strong impact on values that show up in the performance counters, leading to a fair amount of confusion....

Karl_W_
Beginner
Thanks for the answer, even though it seems to raise even more questions. Following your hint about over-counting in UOPS_EXECUTED/DISPATCHED, I set up a small test that is scalar and uses divisions (thus reducing cache misses to nearly zero). With that test, the correlation between UOPS_EXECUTED and UOPS_EXECUTED_PORT_X seems very consistent. However, I still cannot determine what CYCLES_NO_DISPATCH could possibly be counting. To be honest, I am not even sure what it is supposed to be counting. I would be very grateful if you could give me a hint on that.

Thanks,
Karl
Patrick_F_Intel1
Employee

Hello Karl,

I'm looking at the event definitions in the SDM vol 3 (Feb 2014), table 19-7, (including link here for others... you probably have this already).

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf

I know the event descriptions can be skimpy. Believe it or not, a lot of time and discussion goes into these sometimes not-very-descriptive descriptions. But muddling onwards...

The 'UOPS_DISPATCHED_PORT.*' events (0xA1) talk about UOPs getting dispatched to ports, so apparently 'dispatching' refers to UOPs getting sent to ports. The first two 'CYCLE_ACTIVITY' events talk about counting cycles when cache misses are pending. I'm guessing that 'CYCLE_ACTIVITY.CYCLES_NO_DISPATCH' refers to cycles during which no UOPs are sent to the ports due to cache misses (and maybe other reasons). This is probably a measure of starvation of the front-end due to the backend (caches) not delivering UOPs or required operands.

I will check with the powers-that-be if this is accurate.

Pat

Karl_W_
Beginner
Hello Pat,

thanks for trying to dissect the actual meaning of CYCLE_ACTIVITY.CYCLES_NO_{DISPATCH,EXECUTE}. It sounds quite logical and it matches what I was thinking. What puzzled me was that the CYCLE_ACTIVITY.CYCLES_NO_DISPATCH count is actually larger than CPU_CLK_UNHALTED_CORE, which means it cannot simply be counting cycles. Perhaps it is really counting UOPs that were not dispatched due to cache misses.

Here are some measurements for a simple C = A / B example, which should keep cache misses low because of the long execution time of the divides:

Event                                 Count
CPU_CLK_UNHALTED_CORE                 2.20646e+09
UOPS_DISPATCHED_THREAD                5.5317e+08
UOPS_DISPATCHED_THREAD:C1             3.48104e+08
UOPS_DISPATCHED_THREAD:C1 inverse     1.86662e+09
CYCLE_ACTIVITY.CYCLES_NO_DISPATCH     7.43333e+09

Karl
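A quick sanity check on the figures quoted in this post can be sketched as follows. The counter values are copied from the post; the variable names and the interpretation in the comments are only one reading of the numbers, not a confirmed explanation of the event.

```python
# Counter values as quoted in the post.
cycles = 2.20646e9        # CPU_CLK_UNHALTED_CORE
uops = 5.5317e8           # UOPS_DISPATCHED_THREAD
active_cycles = 3.48104e8  # UOPS_DISPATCHED_THREAD:C1
idle_cycles = 1.86662e9    # UOPS_DISPATCHED_THREAD:C1 inverse
no_dispatch = 7.43333e9    # CYCLE_ACTIVITY.CYCLES_NO_DISPATCH

# If cmask=1 and its inverse partition the unhalted cycles,
# their sum should be close to the cycle count.
coverage = (active_cycles + idle_cycles) / cycles

# Fraction of cycles with no uop dispatched, by the cmask/INV method.
idle_fraction = idle_cycles / cycles

# A pure per-cycle count cannot exceed 1.0 here.
no_dispatch_ratio = no_dispatch / cycles

# Average uops dispatched in cycles where at least one was dispatched.
uops_per_active_cycle = uops / active_cycles

print(f"(C1 + C1 inverse) / cycles : {coverage:.4f}")
print(f"idle fraction              : {idle_fraction:.4f}")
print(f"NO_DISPATCH / cycles       : {no_dispatch_ratio:.4f}")
print(f"uops per active cycle      : {uops_per_active_cycle:.4f}")
```

The cmask=1 count and its inverse add up to within about half a percent of the unhalted cycle count, which is what one would expect if they split the cycles into "dispatched something" and "dispatched nothing". CYCLES_NO_DISPATCH, by contrast, comes out at more than three times the cycle count, so as programmed in this measurement it cannot be incrementing by at most one per cycle.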