LY · ‎10-26-2015

Dear all,

These days I am trying to use perf to evaluate my code. I found two hardware events: reference CPU cycles (ref-cpu-cycles) and CPU cycles (cpu-cycles). As I understand it, ref-cpu-cycles counts at a fixed CPU frequency, while cpu-cycles changes along with frequency scaling. What I am confused about is whether I can use ref-cpu-cycles to get a piece of code's execution time. How do I get the fixed CPU frequency? From /proc/cpuinfo? As far as I know, a CPU has no concept of real time unless it is part of a real-time system. So how does ref-cpu-cycles work? Does it have some connection with the external real-time clock (RTC)?

In addition, a modern CPU can execute many more instructions than are strictly necessary. The perf hardware event "instructions" counts retired instructions, which include only the necessary ones. If that is the case, it seems that ref-cpu-cycles and cpu-cycles also count the cycles spent executing unnecessary instructions.

I don't know whether I have made my question clear, or maybe I am totally wrong. Could someone help me?

Thanks.

McCalpinJohn · ‎10-27-2015

The hardware event for "reference cycles not halted" counts at the same rate as the TSC, but only counts while the processor core is not halted (while the TSC always counts). This makes it very convenient to compute the number of cycles (or the fraction of the cycles) that the processor *is* halted.
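A minimal sketch of that calculation, assuming you have already sampled the TSC and the fixed-function "reference cycles not halted" counter at the start and end of an interval on the same core. The function name and the sample numbers are hypothetical; only the arithmetic comes from the description above.

```python
def halted_fraction(tsc_delta: int, ref_unhalted_delta: int) -> float:
    """Fraction of the interval during which the core was halted.

    Assumes both deltas were taken on the same core and that the
    unhalted counter increments at the TSC rate (fixed-function counter).
    """
    if tsc_delta <= 0:
        raise ValueError("tsc_delta must be positive")
    # The two reads are not atomic, so the unhalted delta can slightly
    # exceed the TSC delta on a fully busy core; clamp to avoid negatives.
    unhalted = min(ref_unhalted_delta, tsc_delta)
    return 1.0 - unhalted / tsc_delta


# Made-up numbers: one second on a 2.7 GHz TSC, core busy a quarter of it.
print(halted_fraction(2_700_000_000, 675_000_000))
```

The same subtraction without the division gives the number of halted cycles rather than the fraction.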

There are two different ways of counting "Reference Cycles Not Halted" with the core hardware performance monitoring unit, but they are implemented slightly differently on some systems. One event is available using the "fixed function performance counter 2", which can be read either via MSR 0x30B or using the RDPMC instruction with counter number ((1<<30) + 2). The other event is available using any of the programmable performance counters, using EventSelect 0x3C and UnitMask 0x01.
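To make the two encodings concrete, here is a small sketch that only builds the numbers involved; it does not touch any hardware. The helper names are hypothetical. The raw-event string assumes perf's usual `r<umask><event>` hex syntax for raw core events, so check `perf list` on your own system before relying on it.

```python
# Bit 30 of the RDPMC counter number selects the fixed-function counters.
RDPMC_FIXED_FLAG = 1 << 30


def rdpmc_fixed_index(n: int) -> int:
    """Counter number to pass to RDPMC for fixed-function counter n."""
    return RDPMC_FIXED_FLAG + n


def perf_raw_event(event_select: int, unit_mask: int) -> str:
    """Raw event string in the form perf accepts (assumed r<umask><event>)."""
    return f"r{unit_mask:02x}{event_select:02x}"


# Fixed-function counter 2 (also readable via MSR 0x30B).
print(hex(rdpmc_fixed_index(2)))
# Programmable event: EventSelect 0x3C, UnitMask 0x01.
print(perf_raw_event(0x3C, 0x01))
```

Under that assumption, the second string could be passed to something like `perf stat -e` to count the programmable version of the event.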

These two events count the same thing, but they are scaled differently on some/many/most (???) systems. For example, on my Xeon E5-2680 (Sandy Bridge EP) and my Xeon E5-2660 v3 (Haswell EP) systems, the programmable performance counter increments once for every cycle of the 100 MHz reference clock. In contrast, the fixed-function counter 2 increments by 27 (the nominal CPU and TSC multiplier ratio) for every cycle of the 100 MHz reference clock on the Xeon E5-2680 (2.7 GHz Sandy Bridge EP), and by 26 (the nominal CPU and TSC multiplier ratio) for every cycle of the 100 MHz reference clock on the Xeon E5-2660 v3 (2.6 GHz Haswell EP).
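Because both versions of the event tick at a fixed rate, either count can be turned into unhalted time. A rough sketch, assuming a 100 MHz reference clock and a ratio you already know (27 for the 2.7 GHz part above); the function names and sample counts are made up for illustration.

```python
REF_CLOCK_HZ = 100_000_000  # assumed 100 MHz reference clock


def seconds_from_programmable(count: int) -> float:
    """Unhalted time from the programmable event (one tick per ref clock)."""
    return count / REF_CLOCK_HZ


def seconds_from_fixed(count: int, ratio: int) -> float:
    """Unhalted time from fixed-function counter 2 (ratio ticks per ref clock)."""
    return count / (ratio * REF_CLOCK_HZ)


# Made-up counts that both correspond to 2.5 s of unhalted time at ratio 27.
print(seconds_from_programmable(250_000_000))
print(seconds_from_fixed(6_750_000_000, 27))
```

Note that this is time during which the core was not halted, which only matches wall-clock time for a thread that keeps the core busy for the whole interval.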

I seem to recall that the two versions of this event are scaled the same way on some older systems, but it might have been the documentation that was confusing.

You can obtain the value of the nominal CPU (and TSC) multiplier ratio in several ways. Many people just parse the output of /proc/cpuinfo, and assume that all processors starting with Sandy Bridge use a 100 MHz reference clock. You can also get the value from bits 15:8 of MSR_PLATFORM_INFO (MSR 0xCE). A rather bizarre, but portable, method is to parse the processor "Brand String" returned by the CPUID instruction to get the decimal frequency. An example is provided by the "get_frequency_from_cpuid" function in Intel PCM. You have to know the value of the reference clock (100 MHz from Sandy Bridge to present, 133 MHz for Westmere and prior processors) to compute the ratio.

BUT, this difference in the scaling of the two "reference cycles not halted" events suggests another approach: just read the counters, busy-spin for a while, read the counters again, and divide the differences to get the multiplier ratio. If the measurement interval is at least a few thousand cycles, the ratio calculation should be close enough to an integer for rounding to give you the right answer. (You will want to pin the thread to a single core while doing this, but the result will be applicable to all cores in the system, and since it is the base ratio that is being computed, you don't need to worry about it changing over time.)
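The arithmetic behind three of these methods can be sketched as below. This is only the post-processing: reading the MSR, executing CPUID, and sampling the counters are left out, and all function names, the sample MSR value, and the counter deltas are hypothetical. A 100 MHz reference clock is assumed by default.

```python
import re

REF_CLOCK_HZ = 100_000_000  # assumed: 100 MHz (Sandy Bridge and later)


def ratio_from_platform_info(msr_value: int) -> int:
    """Extract bits 15:8 from a raw MSR_PLATFORM_INFO (0xCE) value."""
    return (msr_value >> 8) & 0xFF


def ratio_from_brand_string(brand: str, ref_clock_hz: int = REF_CLOCK_HZ) -> int:
    """Parse the decimal frequency at the end of a brand string."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*([MG])Hz", brand)
    if match is None:
        raise ValueError("no frequency found in brand string")
    scale = {"M": 1e6, "G": 1e9}[match.group(2)]
    return round(float(match.group(1)) * scale / ref_clock_hz)


def ratio_from_counter_deltas(fixed_delta: int, programmable_delta: int) -> int:
    """Divide the two 'reference cycles not halted' deltas and round."""
    if programmable_delta <= 0:
        raise ValueError("programmable_delta must be positive")
    return round(fixed_delta / programmable_delta)


print(ratio_from_platform_info(0x1B00))  # illustrative value with 27 in bits 15:8
print(ratio_from_brand_string("Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz"))
print(ratio_from_counter_deltas(27_000_135, 1_000_004))  # made-up deltas
```

For Westmere and earlier parts, the brand-string helper would need `ref_clock_hz` set to the 133 MHz reference clock instead.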

LY · ‎10-28-2015

Thanks for answering my questions! I think I understand what you mean. However, "Reference cpu cycles not halted" just represents the cycles when the CPU is not halted. As we know, because of ILP, a modern processor executes many more instructions than necessary, which means that the cycles it spends on unnecessary instructions while it is not halted are also counted in "Reference cpu cycles not halted". If I need to know the reference CPU cycles for a specific piece of code, what do I need to do? Use cpuid?

In addition, can I take the word "reference" to mean that one thing refers to another? So do all the reference CPU cycles (not halted or both) refer to the TSC?

Thanks.

McCalpinJohn · ‎10-28-2015

The term "reference cycles" is a bit confusing because it is used in two different ways. As I described above, the programmable performance counter event CPU_CLK_THREAD_UNHALTED.REF_XCLK (Event 0x3C, Umask 0x01) counts cycles of the 100 MHz "reference clock" that the hardware provides to the chip. The fixed-function counter CPU_CLK_UNHALTED.REF counts at the rate of the TSC, which is 100 MHz times the "maximum non-Turbo" ratio multiplier. The ratio of these two events should always be the same fixed value, since each counts at its own fixed rate while the core is not halted (no matter what the core frequency) and both stop counting when the core is in a halt state.

A processor core is only halted when the OS has no tasks to schedule on that core, so this has nothing to do with how the chip implements instruction-level parallelism.

The implementation of instruction-level parallelism does have an influence on counting processor stalls, which is a very different topic. There are a number of discussions of the complexities of understanding stalls, for example at:

LY · ‎10-28-2015

Many Thanks! I will go through the materials you listed.

How to define Reference CPU cycles in PMU