I am trying to measure approximately how many cycles a function takes.
I want to try it with `rdtsc` and with a performance counter read via `rdpmc`, after having configured CPU_CLK_UNHALTED.THREAD_P.
If I understand correctly:
* rdtsc uses a constant frequency
* CPU_CLK_UNHALTED.THREAD_P uses the core (thread?) frequency
I tried to measure how many cycles the function `sleep(1)` takes.
Because my CPU is an Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, I expect around 3600000000 cycles.
That is approximately what I get with rdtsc, but the result of CPU_CLK_UNHALTED.THREAD_P is smaller than that of rdtsc.
I think this is because the core clock is stopped during the sleep call: when I run another process that only executes an infinite loop (`while(1);`) on the same thread, rdtsc and CPU_CLK_UNHALTED.THREAD_P give roughly the same result.
The question is: is it possible to prevent a core from stopping its clock and make it run at a constant frequency?
I tried :
- in /etc/default/grub : GRUB_CMDLINE_LINUX="processor.max_cstate=1 intel_idle.max_cstate=0"
- echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
- cpupower frequency-set -g performance
But it doesn't work well.
There are two questions here -- one about preventing processor halting and one about preventing frequency shifting.
(1) The best way to stop a core from going into a "Halt" state is to run something on it continuously. The "sleep" call sets up an OS timer and then gives the logical processor back to the OS (in effect a "yield()"). If the OS has no processes to run on that logical processor, it will use the monitor/mwait mechanism to put the logical processor into the C1 state (which is a "halted" state). In this case, the CPU_CLK_UNHALTED.THREAD_P counter will measure only the time it takes to set up the timer, yield the logical processor, restart the user process, and reach the next performance counter read. This is likely to be a few tens of microseconds out of the 1 second interval.
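As a rough illustration of the TSC side of this measurement, here is a minimal sketch that counts TSC ticks across an arbitrary call. It assumes GCC or Clang on x86 Linux; `tsc_ticks_during` and `sleep_10ms` are hypothetical names chosen for this example, not part of any library. Reading CPU_CLK_UNHALTED.THREAD_P with `rdpmc` needs counter programming and user-mode access to be set up first, so that part is not shown.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc(); GCC/Clang on x86 only (assumption) */

/* Hypothetical helper: TSC ticks that elapse while fn() runs.
 * No serializing instruction is used, so this is only suitable
 * for intervals much longer than the pipeline depth. */
uint64_t tsc_ticks_during(void (*fn)(void))
{
    uint64_t start = __rdtsc();
    fn();
    return __rdtsc() - start;
}

/* Hypothetical workload: block in the kernel for about 10 ms.
 * The logical processor may be halted for most of that time. */
void sleep_10ms(void)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
    nanosleep(&ts, NULL);
}
```

Calling `tsc_ticks_during(sleep_10ms)` should return a value close to 10 ms worth of TSC ticks, because the TSC keeps counting while the core is halted. An unhalted-cycle counter read around the same call would cover only the short periods before and after the halt.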
It should only take a few tries to calibrate an active spin loop to run for very close to 1 second. Dependent integer "add" instructions should run at one per cycle, so a loop that accumulates the sum of the integers from 1 to 3,600,000,000 should take close to one second if the logical processor is actually running at 3.6 GHz.
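A minimal sketch of such a spin loop follows, assuming GCC or Clang on Linux. The names `spin_sum` and `spin_seconds` are hypothetical. The empty inline `asm` statement is a compiler extension used here so that the optimizer can neither replace the loop with the closed-form sum nor vectorize it.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Sum the integers 1..n with a chain of dependent integer adds.
 * The sum for n = 3,600,000,000 is about 6.48e18, which still
 * fits in an unsigned 64-bit integer. */
uint64_t spin_sum(uint64_t n)
{
    uint64_t sum = 0;
    for (uint64_t i = 1; i <= n; i++) {
        sum += i;
        /* Empty asm (GCC/Clang extension): tells the compiler that
         * sum is read and modified here, so every add must really
         * be executed, one after the other. */
        __asm__ volatile("" : "+r"(sum));
    }
    return sum;
}

/* Wall-clock seconds taken by spin_sum(n), for calibration. */
double spin_seconds(uint64_t n)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    (void)spin_sum(n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```

Built with optimization enabled (for example `-O2`), `spin_seconds(3600000000ULL)` can serve as the starting point: if the result is not close to one second, scale `n` by the ratio and try again. Without optimization the accumulator is likely to live in memory, and the loop will then run slower than one add per cycle.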
It is difficult to completely prevent cores from going into the Halt state without using explicit spin loops, but you are getting close. In addition to disabling the Intel C-state driver, you also need to disable the C1 state (which is a "halted" state) that the OS uses when it does not have a process to run on a core. I think that the extra kernel option that you need is "idle=poll", which will cause the OS to put logical processors into a software spin loop when there is no process ready to run. This is generally considered a bad idea, since it will waste a lot of power, but there may be some cases where the faster "wake-up" time for an idle logical processor using this mode is a benefit.
(2) Preventing frequency shifting depends on the version of Linux. For recent versions of Linux (3.x and newer), the frequencies are controlled by an Intel-provided p-state controller that is built into the kernel (i.e., not a visible loadable kernel module). The user interface is the "cpupower" command. The command I use to fix the frequency is of the form:
cpupower frequency-set --min 3600M --max 3600M
For convenience, I often set the frequency to match the nominal frequency so that core cycles match reference cycles and TSC cycles.
Note that if a core is not being used (as in your "sleep()" example), this will not prevent it from going into a "halt" state. In the "halt" state, the CPU_CLK_UNHALTED.THREAD_P counter will not increment. But the full story is more complex. Testing around the edges of the "halt" state suggests that the requested "minimum" and "maximum" core frequencies are ignored by the hardware when a processor is "halted". When a logical processor comes out of the "halt" state, it looks like it starts at the minimum supported frequency (typically 1.2 GHz) and is then ramped up to a frequency in the requested range. This only affects very short-duration measurements (e.g., millisecond scale) immediately after coming out of the "halt" state -- once the processor has "ramped up", the hardware obeys the requested minimum and maximum frequencies (unless there is another applicable limiter, such as power or temperature, that forces the hardware to choose a lower frequency).
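One way to look for this ramp-up yourself is to time a series of identical work chunks immediately after a sleep. The sketch below is only an illustration under stated assumptions (GCC or Clang, x86 Linux); `dependent_adds` and `ticks_after_wakeup` are hypothetical names, and the 100 ms sleep length is an arbitrary choice.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc(); GCC/Clang on x86 only (assumption) */

/* Fixed amount of work: n dependent integer adds. The empty asm
 * (GCC/Clang extension) keeps the compiler from folding the loop. */
uint64_t dependent_adds(uint64_t n)
{
    uint64_t sum = 0;
    for (uint64_t i = 1; i <= n; i++) {
        sum += i;
        __asm__ volatile("" : "+r"(sum));
    }
    return sum;
}

/* Sleep for about 100 ms so the core can halt, then record the TSC
 * ticks taken by each of `count` identical work chunks. */
void ticks_after_wakeup(uint64_t *ticks, int count, uint64_t adds_per_chunk)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };
    nanosleep(&ts, NULL);

    for (int k = 0; k < count; k++) {
        uint64_t t0 = __rdtsc();
        (void)dependent_adds(adds_per_chunk);
        ticks[k] = __rdtsc() - t0;
    }
}
```

Since each chunk does the same number of core cycles of work while the TSC runs at a fixed rate, chunks executed at a lower core frequency should show more TSC ticks. If the first few entries of `ticks` are larger than the later ones, that is consistent with the ramp-up described above, although interrupts and cache effects can produce similar outliers.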