The following experiment is for my graduate research. I wanted experimental data showing how CPU frequency affects performance, in terms of instruction-retirement rate, and power consumption. In addition to the hardware performance counters accessible through perf, I am using the RAPL feature of newer Intel architectures to measure processor power.
The Experimental Setup
I am running on an x86 processor from Intel, i7-3610QM. For the purposes of these experiments, I am using a custom-compiled Linux kernel configured to be non-SMP and to disable cgroup-based scheduling.
During boot, the kernel is passed this additional command-line argument: “idle=poll”. I use a polling idle because I am interested in measuring dynamic power at different CPU frequencies, and I need to know that the processor is continuously running in a P-state and not entering any sleep states.
Once the kernel is booted, I disable limits imposed on the execution of real-time tasks by writing “-1” to “/proc/sys/kernel/sched_rt_runtime_us”.
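A minimal sketch of that step, for anyone reproducing the setup. The function name is only illustrative (it is not part of my tooling), and the write needs root privileges:

```python
from pathlib import Path

# Path quoted in the text above; writing to it requires root.
RT_RUNTIME_PATH = Path("/proc/sys/kernel/sched_rt_runtime_us")


def disable_rt_throttling(path=RT_RUNTIME_PATH):
    """Write -1 to the RT runtime limit and return the value read back.

    Hypothetical helper: the name and the read-back check are assumptions
    made for illustration.
    """
    path.write_text("-1\n")
    return int(path.read_text().strip())
```

Calling `disable_rt_throttling()` as root should return -1 if the kernel accepted the new value.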
I have written a kernel module that, among other things, periodically samples the hardware performance counter that measures the retired-instruction count, using the perf-related kernel API. The module also reads the RAPL energy-status MSRs using inline assembly code. For the purposes of these experiments, the kernel module takes measurements every 10 ms.
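To make the RAPL part concrete, here is a rough sketch of how raw energy-status readings can be turned into joules. It is not the kernel module's code (that runs in the kernel and reads the MSRs directly); it only shows the arithmetic, and the function names are hypothetical. The MSR addresses and bit layout are taken from Intel's Software Developer's Manual: the energy unit lives in bits 12:8 of MSR_RAPL_POWER_UNIT, and the energy-status counter is 32 bits wide, so it wraps around.

```python
# MSR addresses as documented in Intel's Software Developer's Manual.
MSR_RAPL_POWER_UNIT = 0x606
MSR_PKG_ENERGY_STATUS = 0x611


def energy_unit_joules(power_unit_msr):
    """Return the size of one energy count in joules.

    Bits 12:8 of MSR_RAPL_POWER_UNIT hold the energy status unit (ESU);
    one count corresponds to 1 / 2**ESU joules.
    """
    esu = (power_unit_msr >> 8) & 0x1F
    return 1.0 / (1 << esu)


def energy_delta_joules(prev_raw, curr_raw, unit_j):
    """Energy consumed between two reads of the 32-bit energy counter.

    Masking the difference to 32 bits handles a single wraparound
    between the two samples.
    """
    delta = (curr_raw - prev_raw) & 0xFFFFFFFF
    return delta * unit_j
```

With a 10 ms sampling period the counter cannot wrap more than once between samples, so the single-wraparound handling above should be sufficient.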
A user-level program, running at a real-time priority and FIFO scheduling policy, periodically wakes up and acquires these measurements from the kernel module. The measurements are logged in the memory of this user-level program and written to a file at the end of the experiment.
While all this is happening, a background task decodes a 480p MPEG-4 video file. This background application is based on the ffmpeg library. The entire video file to be decoded is read and demuxed into memory during initialization of the application. The video frames are kept and the audio frames are discarded. Once the application has allocated its memory, all of it is locked (using the "mlock" function) to prevent any paging during execution. This application also runs with real-time priority and the FIFO scheduling policy, but at a lower priority than the measurement application mentioned above. These steps are taken to ensure that the CPU is busy decoding video most of the time and not waiting on disk or other I/O operations.
A script is run to do the following steps.
a) Get the set of frequencies available on the platform using the file “/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies”
b) Set the frequency governor to “userspace”
c) For each CPU frequency supported by the platform, loop:
   Do the following 10 times:
   - Start the measurement application in the background for one minute
   - While the measurement application is running:
     - run the ffmpeg application
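The steps above can be sketched roughly as follows. This is not my actual script. The `run_trial` callback is a hypothetical placeholder for "start the measurement application, then run the ffmpeg application", and selecting each frequency through the `scaling_setspeed` file is an assumption about how the userspace governor is driven.

```python
from pathlib import Path

CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")
TRIALS_PER_FREQUENCY = 10


def available_frequencies(cpufreq_dir=CPUFREQ_DIR):
    """Step a): read the supported frequencies (in kHz) from sysfs."""
    text = (cpufreq_dir / "scaling_available_frequencies").read_text()
    return [int(token) for token in text.split()]


def set_userspace_governor(cpufreq_dir=CPUFREQ_DIR):
    """Step b): select the "userspace" governor."""
    (cpufreq_dir / "scaling_governor").write_text("userspace\n")


def set_frequency(khz, cpufreq_dir=CPUFREQ_DIR):
    """Assumed step: pin the CPU to one frequency via scaling_setspeed."""
    (cpufreq_dir / "scaling_setspeed").write_text(f"{khz}\n")


def run_experiment(run_trial, cpufreq_dir=CPUFREQ_DIR,
                   trials=TRIALS_PER_FREQUENCY):
    """Step c): for each frequency, run the trial callback `trials` times.

    `run_trial(khz, trial_index)` is a hypothetical hook standing in for
    launching the measurement application and the ffmpeg application.
    """
    set_userspace_governor(cpufreq_dir)
    for khz in available_frequencies(cpufreq_dir):
        set_frequency(khz, cpufreq_dir)
        for trial in range(trials):
            run_trial(khz, trial)
```

Passing the sysfs directory as a parameter makes it possible to dry-run the loop against a scratch directory before touching the real cpufreq files, which require root.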
The following two scatter plots show CPU frequency on the x-axis and, on the y-axes, the average instructions retired per nanosecond and the average joules per nanosecond, respectively. The averages are taken over the duration of the experiments described above.
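To be explicit about what "average" means here, this is a small sketch of the computation. It assumes each logged sample is a tuple of a timestamp in nanoseconds, a cumulative retired-instruction count, and a cumulative energy reading in joules. That sample layout and the function name are assumptions made for illustration, not the exact format of my log.

```python
def average_rates(samples):
    """Return (instructions per ns, joules per ns) over a whole run.

    `samples` is a time-ordered list of
    (timestamp_ns, instructions_total, energy_joules_total) tuples,
    where the last two fields are cumulative since the start of the run.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    t0, instr0, energy0 = samples[0]
    t1, instr1, energy1 = samples[-1]
    elapsed_ns = t1 - t0
    if elapsed_ns <= 0:
        raise ValueError("timestamps must be increasing")
    return (instr1 - instr0) / elapsed_ns, (energy1 - energy0) / elapsed_ns
```

Note that one joule per nanosecond is 10^9 watts, so multiplying the second value by 10^9 gives the average power in watts.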
(Sorry I'm new to this forum. Not sure how to insert the images here, but I have attached them below.)
I believe the polling idle in x86 Linux is implemented with a piece of assembly code that loops on a series of NOPs. See the following links:
1) The actual polling code (lines 654-662)
2) The code that calls the polling code (lines 42-52)
Are NOPs excluded from the retired-instruction counts measured by the hardware PMCs?
As someone on Quora pointed out, the reason for the described behavior is likely Intel Turbo Boost: