Benchmark performance increases with deeper C-states (linux)

I have been running the linpack and netperf benchmarks on Ubuntu 12.04. My machine has a SandyBridge processor with 2 physical cores (4 logical CPUs). I ran with different frequency and C-state configurations and found that benchmark performance increased when dma_latency was not set to 0 (i.e., when deeper C-states were allowed). How can this be?

Here are the details:

I have been able to switch between the intel_idle and acpi_idle drivers by adding intel_idle.max_cstates=0 to the kernel command line in GRUB at boot time. I checked that this is working with 'cat /sys/devices/system/cpu/cpuidle/current_driver'.

I can switch scaling governors with 'sh -c "echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"'. I then used the userspace governor and used the msr tools ('wrmsr -p i 0x1a0 0x4000850089' for i = 0-3) to disable turbo on all 4 logical CPUs. Then I could set the processor frequency with 'cpufreq-set -f freq -c i'. I checked that this was working using i7z, powertop, and turbostat.
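Collected into one place, those steps might look like the following sketch. The wrmsr and cpufreq-set invocations are the ones quoted above; the script defaults to a dry run that only logs the commands, since the real writes need root and the msr-tools/cpufrequtils packages installed.

```shell
# Sketch of the setup steps above (the 0x4000850089 value for IA32_MISC_ENABLE
# is the one quoted in the post, with the turbo-disable bit set).
# DRY_RUN defaults to on: commands are logged to setup.log instead of run.
# Set DRY_RUN= (empty) and run as root to actually apply them.
DRY_RUN=${DRY_RUN-1}
rm -f setup.log
run() { if [ -n "$DRY_RUN" ]; then echo "$@" | tee -a setup.log; else "$@"; fi; }

FREQ=${FREQ:-1600MHz}                  # hypothetical target frequency
for i in 0 1 2 3; do
    run sh -c "echo userspace > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor"
    run wrmsr -p "$i" 0x1a0 0x4000850089
    run cpufreq-set -f "$FREQ" -c "$i"
done
```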

Setting C-states was a little trickier. I could disable them with a small program, copied from the powertop manual, that opens the dma_latency file, sets the latency to 0, and keeps the file open. However, I have not yet been able to decipher the instructions in the Intel manuals for setting them via the msr tools (which would be the preferred method).
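The same trick can be sketched in shell. This illustrates the mechanism only, not the powertop-manual program itself; the demo file name is an assumption, and the real /dev/cpu_dma_latency requires root.

```shell
# Hold a PM QoS latency request active by keeping the file open.
# QOS_FILE defaults to a scratch file so the sketch runs unprivileged;
# point it at /dev/cpu_dma_latency (as root) for the real thing.
QOS_FILE=${QOS_FILE:-/tmp/qos_hold.demo}
exec 3> "$QOS_FILE"
printf '0' >&3                  # ASCII is accepted as well as a binary int32
echo "latency pinned to 0 via $QOS_FILE (press Enter or close stdin to release)"
read -r _                       # the kernel drops the request when fd 3 closes
exec 3>&-
```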

I ran the benchmarks from a bash script that also logged frequency and C-state residency with turbostat and i7z (at 1-second intervals). I first ran the benchmarks with the acpi_idle driver, dma_latency=0, and the processor frequency set to different values, and got very nice, predictable results. I monitored the laptop's power draw with a WattsUp meter and the processor temperature with lm_sensors, and all of these tools gave what was expected: benchmark performance increased linearly with processor frequency, power went up quadratically, and processor temperature also rose with frequency. I also collected data for the different scaling governors (with and without turbo) under both acpi_idle and intel_idle.
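The harness described above has roughly this shape (SAMPLER and BENCH are placeholders for the actual turbostat invocation and benchmark runs, not the poster's script; turbostat itself needs root):

```shell
# Sample frequency/C-state residency once per second in the background
# while the benchmark runs, then stop the sampler.
SAMPLER=${SAMPLER:-turbostat -i 1}      # placeholder sampler command
BENCH=${BENCH:-./run_benchmarks.sh}     # placeholder for linpack/netperf runs
$SAMPLER > cstate_freq.log 2>&1 &
SAMPLER_PID=$!
$BENCH
kill "$SAMPLER_PID" 2>/dev/null
echo "frequency/C-state samples are in cstate_freq.log"
```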

The unexpected result is that when I ran the same benchmarks with dma_latency=0 versus letting the C-states wander where they will (I could watch this with turbostat), benchmark performance increased when the C-states were allowed to go deeper (such as C6). This was true for linpack and netperf and for all settings of governor and frequency. It is the opposite of what I expected and of everything I have read: deeper C-states are supposed to carry a performance hit. I have double-checked my results to be sure I am not making a mistake or seeing some spurious effect. Is there any explanation for this?

If you are wondering why I am doing this, it is for a class project.

Any insights would be deeply appreciated. Also, I am wondering if there are any other benchmarks that I could run that would give a good indicator of the effect of different C-state occupancies.

Regards,

Beverly


Accepted Solutions

Aha!   I remember something!


If you set "idle=poll" and you are using HyperThreading, then the kernel idle loop will be executing instructions and fighting for resources in the physical processor cores.   If the LINPACK benchmark decides to use only two threads (since there are only two physical cores), then there will be two "idle" threads that spin in a tight polling loop waiting to be assigned a process to run.  This would be fine if they were on their own cores, but with HyperThreading they will slow down the compute threads.

You should get an improvement in performance if you can force the code to use 4 logical processors instead of 2.   The overall result will be slower because the code is typically blocked for one thread per L2 cache (so you will get lots of extra L2 misses), but at least they will be getting real work done instead of just fighting for issue slots.

Of course the right thing to do is never use HyperThreading and "idle=poll" at the same time!
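One quick way to test this suggestion is to run the benchmark at both thread counts and compare (a sketch only; BENCH is a placeholder for the real linpack command):

```shell
# Compare 2 threads vs 4 threads under 'idle=poll', per the suggestion above.
BENCH=${BENCH:-echo linpack-run}        # placeholder so the sketch is runnable
for n in 2 4; do
    echo "--- OMP_NUM_THREADS=$n ---"
    OMP_NUM_THREADS=$n $BENCH
done
```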

Other notes:

  • If I understand correctly, it is not possible to "disable" the C1 state, but if you set "idle=poll" the kernel will never *use* the C1 state.  
    • The C1E state is another one that can't be disabled, but it won't affect performance on single-socket systems.
  • We also use the "intel_idle.max_cstates=0" boot option to disable the intel_idle driver and replace it with the acpi_idle driver. We disable the C3 and higher-numbered states by opening the "/dev/cpu_dma_latency" file and writing a value of 75. (This is done in a loadable kernel module that we never unload, so the file is never closed.)
"Dr. Bandwidth"

7 Replies

Hello Beverly,


I reproduced what you saw on my IvyBridge laptop.

I got linpack from http://www.panticz.de/Linpack. I set my frequency to 1.5 GHz (the nominal frequency is 2.5 GHz). I don't know much about the dma_latency mechanism, so I opted for booting the Linux kernel with and without 'idle=poll'. Booting with 'idle=poll' turns off C-states.

Taking the size=10000 linpack case, I found:

cstates    Time (s)   GFlops
disabled   46.201     14.4340
enabled    30.629     21.7723

So the C-states-disabled case took about 46 seconds to finish and the C-states-enabled case took about 31 seconds.

Turbostat showed a frequency of 1.5 GHz in both cases. It also showed that the 'idle=poll' case had zero time in C1, C3, C6 and C7. The case without 'idle=poll' had no time in C3, C6 or C7, but it did spend time in C1; sometimes one of the HT threads of each core was 100% in C1.

I don't know much about linpack internals. It seems (from the turbostat data) that linpack alternates between using both HT logical CPUs on a core and using only one.

I'm not sure why enabling the C-states improves performance. I was going to guess that the extra power used by the 'should be idle' HT thread reduces the power available to the active HT thread, but that doesn't make much sense, since the power used at 1.5 GHz is about half the peak power when the frequency is unrestricted.

Maybe someone else (Tim Prince?) has better insight.

Pat


The fast result of 21.7723 GFLOPS corresponds to 90.7% of peak for two cores at 1.5 GHz, so this is pretty clearly the correct performance level.

A couple of ideas might help clarify what is happening.

"Dr. Bandwidth"

Hi,

Thanks for the responses. I took more data to investigate and found that the performance increase (which is real) only happens between latency=0 and latency=3 us. This corresponds to the transition between no C-states at all and C1. I read up on C1, and it is not really a "slow" wake-up state. Beyond C1 (higher latencies) there is a slight performance hit, but it is small.
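The sweep described above can be sketched as follows. QOS_FILE defaults to a scratch file so the sketch runs unprivileged; point it at /dev/cpu_dma_latency as root, and set BENCH to the real benchmark, for an actual run.

```shell
# Hold the latency target at each value while a benchmark runs.
QOS_FILE=${QOS_FILE:-/tmp/qos_sweep.demo}
BENCH=${BENCH:-echo benchmark-run}      # placeholder benchmark command
for lat in 0 3 20 200; do               # 0 = no C-states, 3 us ~ C1 only
    exec 3> "$QOS_FILE"                 # request is active while fd 3 is open
    printf '%s' "$lat" >&3
    echo "=== latency target ${lat} us ==="
    $BENCH
    exec 3>&-                           # closing releases the request
done
```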

So it's something in particular about the C1 state. I am getting the impression that setting the latency to 0 is not a very efficient mode to operate in. I wonder if the difference between no C-states and C1 is how the threading operates.

I also recall reading somewhere that with the intel_idle driver C1 is set automatically (even with latency=0) and this is starting to make a lot more sense. I am currently set to acpi_idle.


Beverly


Hello Dr. McCalpin,


I ran linpack with HT disabled. In this case the performance was the same whether or not 'idle=poll' was used, and it was slightly faster than the "HT enabled, cstates enabled" case.

So I'm going to agree with Dr. McCalpin that (in this case) HT slows down the other hardware thread on the same core when the C-states are disabled.

Pat


I don't know how the Intel HPL implementation handles HyperThreading, so a couple of different things might be happening with HT enabled and C-states enabled.

The fastest DGEMM implementations typically use block sizes of about 100, so that three blocks fit in each core's L2 cache. If two threads run on a core, you must either shrink the block size (which hurts performance slightly) or accept a significant increase in L2 miss rate (which also hurts performance slightly). The implementation might also decide not to use all the logical processors -- if it runs one thread per physical core, performance should be close to the performance with HT disabled.
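As a quick sanity check on that blocking claim (assuming 8-byte doubles and a 256 KiB per-core L2, as on SandyBridge):

```shell
# Three 100x100 blocks of doubles vs a 256 KiB per-core L2.
echo "$((3 * 100 * 100 * 8)) bytes in three blocks; $((256 * 1024)) bytes of L2"
```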

"Dr. Bandwidth"

Hi All,

Is there a way to validate the C-state latencies provided in intel_idle.c for, say, Cherry Trail?

Any examples?

The goal is to validate these latencies and optimize the platform.

 

thanks

Venky
