I recently noticed a large inconsistency in the power consumption data
retrieved from the perf_event hardware counters.
The inconsistency appeared after a kernel upgrade from 4.4.0 to 4.14.20.
Processor specs: 2 x Intel(R) Xeon(R) CPU E5-2650 v3, 2.30 GHz (10 cores with 2 threads per core, per processor).
So, overall it's a 20-core, 40-thread machine.
TDP of the machine -- 210 Watts (105 Watts for each Xeon processor)
In my research lab, we use Performance Co-Pilot to measure power
consumption of the CPU and DRAM through the use of perfevent hardware
counters and RAPL.
Old results (before kernel upgrade)
Kernel Version: 4.4.0
Idle power consumption of CPU -- 33.34 Watts
Idle power consumption of DRAM -- 14.76 Watts
New results (after kernel upgrade)
Kernel Version: 4.14.20
Idle power consumption of CPU -- 33.23 Watts
Idle power consumption of DRAM -- 59.15 Watts
Can anyone give me some insight as to why this is the case?
The DRAM RAPL domain uses a different energy unit than the package domain, but only the package-domain unit can be read from a register (6.1E-5 J). As far as I know, the energy unit for the DRAM domain is hard-coded in the kernel sources (something like package energy unit / 4 = 15.3E-6 J). Your numbers look as if the DRAM domain is being scaled with the same energy unit as the PKG domain. This is just a guess, based on the fact that the scaling factor between your results on kernel 4.4 and 4.14 is almost exactly 4 (59.15/14.76 = 4.008).
The Xeon E5 v3 datasheet (document 330784) says that the DRAM energy unit on this product is fixed at 15.3 microjoules (1/65536 J), independent of the energy unit used for the RAPL package domain.
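To make the factor of 4 concrete, here is a minimal Python sketch of the unit arithmetic. It assumes the package energy unit quoted above (6.1E-5 J) is exactly 2^-14 J, and it uses a made-up counter delta chosen to reproduce the 14.76 W idle reading; the function name and the one-second interval are illustrative and do not come from the kernel or from Performance Co-Pilot.

```python
# Energy units in joules per counter tick.
PKG_ENERGY_UNIT = 2.0 ** -14   # ~6.1e-5 J; assumed exact value of the quoted package unit
DRAM_ENERGY_UNIT = 2.0 ** -16  # ~15.3e-6 J; fixed DRAM unit (1/65536 J) from the datasheet


def average_power_watts(raw_delta, seconds, energy_unit):
    """Convert a raw energy-counter delta into average power in watts."""
    return raw_delta * energy_unit / seconds


# Hypothetical counter delta: chosen so the correct unit yields ~14.76 W over 1 s.
raw_delta = round(14.76 / DRAM_ENERGY_UNIT)

correct = average_power_watts(raw_delta, 1.0, DRAM_ENERGY_UNIT)
wrong = average_power_watts(raw_delta, 1.0, PKG_ENERGY_UNIT)

print(f"DRAM unit: {correct:.2f} W, package unit: {wrong:.2f} W, "
      f"ratio: {wrong / correct:.1f}")
```

Scaling the same raw delta with the package unit inflates the result by exactly 2^-14 / 2^-16 = 4, which is consistent with the 59.15/14.76 ratio reported in the question.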
This fixed DRAM energy unit appears to be carried forward on newer processors, but documentation is sparse (and sometimes incorrect).
It is pretty easy to do a "sanity check" on these numbers. The maximum power consumption of a DIMM will be in the 4-5 Watt range, so I typically check the power consumption while running STREAM and divide by the number of DIMMs. If the power per DIMM is in the 16 Watt range, you are using the wrong energy unit -- most likely the one from the RAPL_ENERGY_UNIT MSR.
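The per-DIMM check described above could be scripted roughly as follows. This is only a sketch: the function name, the 5 W ceiling (taken from the 4-5 Watt range mentioned above), and the example readings for a hypothetical 8-DIMM system are assumptions, not measurements from this thread.

```python
def dram_unit_looks_wrong(total_dram_watts, num_dimms, max_watts_per_dimm=5.0):
    """Return True if the reported DRAM power per DIMM is implausibly high.

    total_dram_watts: DRAM power reported under load (e.g. while running STREAM).
    num_dimms: number of installed DIMMs.
    max_watts_per_dimm: assumed upper bound for a single DIMM.
    """
    if num_dimms <= 0:
        raise ValueError("num_dimms must be positive")
    return total_dram_watts / num_dimms > max_watts_per_dimm


# Hypothetical readings for a system with 8 DIMMs under load.
print(dram_unit_looks_wrong(32.0, 8))    # 4 W per DIMM: plausible
print(dram_unit_looks_wrong(128.0, 8))   # 16 W per DIMM: wrong energy unit suspected
```

The check is only meaningful under memory load; as noted below, idle DRAM power is much harder to bound.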
Idle power is a lot harder to bound. Obviously it will be less than the maximum, but I have seen very large fluctuations in DRAM power on systems that were "idle". There are a number of power-saving mechanisms available in DDR4 DRAMs, but it is not easy to identify when these will be used and when they will not be used. There are performance counters in the memory controller of the uncore that can be used to monitor some of the power-saving modes (POWER_CHANNEL_DLLOFF, POWER_CHANNEL_PPD, POWER_CKE_CYCLES, POWER_SELF_REFRESH), but I have not tried to correlate these with energy consumption measurements.
Sorry for the late reply.
It turns out that the issue was due to an incompatibility between Ubuntu xenial and Linux kernels newer than 4.7.10.
I upgraded my distro to Ubuntu bionic (18.04) and the kernel to 4.15.x, and now the RAPL readings for the DRAM domain are fine.
Thanks for the valuable input.