I'm new to reading performance counters, so I may be completely off base here. I think I am doing things correctly, but please let me know if I'm wrong.
I am using wrmsr and rdmsr on Linux to read MSRs and performance counters on Westmere X5650 CPUs. Some of our processors are getting throttled for unknown reasons, and the only thing I can determine is that bit 2 of MSR IA32_THERM_STATUS is set, which indicates that "PROCHOT# or FORCEPR# is being asserted by another agent on the platform".
I have looked at several events so far and this is what I have found:

event num, mask - detected or not
0x67, 0x1 - NOTHING (UNC_DRAM_THERMAL_THROTTLED)
0x80, 0x1 - NOTHING (UNC_THERMAL_THROTTLING_TEMP.CORE_0) core was above temp
0x81, 0x1 - DETECTS (UNC_THERMAL_THROTTLED_TEMP.CORE_0) core was throttled because it was above temp
0x82, 0x1 - DETECTS (UNC_PROCHOT_ASSERTION) entire proc above threshold
0x83, 0x1 - DETECTS (UNC_THERMAL_THROTTLING_PROCHOT.CORE_0)
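
For concreteness, here is a rough sketch of how the event numbers and unit masks listed above could be packed into an uncore event-select value before writing it with wrmsr. The field layout (event in bits 7:0, unit mask in bits 15:8, enable in bit 22) and the helper name `encode_uncore_evtsel` are my assumptions for illustration and should be checked against the SDM for this CPU.

```python
# Assumed field positions for an uncore event-select register.
# These are illustrative; verify them against the SDM before use.
EVENT_SHIFT = 0    # event number, bits 7:0 (assumed)
UMASK_SHIFT = 8    # unit mask, bits 15:8 (assumed)
ENABLE_BIT = 22    # counter enable (assumed)


def encode_uncore_evtsel(event, umask, enable=True):
    """Pack an event number and unit mask into one event-select value."""
    if not 0 <= event <= 0xFF:
        raise ValueError("event must fit in 8 bits")
    if not 0 <= umask <= 0xFF:
        raise ValueError("umask must fit in 8 bits")
    value = (event << EVENT_SHIFT) | (umask << UMASK_SHIFT)
    if enable:
        value |= 1 << ENABLE_BIT
    return value


# Event numbers and masks taken from the list in this post.
EVENTS = {
    "UNC_DRAM_THERMAL_THROTTLED": (0x67, 0x1),
    "UNC_THERMAL_THROTTLING_TEMP.CORE_0": (0x80, 0x1),
    "UNC_THERMAL_THROTTLED_TEMP.CORE_0": (0x81, 0x1),
    "UNC_PROCHOT_ASSERTION": (0x82, 0x1),
    "UNC_THERMAL_THROTTLING_PROCHOT.CORE_0": (0x83, 0x1),
}

for name, (event, umask) in EVENTS.items():
    print(f"{name}: {encode_uncore_evtsel(event, umask):#x}")
```

The printed values are what would be passed to wrmsr for the chosen event-select register; which register address to use is not shown here.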
Is it possible for UNC_THERMAL_THROTTLING_TEMP.CORE_0 not to increment at all while UNC_THERMAL_THROTTLED_TEMP.CORE_0 does?
According to SDM 3B, Appendix A.3, Table A-5:

UNC_THERMAL_THROTTLING_TEMP.CORE_0: Cycles that the PCU records that core is above the thermal throttling threshold temperature.
UNC_THERMAL_THROTTLED_TEMP.CORE_0: Cycles that the PCU records that core is in the power throttled state due to core's temperature being above the thermal throttling threshold.
Based on my limited understanding, if the "core is in the power throttled state due to the core's temperature being above the thermal throttling threshold", then the counter that shows that the "core is above the thermal throttling threshold temperature" should increment by the same amount or more. However, it does not increment at all.
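
The expectation above can be written as a small check on two counter deltas. This is only a sketch of my reasoning: the function name is made up, and the sample numbers are placeholders rather than measured values.

```python
def counters_consistent(throttling_delta, throttled_delta):
    """Return True if the THROTTLING_TEMP count is at least the THROTTLED_TEMP count.

    Expectation (my reading of the SDM descriptions): cycles spent throttled
    because of temperature should be a subset of cycles spent above the
    thermal throttling threshold.
    """
    if throttling_delta < 0 or throttled_delta < 0:
        raise ValueError("counter deltas must be non-negative")
    return throttling_delta >= throttled_delta


# Placeholder values shaped like what I observe: event 0x80 stays at zero
# while event 0x81 increments.
print(counters_consistent(throttling_delta=0, throttled_delta=12345))  # False
```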
IA32_THERM_STATUS for each core shows only bits 2 and 3 set (external PROCHOT# or FORCEPR#), and not bits 0 or 1 (thermal throttling due to core temperature).
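
For anyone who wants to reproduce the bit check, here is a minimal sketch that reads IA32_THERM_STATUS (MSR 0x19C) through the Linux msr driver and decodes bits 0 to 3. It assumes the msr kernel module is loaded, that /dev/cpu/N/msr exists for each CPU, and that the script runs with enough privileges to read it. The function names are my own.

```python
import os
import struct

IA32_THERM_STATUS = 0x19C

# Low four bits of IA32_THERM_STATUS.
THERM_STATUS_BITS = {
    0: "thermal status",
    1: "thermal status log",
    2: "PROCHOT# or FORCEPR# event",
    3: "PROCHOT# or FORCEPR# log",
}


def decode_therm_status(value):
    """Return the names of bits 0-3 that are set in an IA32_THERM_STATUS value."""
    return [name for bit, name in THERM_STATUS_BITS.items() if value & (1 << bit)]


def read_msr(cpu, address):
    """Read one 64-bit MSR via /dev/cpu/<cpu>/msr (needs the msr module and root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        data = os.pread(fd, 8, address)  # the file offset selects the MSR address
    finally:
        os.close(fd)
    return struct.unpack("<Q", data)[0]


if __name__ == "__main__":
    for cpu in range(os.cpu_count() or 1):
        try:
            value = read_msr(cpu, IA32_THERM_STATUS)
        except OSError as exc:
            print(f"cpu {cpu}: cannot read MSR ({exc})")
            continue
        print(f"cpu {cpu}: {value:#x} {decode_therm_status(value)}")
```

A value with only bits 2 and 3 set (for example 0xC in the low nibble) decodes to the two PROCHOT#/FORCEPR# entries and nothing else, which matches what I see on every core.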