Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

How to debug ProcHot?

Adrian_C_
Beginner

Not sure if this is the right forum.  We have a Xeon server platform that hits PROCHOT after running Iometer tests.  We can configure our BMC sensors to collect temperature readings.  What other tools are available for debugging?

McCalpinJohn
Honored Contributor III

Not sure what you mean by "debug" in this case.... 

The first thing to do is determine whether this is an "internal" or "external" PROCHOT.  

  • Section 14.8 of Volume 3 of the Intel 64 and IA-32 Architectures Software Developer's Manual (SDM) discusses the IA32_PACKAGE_THERM_STATUS register (MSR 0x1B1).  Bits 0-1 (status and log) tell you about the internal temperature sensors, and bits 2-3 (status and log) tell you about the external PROCHOT signal.
  • Due to the power-throttling in recent processors, I typically monitor IA32_PACKAGE_THERM_STATUS (MSR 0x1B1) and sometimes monitor the core-specific values in each core's IA32_THERM_STATUS register (MSR 0x19C).
    • Starting with the Haswell processors, additional information is available in the MSR_CORE_PERF_LIMIT_REASONS register (MSR 0x690).
    • While many of the bits in these MSRs clearly work as described, I am not able to test all of them, and some of the bits give suspicious results.
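
As a rough illustration of the register checks above, here is a minimal Python sketch.  It assumes a Linux system with the `msr` kernel module loaded and root privileges, which the thread does not state; `read_msr` and the two decode helpers are hypothetical names made up for this example.  The bit positions for MSR 0x1B1 are the ones given above; the bit positions for MSR 0x690 (bits 0-1 status, bits 16-17 log) are from my reading of the SDM for Haswell and should be checked against the manual for your processor.

```python
import os
import struct

IA32_THERM_STATUS = 0x19C             # per-core thermal status
IA32_PACKAGE_THERM_STATUS = 0x1B1     # per-package thermal status
MSR_CORE_PERF_LIMIT_REASONS = 0x690   # Haswell and later (assumption: check your model)


def read_msr(address, cpu=0):
    """Read one 64-bit MSR via the Linux msr driver (assumes /dev/cpu/N/msr exists)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr driver uses the file offset as the MSR address.
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)


def decode_package_therm_status(value):
    """Split bits 0-3 of IA32_PACKAGE_THERM_STATUS into named flags."""
    return {
        "thermal_status": bool(value & (1 << 0)),      # internal sensor: hot right now
        "thermal_status_log": bool(value & (1 << 1)),  # internal sensor: hot since last clear
        "prochot_event": bool(value & (1 << 2)),       # PROCHOT# asserted right now
        "prochot_log": bool(value & (1 << 3)),         # PROCHOT# asserted since last clear
    }


def decode_perf_limit_reasons(value):
    """Decode only the PROCHOT/thermal bits of MSR 0x690 (layout is an assumption)."""
    return {
        "prochot_status": bool(value & (1 << 0)),
        "thermal_status": bool(value & (1 << 1)),
        "prochot_log": bool(value & (1 << 16)),
        "thermal_log": bool(value & (1 << 17)),
    }
```

On a suitable system, `decode_package_therm_status(read_msr(IA32_PACKAGE_THERM_STATUS))` would give a snapshot for the package containing CPU 0; polling it during the I/O test shows whether the status bits or only the log bits are set.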

If the digital temperature sensors in the IA32_PACKAGE_THERM_STATUS register show that you are actually reaching the PROCHOT temperature, then the obvious checks are:

  • Check the airflow path for obstructions.
  • Monitor the fans to make sure they are spinning up properly.
  • Check the heat-sink installation to make sure the heat sinks were installed properly and with proper thermal paste.
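
To judge how close the package actually is to the PROCHOT temperature, the digital readout can be converted to degrees.  The sketch below is an illustration only and rests on assumptions not stated in this thread: that bits 22:16 of IA32_PACKAGE_THERM_STATUS hold the margin in degrees C below the throttle temperature, and that bits 23:16 of MSR_TEMPERATURE_TARGET (MSR 0x1A2, a model-specific register) hold that throttle temperature.  Both layouts are from my reading of the SDM and should be verified for your processor; `package_temperature_c` is a hypothetical helper name.

```python
MSR_TEMPERATURE_TARGET = 0x1A2  # model-specific; assumed to be present on this CPU


def package_temperature_c(pkg_therm_status, temperature_target):
    """Return (temperature_c, margin_c) from two raw MSR values.

    margin_c is how many degrees C the package is below the throttle point.
    """
    margin_c = (pkg_therm_status >> 16) & 0x7F    # assumed: bits 22:16, digital readout
    throttle_c = (temperature_target >> 16) & 0xFF  # assumed: bits 23:16, temperature target
    return throttle_c - margin_c, margin_c
```

A margin that stays well above zero while PROCHOT is reported is a hint that the thermal solution is not the culprit.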

I had one system with a bad thermal solution that hit PROCHOT under fairly light loads.  I verified that the airflow path was OK, but did not get around to checking the fans or heat-sink installation -- I was having too much fun studying the behavior of the processor while it was throttled!  (This was one of several pre-production loaner systems from a vendor, so it was not critical to have it fully operational.)

We have also seen cases where the external PROCHOT signal was applied even when the processor reported temperatures well below PROCHOT.  These were tracked down to bugs in the software on the system management controllers and were eventually fixed.  The good news is that these were temporary glitches, and when the external PROCHOT signal was dropped, the processors returned to normal behavior.
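
The distinction between a genuinely hot package and an externally applied PROCHOT can be sketched as a simple heuristic over one raw sample of IA32_PACKAGE_THERM_STATUS.  This is only a rough illustration: `classify_prochot` is a hypothetical helper, the 10-degree threshold is an arbitrary assumption, and the digital-readout field (bits 22:16, degrees C below the throttle point) is from my reading of the SDM rather than from this thread.

```python
def classify_prochot(pkg_therm_status, min_margin_c=10):
    """Guess the PROCHOT source from one IA32_PACKAGE_THERM_STATUS sample."""
    thermal_now = bool(pkg_therm_status & (1 << 0))   # internal sensor reports hot
    prochot_now = bool(pkg_therm_status & (1 << 2))   # PROCHOT# currently asserted
    margin_c = (pkg_therm_status >> 16) & 0x7F        # assumed: degrees below throttle point

    if not prochot_now:
        return "no PROCHOT"
    if thermal_now:
        return "internal likely"
    if margin_c >= min_margin_c:
        # PROCHOT# is asserted although the package reports itself as cool.
        return "external suspected"
    return "inconclusive"
```

A run of "external suspected" samples would point toward the BMC or other platform logic driving the signal, as in the cases described above, rather than toward airflow or heat-sink problems.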

In the Xeon E5 processors, additional information may be available from the performance counters in the Power Control Unit of the Uncore.  These are described in the Uncore Performance Monitoring manuals for the various processor generations (v1, v2, v3, v4).  (I have not tried this myself -- most of our systems are kept very cool, so we see power-throttling frequently, but not thermal throttling.)

CyrIng
Novice

Hi John,

Sorry to wake this thread up, but I recently cleared the log bits in IA32_THERM_STATUS and IA32_PACKAGE_THERM_STATUS on Coffee Lake processors, and they froze immediately.

For example: "PROCHOT# or FORCEPR# Log (R/WC0)"

As mentioned by the SDM: "Software may clear this bit by writing a zero."

The SDM specifies those bits as "R/WC0", but I cannot find the meaning of this abbreviation, and I believe it might make a difference when writing zero to those bits from my privileged kernel driver.

Do you think I missed checking some CPUID leaf or other MSR before altering these log bits?

Or am I facing undocumented exceptions to these IA registers?

Thanks for any help,

Regards,

Cyril

CyrIng
Novice

It may sound obvious, but don't clear the IA32_THERM_STATUS log bits if the High-Temperature or Low-Temperature Interrupt enable bit in IA32_THERM_INTERRUPT is set.

A thermal interrupt handler might have been installed by the kernel or by SMI, and it would then claim control over those registers.
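
The check described above can be sketched as follows.  This is an illustration under assumptions, not a verified recipe: it takes IA32_THERM_INTERRUPT to be MSR 0x19B with the high- and low-temperature interrupt enables in bits 0 and 1, and it reads "R/WC0" as "read / write 0 to clear", so that writing 1 to a log bit leaves it unchanged.  The helper names and the choice of log bits 1, 3 and 5 are hypothetical; which log bits exist depends on the processor.

```python
IA32_THERM_INTERRUPT = 0x19B         # assumed address; verify in the SDM
HIGH_TEMP_INT_ENABLE = 1 << 0
LOW_TEMP_INT_ENABLE = 1 << 1

THERMAL_STATUS_LOG = 1 << 1          # IA32_THERM_STATUS log bits (assumed subset)
PROCHOT_LOG = 1 << 3                 # "PROCHOT# or FORCEPR# Log"
CRITICAL_TEMP_LOG = 1 << 5
CLEARABLE_LOG_BITS = THERMAL_STATUS_LOG | PROCHOT_LOG | CRITICAL_TEMP_LOG


def thermal_interrupts_enabled(therm_interrupt):
    """True if a high- or low-temperature interrupt is enabled, i.e. hands off."""
    return bool(therm_interrupt & (HIGH_TEMP_INT_ENABLE | LOW_TEMP_INT_ENABLE))


def value_to_clear(log_bits_to_clear):
    """Build the value to write: 0 in the bits to clear, 1 in the other log bits.

    Assumes write-0-to-clear semantics, so the 1s preserve the remaining logs.
    """
    return CLEARABLE_LOG_BITS & ~log_bits_to_clear
```

A driver following this sketch would read IA32_THERM_INTERRUPT first and skip the write entirely when `thermal_interrupts_enabled` returns True.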

Issue fixed in this thread.

Adrian_C_
Beginner

Thank you.  How would you check the MSRs on a Windows system?  

I saw this link.  http://faydoc.tripod.com/cpu/rdmsr.htm

"This instruction must be executed at privilege level 0 or in real-address mode"

How do we get into either of those modes?

McCalpinJohn
Honored Contributor III

I don't know what the magic infrastructure looks like on Windows.  You have to figure out how to run in the kernel -- the "real-address mode" comment is ancient history.

Windows has some debug facilities to access MSRs -- e.g., https://msdn.microsoft.com/en-us/library/windows/hardware/ff553516(v=vs.85).aspx

Intel has some code that allows their Performance Counter Monitor to access the MSRs on Windows systems -- have a look at https://software.intel.com/en-us/articles/intel-performance-counter-monitor

It looks like Intel has discontinued PCM in favor of an open source project at https://github.com/opcm/pcm.  This has some notes on Windows.  One of the directories in the GitHub repository is https://github.com/opcm/pcm/tree/master/PCM-Power_Win, which might already have what you need....

Roman_D_Intel
Employee

Hi,

The PCM-power utility can show PROCHOT metrics out of the box. You can download it here: https://github.com/opcm/pcm (and here are the binaries: https://ci.appveyor.com/project/opcm/pcm/build/artifacts )

Thanks,

Roman
