I have two S2600CO4 boards and neither of them reports processor thermal margin sensors.
P1 Therm Margin | na | degrees C | na | na | na | na | na | na | na
P2 Therm Margin | na | degrees C | na | na | na | na | na | na | na
P1 Therm Ctrl % | na | percent | na | na | na | na | 30.000 | 50.000 | na
P2 Therm Ctrl % | na | percent | na | na | na | na | 30.000 | 50.000 | na
Nor do they report DIMM thermal margins.
DIMM Thrm Mrgn 1 | na | degrees C | na | na | na | na | 5.000 | 10.000 | na
DIMM Thrm Mrgn 2 | na | degrees C | na | na | na | na | 5.000 | 10.000 | na
DIMM Thrm Mrgn 3 | na | degrees C | na | na | na | na | 5.000 | 10.000 | na
DIMM Thrm Mrgn 4 | na | degrees C | na | na | na | na | 5.000 | 10.000 | na
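For anyone wanting to check their own board the same way, here is a rough Python sketch that pulls out every sensor whose reading column is "na" in pipe-delimited "ipmitool sensor" output. It assumes the usual layout where the first field is the sensor name and the second is the reading; the function names are my own, and the sample rows are copied from the output above.

```python
import subprocess


def unavailable_sensors(sensor_output: str) -> list[str]:
    """Return the names of sensors whose reading column is 'na'."""
    names = []
    for line in sensor_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 3:
            continue  # not a sensor row
        name, reading = fields[0], fields[1]
        if reading == "na":
            names.append(name)
    return names


def read_live_sensors() -> str:
    """Run 'ipmitool sensor' locally; assumes ipmitool is installed and permitted."""
    result = subprocess.run(
        ["ipmitool", "sensor"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Rows taken from the output posted above.
sample = """\
P1 Therm Margin | na | degrees C | na | na | na | na | na | na | na
P2 Therm Margin | na | degrees C | na | na | na | na | na | na | na
P1 Therm Ctrl % | na | percent | na | na | na | na | 30.000 | 50.000 | na
P2 Therm Ctrl % | na | percent | na | na | na | na | 30.000 | 50.000 | na
"""

for sensor in unavailable_sensors(sample):
    print(sensor)
```

On a live system, passing read_live_sensors() instead of the sample string should give the same kind of list, though I have only reasoned through it against the pasted rows.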
The end result is that all the fans in the system run at full RPM. The system itself is fully functional otherwise. BIOS is at version 2.04.0003, BMC is at 1.22.6890, ME is at 02.01.07.328 and FRU/SDR is at 1.12. The update was done from the latest EFI package for the board. The system event log has entries noting the processor thermal margin sensor failures.
I have reflashed all the images individually. I have verified that the SDR table is getting updated -- I slightly changed the name of the processor thermal sensors and verified that the new name showed up in the RMM sensor table as well as in the output of "ipmitool sensor".
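The rename check described above can be scripted roughly like this: save the "ipmitool sensor" output before and after the FRU/SDR update, then compare the sets of sensor names. This is only a sketch. The helper names are mine, and the renamed sensor in the sample ("P1 Thrm Mgn Test") is a made-up placeholder, not the name I actually used.

```python
def sensor_names(sensor_output: str) -> set[str]:
    """Collect the first pipe-delimited field (the sensor name) of each row."""
    names = set()
    for line in sensor_output.splitlines():
        if "|" not in line:
            continue  # skip anything that is not a sensor row
        name = line.split("|", 1)[0].strip()
        if name:
            names.add(name)
    return names


def sdr_name_changes(before: str, after: str) -> tuple[list[str], list[str]]:
    """Return (names that disappeared, names that appeared) after an SDR update."""
    old, new = sensor_names(before), sensor_names(after)
    return sorted(old - new), sorted(new - old)


# 'before' uses rows from this thread; the renamed row in 'after' is hypothetical.
before = """\
P1 Therm Margin | na | degrees C | na | na | na | na | na | na | na
P2 Therm Margin | na | degrees C | na | na | na | na | na | na | na
"""
after = """\
P1 Thrm Mgn Test | na | degrees C | na | na | na | na | na | na | na
P2 Therm Margin | na | degrees C | na | na | na | na | na | na | na
"""

removed, added = sdr_name_changes(before, after)
print("removed:", removed)
print("added:", added)
```

If the new name shows up in the "added" list, the SDR records really are being rewritten, which is what I saw by eye.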
The system is in a non-Intel chassis. Fans are connected for system fan 1-3, rear fan, and cpu fan 1 and 2. When updating the FRU/SDR records, the update script properly identifies it as an "other" chassis and prompts me for the various fan connections.
I'm looking for suggestions on what I might be missing here.
The pasted output is from "ipmitool sensor". The pertinent section of the RMM sensor page is below.
P1 Status | reports the processor's presence has been detected | OK | 0x0080
P2 Status | reports the processor's presence has been detected | OK | 0x0080
P1 Therm Margin | All deasserted | Unknown | Not Available
P2 Therm Margin | All deasserted | Unknown | Not Available
P1 Therm Ctrl % | All deasserted | Unknown | Not Available
P2 Therm Ctrl % | All deasserted | Unknown | Not Available
P1 ERR2 | All deasserted | OK | 0x0000
P2 ERR2 | All deasserted | OK | 0x0000
CATERR | All deasserted | OK | 0x0000
P1 MSID Mismatch | All deasserted | OK | 0x0000
CPU Missing | All deasserted | OK | 0x0000
P1 DTS Therm Mgn | All deasserted | Unknown | Not Available
P2 DTS Therm Mgn | All deasserted | Unknown | Not Available
P2 MSID Mismatch | All deasserted | OK | 0x0000
P1 VRD Hot | All deasserted | OK | 0x0000
P2 VRD Hot | All deasserted | OK | 0x0000
P1 MEM01 VRD Hot | All deasserted | OK | 0x0000
P1 MEM23 VRD Hot | All deasserted | OK | 0x0000
P2 MEM01 VRD Hot | All deasserted | OK | 0x0000
P2 MEM23 VRD Hot | All deasserted | OK | 0x0000
DIMM Thrm Mrgn 1 | All deasserted | Unknown | ...
Apologies for the delay. I had to wait for a time when I could tear the machine down.
The CPUs are marked SR0L7. While I had the system apart, I put these two CPUs in an S2600GZ board to check whether they report processor thermal margin values there. They do, as expected. I also took the opportunity to try a pair of E5-2609s (SR0LA) in the S2600CO4 board. These processors have also been verified to report thermal margin values elsewhere. As with the others, in the S2600CO4 board the sensors report as unavailable.
This would certainly point the finger at the S2600CO board as having something misconfigured, or wrong with it.
You could try reflashing the complete firmware stack, especially the ME, BMC and SDRs.
I would also restore the BMC defaults, which can be done with the syscfg -rbfd (I think) command. (You may need to run syscfg /? to get the help and find the command that restores the BMC defaults.)
I would not give this very high odds of working, as it is more likely a damaged CPU pin or damage on the motherboard.
I am fairly certain it is not damage per se, as I have two boards that behave exactly the same way. However, it might be the boards. Prompted by your damage comment, I was looking at the second board in detail, just giving it a good looking over. Turns out it is an engineering sample board. Turns out both boards are. Now, I wouldn't normally expect that to be the cause. In the past, the engineering sample equipment we have gotten from Intel has been fully functional, if not at its final hardware rev. Usually it just means that we got it before it had completed certifications. I suppose that these boards could have not been fully functional yet, or that the ME connection to the processors could have been changed slightly such that release firmware expects things to be different. If that is the case, it will be disappointing. I hate to trash a couple of otherwise functional boards.
I will give the BMC reset a try, just to cover all the bases.
Engineering samples are meant for OEMs (Original Equipment Manufacturers), and Intel provides these for testing purposes only. They may lack features that the production units will have. We strongly recommend returning these to the place of purchase or your Intel representative and requesting production units instead.
Well, that would require returning them directly to Intel as, at the time we purchased these boards, we were an Intel OEM. In cleaning up recently, these boards were discovered. We have a number of other engineering sample systems we use for various purposes in our hardware lab and it was decided to see if these boards could be put to use. It seems the answer is "sort of" as they appear to be fully functional, other than the broken CPU thermal sensors. While I expected some features to be missing or to not work, something as basic as CPU thermal sensors wasn't expected. They will just have to be used where the extra fan noise is not an issue. Not ideal, but not worth putting much effort into either.