When the server is idle, all of its temperatures are below 45 C. Four fans are throttled down to 800 RPM, but fan 3 is stuck at 8500 RPM and is very loud. I've already updated BIOS/ME/FRU/SDR, and the script correctly detects the chassis (redundant, non-HSBP), but the middle fan remains stuck. The PWM offset is set to 0, the fan profile is Acoustic, and the CPU Power and Performance Policy is Balanced Power.
Is there anything I can do to improve this?
You have covered most of the troubleshooting tips we can recommend. You may want to enable the Clear Event Logs option in BIOS > Server Management to refresh the BMC sensors.
Additionally, make sure you perform a full power cycle (by disconnecting power cords for about 30 seconds) to see if this helps.
If possible, please include the exact model of the chassis you are using. I would also recommend swapping SysFan3 into a different location to see whether the issue follows the fan or the header on the board.
Thanks! I had already power cycled the machine many times. I have now tried Clear Event Logs, and it didn't help.
The issue follows the header (I swapped fans 3 and 4, and fan 3 remains at 8000 RPM). Based on the Intel® Server Board S2600CW Family chassis compatibility list (http://www.intel.com/support/motherboards/server/s2600cw/sb/CS-034913.htm), the chassis should be a P4304XXMUXX. The server is a demo system from Intel, so it does not show the chassis model name in the DMI data.
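The swap test above can be written down as a small decision rule. This is only a sketch in Python: the function name, the header labels and the 5000 RPM threshold are made up for illustration, and the readings are the ones quoted in this thread.

```python
# Hypothetical threshold separating a throttled fan from one stuck at high speed.
HIGH_RPM = 5000


def diagnose_swap(before, after, swapped):
    """Decide whether a fault follows the fan or the header.

    before / after: dict mapping header name -> RPM read on that header,
    taken before and after the two fans were physically exchanged.
    swapped: the two header names whose fans were exchanged.
    """
    stuck_before = [h for h in swapped if before[h] > HIGH_RPM]
    stuck_after = [h for h in swapped if after[h] > HIGH_RPM]
    # The test is only meaningful if exactly one header is fast each time.
    if len(stuck_before) != 1 or len(stuck_after) != 1:
        return "inconclusive"
    if stuck_before == stuck_after:
        return "follows the header"
    return "follows the fan"


# Readings reported in this thread (header labels are illustrative).
before = {"fan3": 8500, "fan4": 800}
after = {"fan3": 8000, "fan4": 800}
print(diagnose_swap(before, after, ("fan3", "fan4")))
```

With the readings from this thread, the fast reading stays on the same header after the swap, which points at the board rather than the fan.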
The server board fans are controlled by the SDRs (Sensor Data Records).
They are part of the flash update package found here: https://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=24375
When using an Intel chassis, the flash utility probes the hardware, reading things like the front panel, PSUs and HSBP type, to determine which specific chassis you have so it can load the correct values. The utility does assume the fans are connected to the motherboard correctly.
Since different customers use different hardware configurations, the FRUSDR package needs to be loaded when you assemble a new server.
The first thing I would recommend is to download the firmware package and flash the system. That fixes about 90% of all fan issues, assuming the fans are connected correctly and working.
Second, read the SEL rather than clearing it. The SELview tool will display the SEL in the OS or EFI. Get it here: https://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=24719
Alternatively, you can read the SEL in the BMC's embedded web server. This requires the BMC to be enabled in BIOS setup with a user ID and password, a network connection, and a second system with a web browser to log into the CW system.
Third, the CW system uses 5 cooling domains, which happen to match the system fans.
The fan domain structure allows the BMC to ramp only the specific fans needed to cool a specific area.
Fan 3 / domain 3 is the central domain, along with fan 2 and fan 4.
All 3 of these fans should behave the same.
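As a rough illustration of that expectation, here is a Python sketch that flags a fan whose speed is far from the rest of its domain. The fan labels, the 25% tolerance and the function name are assumptions for the example, not anything the BMC exposes. The RPM values are the ones reported at the top of the thread.

```python
from statistics import median


def domain_outliers(rpm_by_fan, domain, tolerance=0.25):
    """Return the fans in `domain` whose RPM differs from the domain
    median by more than `tolerance` (a fraction of the median)."""
    mid = median(rpm_by_fan[fan] for fan in domain)
    return [fan for fan in domain
            if abs(rpm_by_fan[fan] - mid) > tolerance * mid]


# Idle readings as described in the first post (labels are illustrative).
readings = {"fan1": 800, "fan2": 800, "fan3": 8500, "fan4": 800, "fan5": 800}
central_domain = ["fan2", "fan3", "fan4"]
print(domain_outliers(readings, central_domain))
```

For the central domain described above, only fan 3 is flagged, since the other two fans sit at the domain median.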
I just opened the FW update files and saw something in the S2600CW.SDR that does not look right.
Almost at the end of the file is an entry for P4000 Redundant Fan SKU Chassis, Domain 2, Domain 3, Domain 4, All Profiles,
which is incorrectly formatted, and since Domain 3 is where you are having the issue, it is pretty suspect.
We just need to get this fixed!
1st step done already. 2nd step does not show anything incorrect.
3rd step... Great, if there's a new .SDR file that I can test on the machine, I can do that! Note that I'm having a problem with domain 2 (fan 3).
I took a look and noticed that the .SDR file version 1.05 has LF line endings instead of CR+LF in that section and in another one. I changed it to CR+LF and re-updated the FRU/SDR, but it didn't fix the problem.
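For anyone wanting to repeat the line-ending check, a minimal Python sketch follows. It works on raw bytes. The sample text is a placeholder and not real SDR content, and both function names are invented for the example.

```python
def bare_lf_lines(data: bytes):
    """Return 1-based numbers of lines that end in LF without a preceding CR."""
    lines = data.split(b"\n")
    # The last element is whatever follows the final LF, so it has no LF ending.
    return [number for number, line in enumerate(lines[:-1], start=1)
            if not line.endswith(b"\r")]


def to_crlf(data: bytes) -> bytes:
    """Normalize every line ending to CR+LF, leaving existing CR+LF intact."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


# Placeholder content with one bare-LF line in the middle.
sample = b"first line\r\nsecond line\nthird line\r\n"
print(bare_lf_lines(sample))
print(bare_lf_lines(to_crlf(sample)))
```

To check a real file, read it in binary mode (`open(path, "rb").read()`) so Python does not translate the line endings before the check runs.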
I found a few problems in the .SDR file comments. For example, non-redundant SKUs have sensor C8h (Agg Thrm Mgn 1), while redundant SKUs have sensor C9h (Agg Thrm Mgn 2). However, the comments always mention Aggregate Thermal Margin 1 even when the record (correctly) refers to sensor C9h. But everything I found was only in the comments.
That should have fixed it. In fact, it should not be an issue except to us humans who like CR to keep things neat.
There is a new System update package you can try: https://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=24732
If that does not fix it, run the system info tool (https://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=24718)
and post the results so we can see what is getting loaded for SDRs.
The file looks great, except for fan 3 running at max speed.
Fans 3, 4 and 5 have a common thermal driver, so all these fans should ramp together.
It pretty much comes down to hardware (fan driver) or maybe something really odd in the BMC.
You could try a BMC restore defaults, either with the button in the web browser or with the syscfg tool: syscfg -rbfd (I think; you may need to check syscfg -?).
Odds are high that it is a problem on the motherboard.
Yeah, at some point it looks very much like a stuck PWM controller, if something like that can exist at all. I've already done a BMC restore using a jumper on the motherboard.
I'll engage customer support to have the motherboard replaced, thanks!