Random fan spin-up on R1304SPOSHBNR, take 2

Continuing from my earlier post at

It has been a while, but I now have an additional unit, this time with Intel-certified memory, that shows the same symptoms: at random intervals ranging from 15 minutes to a few hours, the fans spin up to full speed for a few seconds before returning to normal speed. I narrowed the cause down to the P1 Therm Margin sensor, which occasionally becomes unreadable for a moment. The attached video shows this in action.
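One way to put numbers on "occasionally unreadable" is to poll the sensor at a fixed interval (for example, by running `ipmitool sdr get "P1 Therm Margin"` once per second from a shell loop) and log each sample as a timestamp plus either a reading or a "no reading" marker. The sketch below assumes such a log already exists as a list of `(seconds, value)` pairs, with `None` standing for an unreadable sample; parsing the `ipmitool` text is left out because its exact output can differ between versions. `find_dropouts` is a hypothetical helper, not part of any Intel or ipmitool tool.

```python
from typing import List, Optional, Tuple

# One poll of the sensor: (seconds since logging started, reading in degrees C).
# None marks a poll where the sensor returned no reading (assumed log format).
Sample = Tuple[float, Optional[float]]


def find_dropouts(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Return (start, end) for each run of unreadable samples.

    start is the time of the first unreadable poll; end is the time of the
    first readable poll after it, or the last timestamp if the log ends
    while the sensor is still unreadable.
    """
    windows: List[Tuple[float, float]] = []
    start: Optional[float] = None
    for t, value in samples:
        if value is None and start is None:
            start = t  # sensor just became unreadable
        elif value is not None and start is not None:
            windows.append((start, t))  # sensor readable again
            start = None
    if start is not None:
        windows.append((start, samples[-1][0]))
    return windows


if __name__ == "__main__":
    # Hypothetical one-second log; the readings are placeholder values.
    log = [(0, -62.0), (1, -62.0), (2, None), (3, None), (4, -61.0), (5, -62.0)]
    print(find_dropouts(log))
```

With one-second polling, a window such as `(2, 4)` means the first missing sample was at t = 2 s and the sensor was readable again at t = 4 s. Lining those windows up against the moments the fans were seen or heard ramping up would show whether every spin-up coincides with a dropout.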

These servers are all running CentOS Linux.

# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.
64 structures occupying 3771 bytes.
Table at 0x80630000.

Handle 0x0001, DMI type 221, 12 bytes
OEM-specific Type
    Header and Data:
        DD 0C 01 00 01 01 00 04 01 00 08 00
    Strings:
        Reference Code - ACPI

Handle 0x0002, DMI type 133, 12 bytes
OEM-specific Type
    Header and Data:
        85 0C 02 00 00 50 F9 81 00 40 00 00

Handle 0x0006, DMI type 0, 24 bytes
BIOS Information
    Vendor: Intel Corporation
    Version: S1200SP.86B.03.01.0026.092720170729
    Release Date: 09/27/2017
    Address: 0xF0000
    Runtime Size: 64 kB
    ROM Size: 16384 kB
    Characteristics:
        PCI is supported
        PNP is supported
        BIOS is upgradeable
        BIOS shadowing is allowed
        Boot from CD is supported
        Selectable boot is supported
        EDD is supported
        5.25"/1.2 MB floppy services are supported (int 13h)
        3.5"/720 kB floppy services are supported (int 13h)
        3.5"/2.88 MB floppy services are supported (int 13h)
        Print screen service is supported (int 5h)
        8042 keyboard services are supported (int 9h)
        Serial services are supported (int 14h)
        Printer services are supported (int 17h)
        CGA/mono video services are supported (int 10h)
        ACPI is supported
        USB legacy is supported
        LS-120 boot is supported
        ATAPI Zip drive boot is supported
        BIOS boot specification is supported
        Function key-initiated network boot is supported
        Targeted content distribution is supported
        UEFI is supported
    BIOS Revision: 0.0
    Firmware Revision: 0.0

Handle 0x0007, DMI type 1, 27 bytes
System Information
    Manufacturer: Intel Corporation
    Product Name: S1200SP
    Version: R1304SPOSHBNR
    Serial Number: QSCD80400069
    UUID: C81BB18B-33EF-E711-AB21-A4BF012C036E
    Wake-up Type: Power Switch
    SKU Number: SKU Number
    Family: Family

Handle 0x0008, DMI type 2, 17 bytes
Base Board Information
    Manufacturer: Intel Corporation
    Product Name: S1200SP
    Version: H57534-270
    Serial Number: QSSA80100236
    Asset Tag: Base Board Asset Tag
    Features:
        Board is a hosting board
        Board is replaceable
    Location In Chassis: Part Component
    Chassis Handle: 0x0000
    Type: Motherboard
    Contained Object Handles: 0

Handle 0x0009, DMI type 3, 24 bytes
Chassis Information
    Manufacturer: ...............................
    Type: Rack Mount Chassis
    Lock: Not Present
    Version: ..................
    Serial Number: ..................
    Asset Tag: ....................
    Boot-up State: Safe
    Power Supply State: Safe
    Thermal State: Safe
    Security Status: None
    OEM Information: 0x00000000
    Height: Unspecified
    Number Of Power Cords: Unspecified
    Contained Elements: 0
    SKU Number: Not Specified

Handle 0x000F, DMI type 11, 5 bytes
OEM Strings
    String 1: To Be Filled By O.E.M.

Handle 0x0011, DMI type 13, 22 bytes
BIOS Language Information
    Language Description Format: Long
    Installable Languages: 1
        en|US|iso8859-1
    Currently Installed Language: en|US|iso8859-1

Handle 0x0012, DMI type 27, 15 bytes
Cooling Device
    Temperature Probe Handle: 0x000B
    Type: Fan
    Status: OK
    Cooling Unit Group: 1
    OEM-specific Information: 0x00000000
    Nominal Speed: Unknown Or Non-rotating
    Description: Not Specified

Handle 0x0013, DMI type 28, 22 bytes
Temperature Probe
    Description: LM78A
    Location: System Management Module
    Status:
    Maximum Value: Unknown
    Minimum Value: Unknown
    Resolution: Unknown
    Tolerance: Unknown
    Accuracy: Unknown
    OEM-specific Information: 0x00000000
    Nominal Value: Unknown

Handle 0x0014, DMI type 32, 11 bytes
System Boot Information
    Status: No errors detected

Handle 0x0015, DMI type 34, 11 bytes
Management Device
    Description: UNKNOWN
    Type: Unknown
    Address: 0x00000000
    Address Type: Unknown

Handle 0x0016, DMI type 35, 11 bytes
Management Device Component
    Description: To Be Filled By O.E.M.
    Management Device Handle: 0x000D
    Component Handle: 0x000A
    Threshold Handle: 0x000F

Handle 0x0017, DMI type 36, 16 bytes
Management Device Threshold Data

Handle 0x0018, DMI type 39, 22 bytes
System Power Supply
    Power Unit Group: 1
    Location: To Be Filled By O.E.M.
    Name: To Be Filled By O.E.M.
    Manufacturer: To Be Filled By O.E.M.
    Serial Number: To Be Filled By O.E.M.
    Asset Tag: To Be Filled By O.E.M.
    Model Part Number: To Be Filled By O.E.M.
    Revision: To Be Filled By O.E.M.
    Max Power Cap...

Community Manager

Hello JohnMcC,

I have read this thread and your previous one, and I would like to ask for some details about the server system you are using so that we can find the root cause of the issue. Please provide the following information:

A) When did the issue start? How many server systems are affected by the same issue?

B) Did you recently change anything in the hardware or apply any update that could affect the system?

C) Are you using ECC memory or unbuffered memory?

D) Did you change the location of the fan header?

E) Have you tried resetting the BIOS to its default settings?

The server board provides five SSI-compliant 4-pin fan headers for CPU and I/O cooling fans. 3-pin fans are supported on all fan headers. The pin configuration is identical for each of the 4-pin fan headers, and the headers are designated as follows:

  • One 4-pin fan header is designated as the processor cooling fan: CPU fan (J7K1)
  • Three 4-pin fan headers are designated as system fans: System fan 1 (J3K2), System fan 2 (J8K2), System fan 3 (J8K3)
  • One 4-pin fan header is designated as a rear system fan: System fan 4 (J8B1)

I will wait for your reply before proceeding with the next step.

Best regards,

Emeth X

Beginner

Thank you for the reply, and apologies for my slow response. Things have been busy here.

A&B) The issue started as soon as the system was completed. This is a new build, and even with the latest firmware it has had this problem out of the gate. The three other units mentioned in the previous thread have also had this problem from the beginning. As such, no changes to the hardware have been made. Note also that these four units are the only ones we have, so I have no unit that does not exhibit this problem.

C) The memory installed in this unit is two Kingston KVR24E17D8/16I DDR4 unbuffered ECC modules (Intel certified).

D) The fan headers have not been changed. As you can see in the video, system fans 1 through 3 all show proper function. They only spin up when P1 Therm Margin becomes unresponsive.

E) The BIOS was reset to defaults after flashing to the latest version and then set to the required UEFI settings for CentOS.

These are all 1U systems, so they have no dedicated CPU fan. Here are the current temperature readings for the processor:

# sensors
acpitz-virtual-0
Adapter: Virtual device
temp1: +27.8°C (crit = +119.0°C)
temp2: +29.8°C (crit = +119.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +38.0°C (high = +80.0°C, crit = +100.0°C)
Core 0: +37.0°C (high = +80.0°C, crit = +100.0°C)
Core 1: +38.0°C (high = +80.0°C, crit = +100.0°C)
Core 2: +37.0°C (high = +80.0°C, crit = +100.0°C)
Core 3: +36.0°C (high = +80.0°C, crit = +100.0°C)

power_meter-acpi-0
Adapter: ACPI interface
power1: 27.00 W (interval = 1.00 s)

Since the fans only spin up to maximum for about five seconds, I doubt that the processor temperature is spiking to dangerous levels and then falling back to normal that quickly. In any event, the system load on these servers is very steady.
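The temperature argument can be checked mechanically from `sensors` output. The sketch below assumes the text layout shown in the output above (one `label: +NN.N°C (high = ..., crit = ...)` line per coretemp entry, with a UTF-8 degree sign) and computes how far each reading sits below its `high` and `crit` limits. Lines that carry only a `crit` value, such as the `acpitz` entries, are skipped. `headroom` is a hypothetical helper, and the format assumption may not hold for every lm-sensors version or locale.

```python
import re
from typing import Dict, Tuple

# Matches coretemp lines such as:
#   Core 0: +37.0°C (high = +80.0°C, crit = +100.0°C)
# Assumes the plain lm-sensors text layout shown in this thread.
_NUM = r"[+-]?\d+(?:\.\d+)?"
LINE_RE = re.compile(
    rf"^(?P<label>[^:]+):\s*(?P<temp>{_NUM})°C\s*"
    rf"\(high = (?P<high>{_NUM})°C, crit = (?P<crit>{_NUM})°C\)"
)


def headroom(sensors_text: str) -> Dict[str, Tuple[float, float]]:
    """Map each sensor label to (degrees below 'high', degrees below 'crit')."""
    result: Dict[str, Tuple[float, float]] = {}
    for line in sensors_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # chip names, adapter lines, crit-only or non-temperature lines
        temp = float(m["temp"])
        result[m["label"].strip()] = (float(m["high"]) - temp, float(m["crit"]) - temp)
    return result


if __name__ == "__main__":
    sample = """coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +38.0°C (high = +80.0°C, crit = +100.0°C)
Core 0: +37.0°C (high = +80.0°C, crit = +100.0°C)"""
    for label, (to_high, to_crit) in headroom(sample).items():
        print(f"{label}: {to_high:.1f} °C below high, {to_crit:.1f} °C below crit")
```

On the readings posted here, the hottest value is 38.0 °C, so every entry is at least 42 °C below its `high` threshold, which supports the view that a genuine thermal excursion is an unlikely trigger for a five-second spin-up.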

As the video shows, the fans coast at a normal ~4800 RPM until the P1 Therm Margin sensor drops out, at which point they spin up to ~22000 RPM. Once the P1 Therm Margin reading is re-established, the fans begin to return to their normal coasting speed. The question is what would cause P1 Therm Margin to be unresponsive for a second or two at random intervals of 15 minutes to a few hours. Since CentOS has no direct control over the fan speeds, the BMC must be what is spinning the fans up. In other words, the BMC loses contact with that sensor and spins the fans up in response, so this is likely a problem outside the OS.
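The reasoning above makes a testable prediction: every fan ramp should begin while P1 Therm Margin is unreadable, or within a few seconds after it drops out. The sketch below checks that against two logs, assuming fan speed has been sampled as `(seconds, rpm)` pairs and the sensor dropouts have already been reduced to `(start, end)` windows. The 10000 RPM threshold (chosen to sit between the ~4800 RPM and ~22000 RPM speeds reported here) and the 3-second lag allowance are illustrative assumptions, and `spike_starts` and `unexplained_spikes` are hypothetical helpers.

```python
from typing import List, Tuple


def spike_starts(fan: List[Tuple[float, float]], threshold_rpm: float) -> List[float]:
    """Times at which the fan speed first rises above threshold_rpm."""
    starts: List[float] = []
    above = False
    for t, rpm in fan:
        now_above = rpm > threshold_rpm
        if now_above and not above:
            starts.append(t)  # rising edge: a new spin-up begins here
        above = now_above
    return starts


def unexplained_spikes(
    spikes: List[float],
    dropouts: List[Tuple[float, float]],
    max_lag_s: float = 3.0,
) -> List[float]:
    """Spin-ups that did not begin during a dropout or within max_lag_s of its start."""
    return [
        s
        for s in spikes
        if not any(start <= s <= max(end, start + max_lag_s) for start, end in dropouts)
    ]


if __name__ == "__main__":
    # Made-up example data: one spin-up during a dropout, one with no dropout.
    fan = [(0, 4800), (2, 4800), (3, 15000), (4, 22000), (6, 9000), (7, 4800),
           (100, 22000), (101, 4800)]
    dropouts = [(2.0, 4.0)]
    spikes = spike_starts(fan, threshold_rpm=10000)
    print(spikes, unexplained_spikes(spikes, dropouts))
```

An empty result from `unexplained_spikes` over a long capture would support the sensor-dropout explanation; any remaining entries would point to some other trigger.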

Thanks again for looking into this.

Community Manager

Hello JohnMcC,

No worries, thank you so much for replying.

I was reviewing the information in your previous community thread as well as the information provided in this one, and something caught my attention.

I would like to continue asking some probing questions to determine the root cause of the issue, since you mentioned that it is happening not only on this server system but on three others as well. In most cases, this issue is fixed by updating the BIOS/firmware. However, you have already done that and the issue still persists, so here are some troubleshooting steps to confirm the details and rule out unrelated causes.

A) Could you please try replacing the redundant power supplies?

(In some cases, the fans can stay at high speed if redundant power is lost.)

B) Have you checked whether your server system has BMC version 1.12.11072?

C) Did you perform the BIOS update using the "Intel® Server Board S1200SP [Intel® Xeon® Processor E3-1200 v6 only] BIOS and Firmware Update Package for EFI", version 03.01.1029?

(https://downloadcenter.intel.com/download/27520/?product=97952)
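To answer question C, the installed BIOS version can be read from the `Version` field of the dmidecode output in the first post (S1200SP.86B.03.01.0026.092720170729). The sketch below splits that ID on dots, assuming the layout board, OEM code, major, minor, build, then an MMDDYYYYhhmm build stamp (consistent with the 09/27/2017 release date dmidecode reports), and compares the version fields with the package version 03.01.1029. Whether the package number maps one-to-one onto the BIOS ID's major.minor.build fields is an assumption; the package release notes are the authoritative source. `parse_bios_id` and `parse_package_version` are hypothetical helpers.

```python
from datetime import datetime
from typing import NamedTuple, Tuple


class BiosId(NamedTuple):
    board: str
    oem: str
    version: Tuple[int, int, int]  # (major, minor, build)
    built: datetime


def parse_bios_id(bios_id: str) -> BiosId:
    """Split an ID like S1200SP.86B.03.01.0026.092720170729 (layout assumed)."""
    board, oem, major, minor, build, stamp = bios_id.strip().split(".")
    return BiosId(
        board=board,
        oem=oem,
        version=(int(major), int(minor), int(build)),
        built=datetime.strptime(stamp, "%m%d%Y%H%M"),  # assumed MMDDYYYYhhmm
    )


def parse_package_version(version: str) -> Tuple[int, int, int]:
    """Turn '03.01.1029' into (3, 1, 1029) so versions compare numerically."""
    major, minor, build = (int(part) for part in version.strip().split("."))
    return (major, minor, build)


if __name__ == "__main__":
    installed = parse_bios_id("S1200SP.86B.03.01.0026.092720170729")
    package = parse_package_version("03.01.1029")
    print(installed.version, installed.built.date(), installed.version < package)
```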

 

 

D) Have you performed a BMC force update using the BMC Force Update jumper (J4B1)?

*The location of jumper J4B1 is shown in the attached picture.*

The BMC Force Update jumper is used to put the BMC in Boot Recovery mode for a low-level update.

It is used when the BMC has become corrupted and is non-functional, requiring a new BMC image to be loaded onto the server board.

1. Turn off the system and remove the power cords.

2. Move the BMC FRC UPDT jumper from the default operating position (pins 1 and 2) to the Force Update position (pins 2 and 3).

3. Re-attach system power cords.

4. Power on the system.

Note: The system fans will boost, and the BIOS Error Manager should report error code 84F3 (Baseboard Management Controller in update mode).

5. Boot to the EFI shell and update the BMC firmware using BMC#.NSH (where # is the version number of the BMC). (https://downloadcenter.intel.com/download/27520/?product=97952)

6. When the update has completed successfully, power off the system.

7. Remove the AC power cords.

8. Move the BMC FRC UPDT jumper back to the default position (pins 1 and 2).

9. Install the AC power cords.

10. Power on the system.

11. Boot to the EFI shell and update the FRU and SDR data using FRUSDR#.nsh (where # is the version number of the FRUSDR package).

12. Reboot the system.

13. Configure desired BMC configuration settings.
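The ordering constraints in the steps above (power off before removing AC, move the jumper only with AC power removed, flash the BMC only while the jumper is in the Force Update position, run the FRU/SDR update after the jumper is back in the default position, and leave it there) can be captured as a small checklist validator. This is an illustrative sketch only: step names such as `jumper_force` and `flash_bmc` are hypothetical labels for the steps above, and nothing here talks to the hardware.

```python
from typing import List

DEFAULT = "pins 1-2"       # normal operating position
FORCE_UPDATE = "pins 2-3"  # BMC Force Update position


def check_procedure(steps: List[str]) -> List[str]:
    """Return ordering problems found in a list of hypothetical step names."""
    ac_attached, powered_on, jumper = True, True, DEFAULT  # assume system running
    problems: List[str] = []
    for i, step in enumerate(steps, start=1):
        if step == "power_off":
            powered_on = False
        elif step == "remove_ac":
            if powered_on:
                problems.append(f"step {i}: AC removed while system is on")
            ac_attached = False
        elif step == "attach_ac":
            ac_attached = True
        elif step == "power_on":
            if not ac_attached:
                problems.append(f"step {i}: power on without AC attached")
            else:
                powered_on = True
        elif step in ("jumper_force", "jumper_default"):
            if ac_attached:
                problems.append(f"step {i}: jumper moved with AC attached")
            jumper = FORCE_UPDATE if step == "jumper_force" else DEFAULT
        elif step == "flash_bmc":
            if not powered_on or jumper != FORCE_UPDATE:
                problems.append(f"step {i}: BMC flash needs power on and jumper in Force Update")
        elif step == "flash_frusdr":
            if not powered_on or jumper != DEFAULT:
                problems.append(f"step {i}: FRU/SDR update needs power on and jumper in default")
        else:
            problems.append(f"step {i}: unknown step {step!r}")
    if jumper != DEFAULT:
        problems.append("jumper left in the Force Update position")
    return problems


if __name__ == "__main__":
    documented = ["power_off", "remove_ac", "jumper_force", "attach_ac", "power_on",
                  "flash_bmc", "power_off", "remove_ac", "jumper_default",
                  "attach_ac", "power_on", "flash_frusdr"]
    print(check_procedure(documented) or "order matches the documented steps")
```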

 

 

I will wait for the outcome of this before proceeding with the next step, if one is necessary.

Best Regards,

Emeth X

Community Manager

Hello JohnMcC,

I would like to know whether you need more assistance with this case so that we can find the best solution for you. I look forward to your reply; please do not hesitate to respond, and I will be more than happy to assist you!

Best Regards,

Emeth X

Beginner

Hi Emeth,

I have confirmed that the BMC software is only at version 1.10.10925. I will attempt an update, but it may take some time as I am heading out on vacation for the next week, and I do not have ready access to the server at this time. I will update you when I have performed the update and checked to see if the situation has been resolved.
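The gap between the installed BMC firmware (1.10.10925) and the version asked about earlier in the thread (1.12.11072) is easy to confirm, but dotted versions should be compared field by field as integers rather than as strings, because a plain string comparison ranks "1.9" above "1.10". The sketch below does that; `needs_update` is a hypothetical helper and assumes purely numeric, dot-separated version strings.

```python
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """'1.10.10925' -> (1, 10, 10925); assumes numeric dot-separated fields."""
    return tuple(int(part) for part in version.strip().split("."))


def needs_update(installed: str, recommended: str) -> bool:
    """True if the installed version is numerically older than the recommended one."""
    return version_tuple(installed) < version_tuple(recommended)


if __name__ == "__main__":
    print(needs_update("1.10.10925", "1.12.11072"))
```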

Thanks!

Community Manager

Hello JohnMcC,

No problem, take your time. If you have any other questions, please do not hesitate to let me know, and I will be more than happy to help you. I hope the update fixes the issue.

Have a wonderful day,

Best Regards,

Emeth X

Community Manager

Hello JohnMcC,

I would like to confirm some details. If possible, please provide the sysinfo log from your server system; it would help us a lot in identifying more information about this issue.

Also, please tell us the status of the motherboard LED when the fans are running at full speed. Does it change to another state? If so, when the fans ramp down, does the status LED go back to solid green?

Best Regards,

Emeth X

Community Manager

Hello JohnMcC,

I would like to check whether you need more assistance with this case. If so, please do not hesitate to let me know, and I will be more than happy to assist you.

Best Regards,

Emeth X

Beginner

Hi Emeth,

Just to update you, I have not had the chance to update the firmware yet. The server was taken away from me, and now I have to coordinate the update with a few people. I'll let you know once I have the update in place.

Thank you for keeping tabs on this.

John

Community Manager

Hello,

No problem. If you have any other questions, just let us know and we will be more than happy to assist you.

Best Regards,

Emeth X.

Community Manager

Hello,

Is there any update on this case?

I would like to know whether the issue still persists or whether everything is fine now.

Best Regards,

Emeth X
