GMigu1
Beginner
4,816 Views

S2600CO random reboot

Dear all, I have an S2600CO board in a server with 2 Xeon CPUs and 196GB of RAM.

This machine is used as a calculation node in a cluster environment, alongside other machines that have almost the same configuration.

A few days ago it started to reboot for no apparent reason.

To try to identify the problem, I checked all DIMM slots and all the memory modules, looking for one with errors. I tested all of them but could not find an error.

Then I checked the SEL logs:

1 | 09/17/2017 | 16:38:10 | Event Logging Disabled # 0x07 | Log area reset/cleared | Asserted

2 | 09/17/2017 | 17:16:55 | Power Unit # 0x01 | Failure detected | Asserted

3 | 09/17/2017 | 17:16:56 | Power Unit # 0x01 | Power off/down | Asserted

4 | 09/17/2017 | 17:17:01 | Power Unit # 0x01 | Power off/down | Deasserted

5 | 09/17/2017 | 17:17:01 | Power Unit # 0x01 | Failure detected | Deasserted

6 | 09/17/2017 | 17:17:02 | Power Unit # 0x01 | Power off/down | Asserted

7 | 09/17/2017 | 17:17:07 | Power Unit # 0x01 | Power off/down | Deasserted

8 | 09/17/2017 | 17:17:13 | Fan # 0x32 | Lower Non-critical going low | Deasserted

9 | 09/17/2017 | 17:17:13 | Fan # 0x32 | Lower Critical going low | Deasserted

a | 09/17/2017 | 17:17:13 | Fan # 0x32 | Lower Non-critical going low | Deasserted

b | 09/17/2017 | 17:17:13 | Fan # 0x32 | Lower Critical going low | Deasserted

c | 09/17/2017 | 17:17:24 | Fan # 0x32 | Lower Non-critical going low | Asserted

d | 09/17/2017 | 17:17:24 | Fan # 0x32 | Lower Critical going low | Asserted

e | 09/17/2017 | 17:17:31 | System Event # 0x83 | Timestamp Clock Sync | Asserted

f | 09/17/2017 | 17:17:32 | System Event # 0x83 | Timestamp Clock Sync | Asserted

10 | 09/17/2017 | 17:17:55 | System Event # 0x83 | OEM System boot event | Asserted

and on the BMC web console:

30 | 09/17/2017 17:39:32 | Pwr Unit Status | Power Unit | reports the power unit is powered off or being powered down - Asserted

29 | 09/17/2017 17:37:19 | BIOS Evt Sensor | System Event | reports OEM System Boot Event - Asserted

28 | 09/17/2017 17:36:56 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. - Asserted

27 | 09/17/2017 17:36:56 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. - Asserted

26 | 09/17/2017 17:36:49 | System Fan 3 | Fan | reports the sensor is in a low, critical, and going lower state - Asserted

25 | 09/17/2017 17:36:49 | System Fan 3 | Fan | reports the sensor is in a low, but non-critical, and going lower state - Asserted

24 | 09/17/2017 17:36:36 | System Fan 3 | Fan | reports the sensor is in a low, critical, and going lower state - Deasserted

23 | 09/17/2017 17:36:36 | System Fan 3 | Fan | ...
36 Replies
idata
Community Manager
235 Views

Hi guilhermefsmiguel,

I am Mike and it is a pleasure to assist you.

The Intel® Server Board S2600COE (G29920-205) is rebooting randomly, the SEL utility is showing power supply #1 as faulty, and the new power supply did not solve the issue.

I noticed that your system is running FRU 1.0, which was released in 2012, so my first recommendation is to update the BIOS-firmware of the server. Before doing so, run our Intel® System Support Utility and send us the results; based on the current BIOS-firmware version, I will let you know which BIOS version you need to use for the update. If we jump straight to the latest version, 02.06.0006 (7/20/2017), the server might stop working properly.

https://downloadcenter.intel.com/product/91600/Intel-System-Support-Utility Downloads for Intel® System Support Utility (Windows* and Linux*)

I will be waiting for your results for further assistance.

Regards,

Mike C

GMigu1
Beginner
235 Views

Hi Mike, thank you for the support.

As you requested, I have executed the ssu.sh on the machine and the result is attached.

I forgot to mention in my first message that once I figured out that the problem was not a memory or power problem, I tried to update the BIOS, but the FRU could not be upgraded.

Once again, thank you for the support

GMigu1
Beginner
235 Views

Mike, I also need to tell you that I am not using any PCI Express card or any hard disk other than the one shown in the report.

We use an SSD as swap, but this disk is deactivated, which is why you may see that there is no virtual memory available.

This machine is not currently in service as a node, so there is no processing load on it. It is currently configured as under maintenance.

Thank you for the support,

M.Eng. Guilherme Fernandes de Souza Miguel

GMigu1
Beginner
235 Views

Mike, I am also attaching the ssu report with 3rd party log messages.

Thank you,

idata
Community Manager
235 Views

Hi guilhermefsmiguel,

I noted you have updated the BIOS to the latest version, 02.06.0006; however, the logs are not showing the current version of the ME and BMC firmware.

Please run our application https://downloadcenter.intel.com/download/26991/System-Information-Retrieval-Utility-SysInfo- Intel® System Information Retrieval Utility and send me the results.

Additionally, send us the model of the chassis; if you are using an Intel® System, please add the product code of the chassis.

Regards,

Mike C

GMigu1
Beginner
235 Views

Hi Mike,

Here are the files you have requested.

Please note that OpenSuSE LEAP 42.1 doesn't have a /var/log/messages file.

journalctl is the option now. If necessary, I can try to send you its output.
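For readers in the same situation, here is a minimal Python sketch of one way to pull the messages from the boot that ended in the unexpected reset. It assumes a systemd journal with persistent storage enabled (without it, earlier boots are usually not kept), and the function names are illustrative only, not part of any Intel tool.

```python
import subprocess


def previous_boot_journal_cmd(boots_back=1, priority="warning"):
    """Build a journalctl command line for an earlier boot.

    boots_back=1 selects the boot before the current one.
    priority filters to that syslog level and anything more severe.
    """
    return [
        "journalctl",
        "--no-pager",
        f"--boot=-{boots_back}",
        f"--priority={priority}",
    ]


def read_previous_boot_journal(boots_back=1, priority="warning"):
    """Run journalctl and return its text output (empty if nothing matched)."""
    cmd = previous_boot_journal_cmd(boots_back, priority)
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout
```

The text returned by `read_previous_boot_journal()` can be written to a file and attached in place of /var/log/messages.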

Best regards,

idata
Community Manager
235 Views

Hi guilhermefsmiguel,

Thank you for your update. The BMC and ME firmware versions are up to date; however, the FRUSDR has not been updated yet. The system is using version 1.08.

Let's try to update it using an older version, 1.09. Use the BIOS-Firmware package https://downloadcenter.intel.com/download/22399/Intel-Server-Board-S2600CO-Firmware-Update-Package-f... version 01.06.0002R4151, following the steps below:

FRUSDR update steps:

1) Boot the system to the EFI shell and go to the root folder

2) At the EFI command prompt, run "FRUSDR.nsh" to start the FRUSDR update

3) Answer the questions and enter the desired information when prompted.

4) When complete, reboot the system from the front control panel

Verify that the FRUSDR update worked:

1) During POST, hit the F2 key when prompted to access the BIOS Setup Utility

2) Hit the F9 key to load BIOS Defaults, then hit F10 (save changes)

3) At the MAIN menu, verify the BIOS revision is 02.06.0006

4) Move the cursor to the SERVER MANAGEMENT menu

5) Move the cursor down to the SYSTEM INFORMATION option and hit Enter

6) Verify the BMC Firmware revision is 01.28.10603

7) Verify the SDR revision is 1.09

8) Verify the ME Firmware revision is 02.01.07.328

9) Hit the F10 key to save changes and exit
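As a small aid for the verification steps above, the following Python sketch compares revisions copied by hand from the BIOS Setup screens against the values listed in this post. The dictionary keys and the function name are hypothetical labels chosen for illustration; nothing here reads the values from the board.

```python
# Revisions expected after the update, as listed in the steps above.
EXPECTED_VERSIONS = {
    "BIOS": "02.06.0006",
    "BMC": "01.28.10603",
    "SDR": "1.09",
    "ME": "02.01.07.328",
}


def version_mismatches(observed, expected=EXPECTED_VERSIONS):
    """Return {component: (expected, observed)} for every revision that differs.

    A component missing from `observed` is reported with None as its value.
    """
    return {
        name: (want, observed.get(name))
        for name, want in expected.items()
        if observed.get(name) != want
    }
```

An empty result means every revision typed in matches the list; any entry returned names the component to check again.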

 

 

If it works, do the same with FRUSDR https://downloadcenter.intel.com/download/23156/Intel-Server-Board-S2600CO-Firmware-Update-Package-f... version 1.11

I will be waiting for the outcome of this workaround. Let me know the brand name and model of the chassis.

Regards,

Mike C
GMigu1
Beginner
235 Views

Hi Mike,

First of all, thank you for your time and help.

Checking your steps, I did not see any reference to a jumper change on the motherboard, so I assume that this is not necessary, right?

I am travelling and will return to the university on Friday, which is why I ask: do you think it is better to wait until Friday to execute this procedure on site, or is it safe to execute it using SOL?

I know that there are risks involved in any firmware upgrade, but I am not sure whether a BMC restart during its update, with the update process being controlled through a SOL session, could make it faulty.

If it is not necessary to change jumper positions and there is no additional risk in doing this upgrade via SOL, I will ask another person to download the software to a USB drive and insert it into the machine to proceed with the update.

Thank you once again,

idata
Community Manager
235 Views

Hi guilhermefsmiguel,

It is my pleasure to assist you.

The FRUSDR update does not require removing a jumper from the board itself; we can do it using the EFI shell.

I suggest updating the FRUSDR firmware on site rather than remotely. The BIOS might get corrupted if we try the remote option.

Let me know how the workaround works at your convenience; I will be waiting for your results.

Regards,

Mike C
GMigu1
Beginner
235 Views

Hi Mike,

I have upgraded the FRU firmware to the versions you recommended, and the screens you asked me to check to confirm the versions are attached.

But the problem persists.

1a | 09/22/2017 | 14:02:59 | System Event # 0x83 | OEM System boot event | Asserted

1b | 09/22/2017 | 14:04:16 | Power Unit # 0x01 | Failure detected | Asserted

1c | 09/22/2017 | 14:04:16 | Power Unit # 0x01 | Power off/down | Asserted

1d | 09/22/2017 | 14:04:21 | Power Unit # 0x01 | Power off/down | Deasserted

1e | 09/22/2017 | 14:04:21 | Power Unit # 0x01 | Failure detected | Deasserted

1f | 09/22/2017 | 14:04:54 | System Event # 0x83 | Timestamp Clock Sync | Asserted

20 | 09/22/2017 | 14:04:54 | System Event # 0x83 | Timestamp Clock Sync | Asserted

21 | 09/22/2017 | 14:05:19 | System Event # 0x83 | OEM System boot event | Asserted

22 | 09/22/2017 | 14:09:59 | Power Unit # 0x01 | Failure detected | Asserted

23 | 09/22/2017 | 14:09:59 | Power Unit # 0x01 | Power off/down | Asserted

24 | 09/22/2017 | 14:10:04 | Power Unit # 0x01 | Power off/down | Deasserted

25 | 09/22/2017 | 14:10:04 | Power Unit # 0x01 | Failure detected | Deasserted

26 | 09/22/2017 | 14:10:35 | System Event # 0x83 | Timestamp Clock Sync | Asserted

27 | 09/22/2017 | 14:10:35 | System Event # 0x83 | Timestamp Clock Sync | Asserted

28 | 09/22/2017 | 14:11:01 | System Event # 0x83 | OEM System boot event | Asserted

29 | 09/22/2017 | 14:14:50 | Power Unit # 0x01 | Failure detected | Asserted

2a | 09/22/2017 | 14:14:51 | Power Unit # 0x01 | Power off/down | Asserted

2b | 09/22/2017 | 14:14:56 | Power Unit # 0x01 | Power off/down | Deasserted

2c | 09/22/2017 | 14:15:08 | Power Unit # 0x01 | Failure detected | Deasserted

2d | 09/22/2017 | 14:15:26 | System Event # 0x83 | Timestamp Clock Sync | Asserted

2e | 09/22/2017 | 14:15:27 | System Event # 0x83 | Timestamp Clock Sync | Asserted

2f | 09/22/2017 | 14:15:52 | System Event # 0x83 | OEM System boot event | Asserted

30 | 09/22/2017 | 14:17:59 | Power Unit # 0x01 | Failure detected | Asserted

31 | 09/22/2017 | 14:18:00 | Power Unit # 0x01 | Power off/down | Asserted

32 | 09/22/2017 | 14:18:05 | Power Unit # 0x01 | Failure detected | Deasserted

33 | 09/22/2017 | 14:18:10 | Power Unit # 0x01 | Power off/down | Deasserted

34 | 09/22/2017 | 14:18:35 | System Event # 0x83 | Timestamp Clock Sync | Asserted

35 | 09/22/2017 | 14:18:35 | System Event # 0x83 | Timestamp Clock Sync | Asserted

36 | 09/22/2017 | 14:23:09 | System Event # 0x83 | OEM System boot event | Asserted

Do you think it would be worthwhile to test the power supply again?
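To put numbers on how long the machine stays up, a minimal Python sketch follows. It assumes the SEL text uses the `id | date | time | sensor | event | state` layout pasted above, and it measures the seconds between each "OEM System boot event" and the next asserted Power Unit "Failure detected". The function names are illustrative and not part of any Intel or IPMI tooling.

```python
from datetime import datetime


def parse_sel_line(line):
    """Parse one SEL line into (timestamp, sensor, event, state), or None."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 6:
        return None  # blank line or a line cut off mid-record
    try:
        stamp = datetime.strptime(f"{fields[1]} {fields[2]}", "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return stamp, fields[3], fields[4], fields[5]


def uptime_before_failures(sel_text):
    """Seconds from each boot event to the next asserted power unit failure."""
    uptimes = []
    last_boot = None
    for line in sel_text.splitlines():
        entry = parse_sel_line(line)
        if entry is None:
            continue
        stamp, sensor, event, state = entry
        if event == "OEM System boot event":
            last_boot = stamp
        elif (
            last_boot is not None
            and sensor.startswith("Power Unit")
            and event == "Failure detected"
            and state == "Asserted"
        ):
            uptimes.append(int((stamp - last_boot).total_seconds()))
            last_boot = None  # wait for the next boot event
    return uptimes


SAMPLE = """\
1a | 09/22/2017 | 14:02:59 | System Event # 0x83 | OEM System boot event | Asserted
1b | 09/22/2017 | 14:04:16 | Power Unit # 0x01 | Failure detected | Asserted
21 | 09/22/2017 | 14:05:19 | System Event # 0x83 | OEM System boot event | Asserted
22 | 09/22/2017 | 14:09:59 | Power Unit # 0x01 | Failure detected | Asserted
"""
```

Calling `uptime_before_failures(SAMPLE)` on these four records taken from the log above gives `[77, 280]`, that is, 77 and 280 seconds of uptime before the two failures.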

GMigu1
Beginner
235 Views

I would like to mention that the machine restarts when it boots into Linux, even without any CPU or RAM consumption. If I load the EFI shell or enter the BIOS, it doesn't restart.

I have googled about it, but I could not find the exact same problem.

I have attached the information shown in the BMC web interface.

Thank you once again for your time and support.

GMigu1
Beginner
235 Views

Dear Mike, another professor told me a few moments ago that he saw the machine restarting even when it was in the EFI shell.

So please disregard my earlier statement that it only restarts when it is booted into Linux.

idata
Community Manager
235 Views

Hi guilhermefsmiguel,

Thank you for your update. The system is still showing the power supply as faulty even with FRUSDR 1.11.

I suggest updating the FRUSDR to the latest version, 1.12 (https://downloadcenter.intel.com/download/23917/Intel-Server-Board-S2600CO-Firmware-Update-Package-f... Version: 02.03.0003). Hopefully, it will solve the issue. Keep using the same method.

FRUSDR update steps:

1) Boot the system to the EFI shell and go to the root folder

2) At the EFI command prompt, run "FRUSDR.nsh" to start the FRUSDR update

3) Answer the questions and enter the desired information when prompted.

4) When complete, reboot the system from the front control panel

If the problem continues, double check that OpenSuSE LEAP 42.1 is up to date.

Please keep us posted with the results.

Regards,

Mike C
idata
Community Manager
235 Views

Hi Guilhermefsmiguel,

Thank you for your update. I am interested to know if you are still having issues with the Intel® Server Board S2600COE.

Regards,

Mike C

GMigu1
Beginner
235 Views

Hi Mike,

I did not tell you that I can only do these activities on Thursdays or Fridays, because I work for two universities. At one of them I am a professor and researcher; at the other, only a researcher.

These universities are in different cities, which is why I spend half of my week in each city.

But last week I had to stay the whole week away from Uberlândia, where the cluster is located, and I will return there this Thursday.

So, as soon as I get the firmware upgraded and have the results, I will let you know.

Thank you for your time and support.

idata
Community Manager
235 Views

Hi, Guilhermefsmiguel,

I will be waiting for your outcome, thank you for your update.

Regards,

Mike C
GMigu1
Beginner
235 Views

Hi Mike, sorry about the delayed response.

Yesterday I performed the FRU upgrade, but it did not solve the problem.

I also checked my Linux installation and it was up to date. I tried to install a newer version (LEAP 42.3), but it was not possible due to the constant restarts.

I have attached the Main and System Info screens.

I have replaced the power supply with another one of the same model, to check once again whether the problem was in the power supply. The random reboots persist.

If the machine is in the BIOS or in the internal EFI shell, it doesn't randomly reboot. When booted via PXE, HD or USB, it reboots.

I recorded audio of the beeps on my phone, but it was not possible to attach it to this message. I'll try to do it later.

The output of the sel list command:

17 | 10/06/2017 | 16:05:43 | System Event # 0x83 | Timestamp Clock Sync | Asserted

18 | 10/06/2017 | 16:06:08 | System Event # 0x83 | OEM System boot event | Asserted

19 | 10/06/2017 | 16:06:53 | Power Unit # 0x01 | Failure detected | Asserted

1a | 10/06/2017 | 16:06:53 | Power Unit # 0x01 | Power off/down | Asserted

1b | 10/06/2017 | 16:06:58 | Power Unit # 0x01 | Power off/down | Deasserted

1c | 10/06/2017 | 16:06:59 | Power Unit # 0x01 | Power off/down | Asserted

1d | 10/06/2017 | 16:06:59 | Power Unit # 0x01 | Failure detected | Deasserted

1e | 10/06/2017 | 16:07:04 | Power Unit # 0x01 | Power off/down | Deasserted

1f | 10/06/2017 | 16:07:29 | System Event # 0x83 | Timestamp Clock Sync | Asserted

20 | 10/06/2017 | 16:07:29 | System Event # 0x83 | Timestamp Clock Sync | Asserted

21 | 10/06/2017 | 16:07:54 | System Event # 0x83 | OEM System boot event | Asserted

22 | 10/06/2017 | 16:09:53 | Power Unit # 0x01 | Failure detected | Asserted

23 | 10/06/2017 | 16:09:54 | Power Unit # 0x01 | Power off/down | Asserted

24 | 10/06/2017 | 16:09:59 | Power Unit # 0x01 | Failure detected | Deasserted

25 | 10/06/2017 | 16:10:04 | Power Unit # 0x01 | Power off/down | Deasserted

26 | 10/06/2017 | 16:10:29 | System Event # 0x83 | Timestamp Clock Sync | Asserted

27 | 10/06/2017 | 16:10:29 | System Event # 0x83 | Timestamp Clock Sync | Asserted

28 | 10/06/2017 | 16:10:55 | System Event # 0x83 | OEM System boot event | Asserted

29 | 10/06/2017 | 16:12:29 | Power Unit # 0x01 | Failure detected | Asserted

2a | 10/06/2017 | 16:12:30 | Power Unit # 0x01 | Power off/down | Asserted

2b | 10/06/2017 | 16:12:35 | Power Unit # 0x01 | Failure detected | De

Please let me know if there is any other information that I need to report to you.

Best regards,

idata
Community Manager
235 Views

Hi, Guilhermefsmiguel,

I reviewed the results and noted that you have updated all the components of the BIOS-firmware of the Intel® Server Board S2600COE and you are still getting random reboots.

Based on the previous troubleshooting, we can narrow down the issue to either the operating system or a problem with the power distribution board included with the chassis.

If you have an Intel chassis, please send me the product code (usually on the outside of the chassis), as well as the PBA number and serial number of the power distribution board (it is necessary to remove it from the chassis to read them). I will review the warranty status and proceed with its replacement.

Regards,

Mike C
GMigu1
Beginner
235 Views

Hi Mike, I need to thank you for your time and support on this issue.

Attached is all of the information shown on the board, cabinet and power supply. One of the students took the pictures for me, to speed up the process.

I need to ask you whether this problem could be caused by the board not detecting Linux. It is strange, since we are using the same Linux version and installation method on all of the machines. Do you think, for example, that it could be related to UEFI boot? I ask because when I leave the machine on the grub2 screen (not booting the OS), it does not restart (I ran a test a few days ago).

It may be a misunderstanding on my part, but I believe that after upgrading the BIOS etc., the time between booting into Linux and the reboot got shorter.

Once again, thank you for your time and support!

idata
Community Manager
96 Views

Hi, Guilhermefsmiguel,

Following up on your case with random reboots on the Intel® Server Board S2600COE.

I have reviewed the pictures of your system, and everything looks fine. The wattage of the power supply is sufficient. The server chassis was not designed for this board, but it is possible to connect the power supply to the motherboard without a power bridge.

https://www.intel.com/content/dam/support/us/en/documents/motherboards/server/s2600co/sb/s2600co_con... s2600CO configuration guide

Now, OpenSuSE LEAP 42.1 has not been validated by our engineering department, so you may see random issues.

https://www.intel.com/content/www/us/en/support/articles/000007556/server-products/server-boards.htm... Operating System Compatibility for Intel® Server Board S2600CO Family

I suggest double checking the OpenSuse Leap support website and reviewing the UEFI compatibility documents. Double check whether your system is already using grub.efi and shim.efi

https://forums.opensuse.org/showthread.php/511105-UEFI-install-issues-with-LEAP-42-1 UEFI install issues with LEAP 42.1

https://en.opensuse.org/openSUSE:UEFI openSUSE:UEFI

Regards,

Mike C