Dear all, I have an SC2600CP board in a server with 2 Xeon CPUs and 196GB of RAM.
This machine is used as a compute node in a cluster, alongside other machines that have almost the same configuration.
A few days ago it started rebooting for no apparent reason.
To try to identify the problem, I checked all the DIMM slots and tested every memory module, looking for a faulty one, but could not find any errors.
Then I checked the SEL logs:
1 | 09/17/2017 | 16:38:10 | Event Logging Disabled # 0x07 | Log area reset/cleared | Asserted
2 | 09/17/2017 | 17:16:55 | Power Unit # 0x01 | Failure detected | Asserted
3 | 09/17/2017 | 17:16:56 | Power Unit # 0x01 | Power off/down | Asserted
4 | 09/17/2017 | 17:17:01 | Power Unit # 0x01 | Power off/down | Deasserted
5 | 09/17/2017 | 17:17:01 | Power Unit # 0x01 | Failure detected | Deasserted
6 | 09/17/2017 | 17:17:02 | Power Unit # 0x01 | Power off/down | Asserted
7 | 09/17/2017 | 17:17:07 | Power Unit # 0x01 | Power off/down | Deasserted
8 | 09/17/2017 | 17:17:13 | Fan # 0x32 | Lower Non-critical going low | Deasserted
9 | 09/17/2017 | 17:17:13 | Fan # 0x32 | Lower Critical going low | Deasserted
a | 09/17/2017 | 17:17:13 | Fan # 0x32 | Lower Non-critical going low | Deasserted
b | 09/17/2017 | 17:17:13 | Fan # 0x32 | Lower Critical going low | Deasserted
c | 09/17/2017 | 17:17:24 | Fan # 0x32 | Lower Non-critical going low | Asserted
d | 09/17/2017 | 17:17:24 | Fan # 0x32 | Lower Critical going low | Asserted
e | 09/17/2017 | 17:17:31 | System Event # 0x83 | Timestamp Clock Sync | Asserted
f | 09/17/2017 | 17:17:32 | System Event # 0x83 | Timestamp Clock Sync | Asserted
10 | 09/17/2017 | 17:17:55 | System Event # 0x83 | OEM System boot event | Asserted
and on the BMC web console:
30 | 09/17/2017 17:39:32 | Pwr Unit Status | Power Unit | reports the power unit is powered off or being powered down - Asserted
29 | 09/17/2017 17:37:19 | BIOS Evt Sensor | System Event | reports OEM System Boot Event - Asserted
28 | 09/17/2017 17:36:56 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. - Asserted
27 | 09/17/2017 17:36:56 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. - Asserted
26 | 09/17/2017 17:36:49 | System Fan 3 | Fan | reports the sensor is in a low, critical, and going lower state - Asserted
25 | 09/17/2017 17:36:49 | System Fan 3 | Fan | reports the sensor is in a low, but non-critical, and going lower state - Asserted
24 | 09/17/2017 17:36:36 | System Fan 3 | Fan | reports the sensor is in a low, critical, and going lower state - Deasserted
23 | 09/17/2017 17:36:36 | System Fan 3 | Fan | ...
Hi guilhermefsmiguel,
I am Mike and it is a pleasure to assist you.
The Intel® Server Board S2600COE (G29920-205) is rebooting randomly, the SEL utility is showing power supply # 1 as faulty, and the new power supply did not solve the issue.
I have noted that your system is running FRU 1.0, which was released in 2012, so my first recommendation is to update the BIOS and firmware of the server. Before doing so, run our Intel® System Support Utility and send us the results; based on the current BIOS and firmware version, I will let you know which BIOS version to use for the update. If we jump straight to the latest version, 02.06.0006 (7/20/2017), the server might stop working properly.
https://downloadcenter.intel.com/product/91600/Intel-System-Support-Utility Downloads for Intel® System Support Utility (Windows* and Linux*)
I will be waiting for your results so that I can assist you further.
Regards,
Mike C
Hi Mike, thank you for the support.
As you requested, I have executed the ssu.sh on the machine and the result is attached.
I forgot to mention in my first message that once I figured out that the problem was not a memory or power supply problem, I tried to update the BIOS, but the FRU could not be upgraded.
Once again, thank you for the support.
Mike, I also need to tell you that I am not using any PCI Express card or any hard disk other than the one listed in the report.
We use an SSD as swap, but this disk is deactivated, which is why you may see that there is no virtual memory available.
This machine is not currently in service as a node, so there is no processing load on it. It is now configured as under maintenance.
Thank you for the support,
M.Eng. Guilherme Fernandes de Souza Miguel
Mike, I am also attaching the ssu report with 3rd party log messages.
Thank you,
Hi guilhermefsmiguel,
I noted you have updated the BIOS to the latest version 02.06.0006; however, the logs are not showing the current version of the ME and BMC firmware.
Please run our Intel® System Information Retrieval Utility and send me the results: https://downloadcenter.intel.com/download/26991/System-Information-Retrieval-Utility-SysInfo-
Additionally, send us the model of the chassis; if you are using an Intel® System, please add the product code of the chassis.
Regards,
Mike C
Hi Mike,
Here are the files you have requested.
Please note that OpenSuSE LEAP 42.1 doesn't have a /var/log/messages file.
journalctl is the option now. If necessary, I can try to send you its output.
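Since the reboots are unexpected, the journal entries from the boot before the current one are usually the interesting ones. Below is a minimal Python sketch of how they could be collected. It assumes the journal is stored persistently (otherwise earlier boots are not kept), and the helper names `previous_boot_journal_cmd` and `read_previous_boot_journal` are invented for this illustration.

```python
import subprocess

def previous_boot_journal_cmd(boot_offset=-1, priority="warning"):
    """Build a journalctl command line for an earlier boot.

    boot_offset=-1 selects the boot before the current one;
    priority limits output to that severity and anything more severe.
    """
    return [
        "journalctl",
        f"--boot={boot_offset}",
        f"--priority={priority}",
        "--no-pager",
    ]

def read_previous_boot_journal(boot_offset=-1, priority="warning"):
    """Run journalctl and return its text output (may need root privileges)."""
    cmd = previous_boot_journal_cmd(boot_offset, priority)
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout

# Show the command that would be run, without executing it.
print(" ".join(previous_boot_journal_cmd()))
```

The last messages before each power-off are the part worth attaching; if the journal simply stops with no shutdown messages, that would be consistent with a sudden loss of power rather than an OS-initiated reboot.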
Best regards,
Thank you for your update. The BMC and ME firmware versions are up to date; however, the FRUSDR has not been updated yet. The system is using version 1.08.
Let's try to update it using an older version, 1.09. Use the BIOS-firmware package version 01.06.0002R4151 and follow the steps below:
https://downloadcenter.intel.com/download/22399/Intel-Server-Board-S2600CO-Firmware-Update-Package-for-Extensible-Firmware-Interface-EFI-?product=63157
FRUSDR update steps:
1) Boot the system to the EFI shell and go to the root folder
2) At the EFI command prompt, run "FRUSDR.nsh" to start the FRUSDR update
3) Answer the questions and enter the desired information when prompted
4) When complete, reboot the system using the front control panel
Verify that the FRUSDR update worked:
1) During POST, hit the F2 key when prompted to access the BIOS Setup Utility
2) Hit the F9 key to load BIOS Defaults, then hit the F10 key to save changes
3) At the MAIN menu verify the BIOS revision is 02.06.0006
4) Move cursor to the SERVER MANAGEMENT Menu
5) Move cursor down to the SYSTEM INFORMATION Option and hit Enter
6) Verify the BMC Firmware revision is 01.28.10603
7) Verify the SDR revision is 1.09
8) Verify the ME Firmware revision is 02.01.07.328
9) Hit the F10 Key to save changes and Exit
If it works, do the same with FRUSDR version 1.11: https://downloadcenter.intel.com/download/23156/Intel-Server-Board-S2600CO-Firmware-Update-Package-for-Extensible-Firmware-Interface-EFI-?product=63157
I will be waiting for the outcome of this workaround. Please also let me know the brand name and model of the chassis.
Regards,
Mike C
Hi Mike,
First of all, thank you for your time and help.
Looking at your procedure, I did not see any reference to a jumper change on the motherboard, so I assume that this is not necessary, right?
I am travelling and will return to the university on Friday, which is why I ask: do you think it is better to wait until Friday to carry out this procedure on site, or is it safe to run it over SOL?
I know that there are risks involved in any firmware upgrade, but I am not sure whether a BMC restart during its own update, with the update process being controlled from a SOL session, could leave it faulty.
If it is not necessary to change jumper positions and there is no additional risk in doing this upgrade via SOL, I will ask another person to download the software to a USB drive and insert it into the machine so that I can proceed with the update.
Thank you once again,
Hi guilhermefsmiguel,
It is my pleasure to assist you.
The FRUSDR update does not require removing a jumper from the board itself; it can be done from the EFI shell.
I suggest updating the FRUSDR firmware physically on site rather than remotely. The BIOS might get corrupted if we try the remote option.
Let me know how the workaround goes at your convenience; I will be waiting for your results.
Regards,
Mike C
Hi Mike,
I have upgraded the FRU firmware to the versions you recommended, and the screens you asked me to check for the versions are attached.
But the problem persists.
1a | 09/22/2017 | 14:02:59 | System Event # 0x83 | OEM System boot event | Asserted
1b | 09/22/2017 | 14:04:16 | Power Unit # 0x01 | Failure detected | Asserted
1c | 09/22/2017 | 14:04:16 | Power Unit # 0x01 | Power off/down | Asserted
1d | 09/22/2017 | 14:04:21 | Power Unit # 0x01 | Power off/down | Deasserted
1e | 09/22/2017 | 14:04:21 | Power Unit # 0x01 | Failure detected | Deasserted
1f | 09/22/2017 | 14:04:54 | System Event # 0x83 | Timestamp Clock Sync | Asserted
20 | 09/22/2017 | 14:04:54 | System Event # 0x83 | Timestamp Clock Sync | Asserted
21 | 09/22/2017 | 14:05:19 | System Event # 0x83 | OEM System boot event | Asserted
22 | 09/22/2017 | 14:09:59 | Power Unit # 0x01 | Failure detected | Asserted
23 | 09/22/2017 | 14:09:59 | Power Unit # 0x01 | Power off/down | Asserted
24 | 09/22/2017 | 14:10:04 | Power Unit # 0x01 | Power off/down | Deasserted
25 | 09/22/2017 | 14:10:04 | Power Unit # 0x01 | Failure detected | Deasserted
26 | 09/22/2017 | 14:10:35 | System Event # 0x83 | Timestamp Clock Sync | Asserted
27 | 09/22/2017 | 14:10:35 | System Event # 0x83 | Timestamp Clock Sync | Asserted
28 | 09/22/2017 | 14:11:01 | System Event # 0x83 | OEM System boot event | Asserted
29 | 09/22/2017 | 14:14:50 | Power Unit # 0x01 | Failure detected | Asserted
2a | 09/22/2017 | 14:14:51 | Power Unit # 0x01 | Power off/down | Asserted
2b | 09/22/2017 | 14:14:56 | Power Unit # 0x01 | Power off/down | Deasserted
2c | 09/22/2017 | 14:15:08 | Power Unit # 0x01 | Failure detected | Deasserted
2d | 09/22/2017 | 14:15:26 | System Event # 0x83 | Timestamp Clock Sync | Asserted
2e | 09/22/2017 | 14:15:27 | System Event # 0x83 | Timestamp Clock Sync | Asserted
2f | 09/22/2017 | 14:15:52 | System Event # 0x83 | OEM System boot event | Asserted
30 | 09/22/2017 | 14:17:59 | Power Unit # 0x01 | Failure detected | Asserted
31 | 09/22/2017 | 14:18:00 | Power Unit # 0x01 | Power off/down | Asserted
32 | 09/22/2017 | 14:18:05 | Power Unit # 0x01 | Failure detected | Deasserted
33 | 09/22/2017 | 14:18:10 | Power Unit # 0x01 | Power off/down | Deasserted
34 | 09/22/2017 | 14:18:35 | System Event # 0x83 | Timestamp Clock Sync | Asserted
35 | 09/22/2017 | 14:18:35 | System Event # 0x83 | Timestamp Clock Sync | Asserted
36 | 09/22/2017 | 14:23:09 | System Event # 0x83 | OEM System boot event | Asserted
Do you think it would be worth testing the power supply again?
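To see how long the machine stays up before each power failure, the SEL lines above can be parsed with a short script. This is a minimal Python sketch, assuming the `id | date | time | sensor | event | state` layout shown in this thread; the function names `parse_sel_line` and `uptimes_before_failure` are made up for illustration and are not part of ipmitool or any Intel utility.

```python
from datetime import datetime

def parse_sel_line(line):
    """Split one 'id | date | time | sensor | event | state' line into a dict."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) != 6:
        return None  # skip malformed lines
    rec_id, date, time, sensor, event, state = fields
    try:
        when = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None  # skip lines whose timestamp cannot be read
    return {"id": rec_id, "when": when, "sensor": sensor,
            "event": event, "state": state}

def uptimes_before_failure(lines):
    """Seconds between each boot event and the next asserted power failure."""
    uptimes = []
    last_boot = None
    for line in lines:
        rec = parse_sel_line(line)
        if rec is None or rec["state"] != "Asserted":
            continue
        if rec["event"] == "OEM System boot event":
            last_boot = rec["when"]
        elif rec["event"] == "Failure detected" and last_boot is not None:
            uptimes.append(int((rec["when"] - last_boot).total_seconds()))
            last_boot = None  # wait for the next boot event
    return uptimes

# Four lines copied from the log above (first two reboot cycles).
sample = [
    "1a | 09/22/2017 | 14:02:59 | System Event # 0x83 | OEM System boot event | Asserted",
    "1b | 09/22/2017 | 14:04:16 | Power Unit # 0x01 | Failure detected | Asserted",
    "21 | 09/22/2017 | 14:05:19 | System Event # 0x83 | OEM System boot event | Asserted",
    "22 | 09/22/2017 | 14:09:59 | Power Unit # 0x01 | Failure detected | Asserted",
]
print(uptimes_before_failure(sample))
```

For these two cycles the script prints `[77, 280]`, meaning the power unit failure was logged 77 and 280 seconds after the respective boot events.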
I would like to mention that the machine restarts when it boots into Linux, even without any CPU or RAM consumption. If I load the EFI shell or enter the BIOS setup, it doesn't restart.
I have searched Google for this, but I could not find the exact same problem.
I have attached the information shown in the BMC web interface.
Thank you once again for your time and support.
Dear Mike, another professor told me a few seconds ago that he saw the machine restart even while it was in the EFI shell.
So please disregard my last statement that it only restarts when it is booted into Linux.
Hi guilhermefsmiguel,
Thank you for your update. The system is still showing the power supply as faulty even with FRUSDR 1.11.
I suggest updating the FRUSDR to the latest version, 1.12 (https://downloadcenter.intel.com/download/23917/Intel-Server-Board-S2600CO-Firmware-Update-Package-for-Extensible-Firmware-Interface-EFI-?product=63157 Version: 02.03.0003). Hopefully, it will solve the issue. Keep using the same method.
FRUSDR update steps:
1) Boot the system to the EFI shell and go to the root folder
2) At the EFI command prompt, run "FRUSDR.nsh" to start the FRUSDR update
3) Answer the questions and enter the desired information when prompted
4) When complete, reboot the system using the front control panel
If the problem continues, double-check that OpenSuSE LEAP 42.1 is up to date.
Please keep us posted on the results.
Regards,
Mike C
Hi Guilhermefsmiguel,
Thank you for your update. I am interested to know if you are still having issues with the Intel® Server Board S2600COE.
Regards,
Mike C
Hi Mike,
I had not told you that I can only do these activities on Thursdays or Fridays, because I work for two universities. At one of them I am a professor and researcher; at the other, only a researcher.
These universities are in different cities, which is why I spend half of my week in each city.
However, last week I had to spend the whole week away from Uberlândia, where the cluster is located, and I will return there this Thursday.
So, as soon as I have upgraded the firmware and have the results, I will let you know.
Thank you for your time and support.
Hi, Guilhermefsmiguel,
I will be waiting for your outcome, thank you for your update.
Regards,
Mike C
Hi Mike, sorry about the delayed response.
Yesterday I did the FRU upgrade, but it didn't solve the problem.
I also checked my Linux installation and it was up to date. I tried to install a newer version (LEAP 42.3), but it was not possible due to the constant restarts.
I have attached the Main and System Info screens.
I have replaced the power supply with another one of the same model, to check once again whether the problem was in the power supply. The random reboots persist.
If the machine is in the BIOS or in the internal EFI shell, it doesn't reboot randomly. When booted via PXE, HD or USB, it reboots.
I have recorded the beeps on my phone, but it was not possible to attach the audio to this message. I'll try to do it later.
The output of the sel list command:
17 | 10/06/2017 | 16:05:43 | System Event # 0x83 | Timestamp Clock Sync | Asserted
18 | 10/06/2017 | 16:06:08 | System Event # 0x83 | OEM System boot event | Asserted
19 | 10/06/2017 | 16:06:53 | Power Unit # 0x01 | Failure detected | Asserted
1a | 10/06/2017 | 16:06:53 | Power Unit # 0x01 | Power off/down | Asserted
1b | 10/06/2017 | 16:06:58 | Power Unit # 0x01 | Power off/down | Deasserted
1c | 10/06/2017 | 16:06:59 | Power Unit # 0x01 | Power off/down | Asserted
1d | 10/06/2017 | 16:06:59 | Power Unit # 0x01 | Failure detected | Deasserted
1e | 10/06/2017 | 16:07:04 | Power Unit # 0x01 | Power off/down | Deasserted
1f | 10/06/2017 | 16:07:29 | System Event # 0x83 | Timestamp Clock Sync | Asserted
20 | 10/06/2017 | 16:07:29 | System Event # 0x83 | Timestamp Clock Sync | Asserted
21 | 10/06/2017 | 16:07:54 | System Event # 0x83 | OEM System boot event | Asserted
22 | 10/06/2017 | 16:09:53 | Power Unit # 0x01 | Failure detected | Asserted
23 | 10/06/2017 | 16:09:54 | Power Unit # 0x01 | Power off/down | Asserted
24 | 10/06/2017 | 16:09:59 | Power Unit # 0x01 | Failure detected | Deasserted
25 | 10/06/2017 | 16:10:04 | Power Unit # 0x01 | Power off/down | Deasserted
26 | 10/06/2017 | 16:10:29 | System Event # 0x83 | Timestamp Clock Sync | Asserted
27 | 10/06/2017 | 16:10:29 | System Event # 0x83 | Timestamp Clock Sync | Asserted
28 | 10/06/2017 | 16:10:55 | System Event # 0x83 | OEM System boot event | Asserted
29 | 10/06/2017 | 16:12:29 | Power Unit # 0x01 | Failure detected | Asserted
2a | 10/06/2017 | 16:12:30 | Power Unit # 0x01 | Power off/down | Asserted
2b | 10/06/2017 | 16:12:35 | Power Unit # 0x01 | Failure detected | De
Please let me know if there is any other information that I need to report to you.
Best regards,
Hi, Guilhermefsmiguel,
I reviewed the results and noted that you have updated all the components of the BIOS-firmware of the Intel® Server Board S2600COE and you are still getting random reboots.
Based on the previous troubleshooting, we can narrow the issue down to the operating system or to a problem with the power distribution board included with the chassis.
If you have an Intel chassis, please send me the product code (usually on the outside of the chassis), along with the PBA number and the serial number of the power distribution board (it is necessary to remove the board from the chassis to read them). I will review the warranty status and proceed with its replacement.
Regards,
Mike C
Hi Mike, thank you for your time and support on this issue.
Attached is all of the information found on the board, cabinet and power supply. One of the students took the pictures for me, to speed up the process.
I need to ask you whether this problem could be caused by the board not detecting Linux. It is strange, since we are using the same Linux version and installation method on all of the machines. Do you think, for example, that it could be related to UEFI boot? When I left the machine on the grub2 screen (not booting the OS), it did not restart (I ran that test a few days ago).
I may be mistaken, but I believe that after upgrading the BIOS and firmware, the time between booting into Linux and the reboot got shorter.
Once again, thank you for your time and support!
Hi, Guilhermefsmiguel,
I am following up on your case of random reboots on the Intel® Server Board S2600COE.
I have reviewed the pictures of your system and everything looks fine. The wattage of the power supply is sufficient. The server chassis was not designed for this board, but it is possible to connect the power supply to the motherboard without a power bridge.
https://www.intel.com/content/dam/support/us/en/documents/motherboards/server/s2600co/sb/s2600co_config_guide_19.pdf s2600CO configuration guide
However, OpenSuSE LEAP 42.1 has not been validated by our engineering department, so you may see random issues.
https://www.intel.com/content/www/us/en/support/articles/000007556/server-products/server-boards.html Operating System Compatibility for Intel® Server Board S2600CO Family
I suggest double-checking the OpenSuse Leap support website and reviewing the UEFI compatibility documents. Also check whether your system is already using grub.efi and shim.efi.
https://forums.opensuse.org/showthread.php/511105-UEFI-install-issues-with-LEAP-42-1 UEFI install issues with LEAP 42.1
https://en.opensuse.org/openSUSE:UEFI openSUSE:UEFI
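As a quick way to run the check suggested above from the installed Linux system, here is a minimal Python sketch. It relies on the fact that a Linux kernel booted through UEFI exposes the directory `/sys/firmware/efi`; the ESP mount point `/boot/efi` used in the example call is an assumption (adjust it to your layout), and both function names are invented for this illustration.

```python
import os

def booted_in_uefi_mode(efi_dir="/sys/firmware/efi"):
    """Return True when the running kernel exposes EFI firmware data."""
    return os.path.isdir(efi_dir)

def bootloader_files_present(esp_root, names=("grub.efi", "shim.efi")):
    """Map each loader file name to whether it exists anywhere under esp_root."""
    found = {name: False for name in names}
    for _dirpath, _dirnames, filenames in os.walk(esp_root):
        lower = {f.lower() for f in filenames}  # FAT file names are case-insensitive
        for name in names:
            if name.lower() in lower:
                found[name] = True
    return found

if __name__ == "__main__":
    print("UEFI boot:", booted_in_uefi_mode())
    # /boot/efi is an assumed mount point for the EFI System Partition.
    print(bootloader_files_present("/boot/efi"))
```

If the first line reports `False`, the system was started in legacy BIOS mode and the grub.efi and shim.efi files are not being used for that boot.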
Regards,
Mike C
 
					
				
				
			
		