A customer of ours has an Intel R2208GZ4GC server running Windows Server 2012 R2 as the host OS. It has 2 x SSDs (configured as RAID 1 for the OS) and 5 x 600GB SATA 6G drives. There are two VMs, also running 2012 R2.
The BIOS is at 2.06.0005 which is the latest for the hardware and the RAID firmware is also up to date. The latest updates have been applied to both the host and guest operating systems.
The server has hung every 24-48 hours for the last 5 days. There were no previous problems and nothing changed before the hangs started (that I know of, at least!). The symptoms are:
- Server appears to stop normal operation.
- If you hit CTRL+ALT+DEL on the console you get the login screen but can get no further.
- All connections to the guest VMs die.
- There are no entries in the Windows event log during the hung period.
- There are no preceding entries in the Windows event log which suggest there is a problem.
- The system status LED (triangle) is flashing green, but an examination of the SEL suggests this is to do with a detected issue with power supply redundancy. Both PSUs are working fine so this is probably a red herring.
- There are no other entries in the SEL which would suggest any problems.
- The only cure is a power-cycle.
Grateful for any suggestions on what we could try next to isolate the problem.
The issue you describe sounds software / Windows related. But if you want to rule out possible hardware issues, you could retrieve the board system logs using the following tool: https://downloadcenter.intel.com/download/25440/System-Event-Log-SEL-Viewer-Utility?product=56262 . It will get the logs stored on the BMC, showing all the info related to sensors, voltages and such.
Please save the log into a file and let us know.
Yes, we initially thought it was Windows but since there were no obvious errors or problems there we started to look at the hardware. The only changes we have applied hardware-wise so far are upgrading the BIOS to the latest version, re-seating RAM and blowing out a lot of dust that had accumulated within the server.
I've attached the SEL log from the past couple of days (I zeroed it at the weekend). It doesn't seem to indicate anything major is wrong, but you might see something we are missing. Note that you'll see a flurry of activity yesterday afternoon as we updated the BIOS and tested the PSUs.
Thanks for attaching the logs. I looked at them and the only things that caught my attention were repeated warnings about PS redundant power lost and a couple of critical records about a PS with insufficient resources to maintain normal operation. I assume these happened during your PS testing. This could possibly cause a reboot or a total shutdown due to insufficient power, but I don't think it could cause a system hang.
Let's allow the system to run and gather the logs right after the next hang occurs, so we can see whether any hardware-related errors are shown.
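When a post-hang log has been saved, it may help to separate the power supply entries already discussed from everything else, so that any new hardware record stands out. The following is only a rough sketch: it assumes the SEL was exported as plain text with one record per line, and the keyword pattern and the function names `split_sel_lines` and `summarize` are hypothetical, not part of the SEL Viewer utility.

```python
import re

# Assumed keywords for power-supply records; adjust to match the wording
# actually used in the exported SEL text.
PSU_PATTERN = re.compile(r"power supply|redundan|\bPSU?\d*\b", re.IGNORECASE)


def split_sel_lines(lines):
    """Split log lines into (psu_related, other), skipping blank lines."""
    psu_related, other = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if PSU_PATTERN.search(line):
            psu_related.append(line)
        else:
            other.append(line)
    return psu_related, other


def summarize(path):
    """Print counts, then every record that is not power-supply related."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        psu_related, other = split_sel_lines(handle)
    print(f"{len(psu_related)} power-supply records, {len(other)} other records")
    for line in other:
        print(line)
```

Calling `summarize("sel_after_hang.txt")` (the file name is a placeholder) would list only the records that are not about the PSUs, which are the ones worth reading first after a hang.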