Last night I had the following trouble with an S2600GZ server, serial number QSGR21400619.
Server uptime was about half a year, with heavy load for about 3 months.
What should I replace first?
The processor? If yes, which one?
The memory?
Or the motherboard?
I can also send the DebugLog file on request.
ID | Time | Sensor | Type | Description
315 | 05.12.2013 8:49 | BIOS Evt Sensor | System Event | reports OEM System Boot Event - Asserted
314 | 05.12.2013 8:48 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. - Asserted
313 | 05.12.2013 8:48 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. - Asserted
312 | 05.12.2013 8:47 | Pwr Unit Status | Power Unit | reports the power unit is powered off or being powered down - Deasserted
311 | 05.12.2013 8:47 | Pwr Unit Status | Power Unit | reports the power unit is powered off or being powered down - Asserted
310 | 05.12.2013 5:43 | BMC FW Health | Management Subsystem Health | 'P2 Therm Ctrl %' sensor has failed and may not be providing a valid reading - Asserted
309 | 05.12.2013 5:43 | BMC FW Health | Management Subsystem Health | 'P1 Therm Ctrl %' sensor has failed and may not be providing a valid reading - Asserted
308 | 05.12.2013 2:48 | BIOS Evt Sensor | System Event | reports OEM System Boot Event - Asserted
307 | 05.12.2013 2:47 | Mmry ECC Sensor | Memory | Uncorrectable ECC. CPU: 1, DIMM: B1. - Asserted
306 | 05.12.2013 2:47 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. - Asserted
305 | 05.12.2013 2:43 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. - Asserted
304 | 05.12.2013 2:43 | CATERR | Processor | reports it has been asserted - Deasserted
303 | 05.12.2013 2:43 | CATERR | Processor | reports it has been asserted - Asserted
302 | 04/29/2013 08:26:20 | PS2 Status | Power Supply | reports a predictive failure has been detected for the power supply - Deasserted
301 | 04/29/2013 08:26:19 | PS2 Status | Power Supply | reports a predictive failure has been detected for the power supply - Asserted
300 | 03/19/2013 15:15:50 | PS2 Status | Power Supply | reports a predictive failure has been detected for the power supply - Deasserted
299 | 03/19/2013 15:15:49 | PS2 Status | Power Supply | reports a predictive failure has been detected for the power supply - Asserted
298 | 03/18/2013 22:55:12 | Pwr Unit Redund | Power Unit | reports redundancy has been lost, but the unit is still functioning with the minimum amount of resources needed for normal operation - Deasserted
297 | 03/18/2013 22:55:12 | Pwr Unit Redund | Power Unit | reports redundancy has been lost - Deasserted
296 | 03/18/2013 22:55:11 | PS2 Status | Power Supply | reports a predictive failure has been detected for the power supply - Deasserted
295 | 03/18/2013 22:55:11 | Pwr Unit Redund | Power Unit | reports redundancy has been lost, but the unit is still functioning with the minimum amount of resources needed for normal operation - Asserted
294 | 03/18/2013 22:55:11 | Pwr Unit Redund | Power Unit | reports redundancy has been lost - Asserted
307 | 05.12.2013 2:47 | Mmry ECC Sensor | Memory | Uncorrectable ECC. CPU: 1, DIMM: B1. - Asserted
This is a hard failure on the DIMM, which most likely resulted in this error as a secondary message:
303 | 05.12.2013 2:43 | CATERR | Processor | reports it has been asserted
These two are very strange. I have seen similar events on early Engineering Sample processors, but not on production processors. They indicate that the BMC can't read the CPU temperature, so the fans will all go to 100%:
'P2 Therm Ctrl %' sensor has failed and may not be providing a valid reading - Asserted
It might be related to the DIMM, if the DIMM is hanging the I2C bus, but it is very strange.
I would recommend:
Replacing DIMM B1, as this is the error that took the system down.
Updating to the newest code stack release for BIOS, BMC, ME and FRUSDR (this may fix the PSU messages).
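When a SEL export is long, the triage above (find the uncorrectable ECC entry first, then treat CATERR as a likely secondary event) can be roughly sketched in a few lines of Python. This is only an illustration: it assumes the log has been exported as plain text with the same description wording as in this thread, and `count_uncorrectable_ecc` is a hypothetical helper name, not part of any Intel tool.

```python
import re
from collections import Counter

# Matches the description wording seen in this thread, e.g.
# "Uncorrectable ECC. CPU: 1, DIMM: B1. - Asserted".
# Other firmware versions may word the event differently (assumption).
ECC_RE = re.compile(r"Uncorrectable ECC\. CPU: (\d+), DIMM: ([A-Z]\d+)\.")

def count_uncorrectable_ecc(lines):
    """Count uncorrectable ECC events per (cpu, dimm) in SEL text lines."""
    counts = Counter()
    for line in lines:
        match = ECC_RE.search(line)
        if match:
            cpu, dimm = match.groups()
            counts[(int(cpu), dimm)] += 1
    return counts

# Three entries copied from the log in this thread.
sample = [
    "308 05.12.2013 2:48 BIOS Evt Sensor System Event reports OEM System Boot Event - Asserted",
    "307 05.12.2013 2:47 Mmry ECC Sensor Memory Uncorrectable ECC. CPU: 1, DIMM: B1. - Asserted",
    "303 05.12.2013 2:43 CATERR Processor reports it has been asserted - Asserted",
]

for (cpu, dimm), n in count_uncorrectable_ecc(sample).items():
    print(f"Uncorrectable ECC on CPU {cpu}, DIMM {dimm}: {n} event(s)")
```

A DIMM that shows up here is the first candidate for replacement; the other events are only worth chasing if they persist afterwards.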
Thanks.
Anyway, memory is the cheapest option.
An empirical question: the computer in an inexpensive car can tell exactly what is going on with the car. Why can't the computer inside a computer do the same?
You must have better luck with auto OBD codes than I have. I usually get 3 or 4 codes in my car and then have to figure out which of 3 or 4 components is bad.
Hmmm, you had 3 or 4 codes on your computer... Wonder if the same guy wrote the code?
Well, I actually get a few more codes than that from my automatic transmission diagnostics.
Conclusion: I got the best answer from Intel support. They simply suggested swapping the B1 DIMM with another DIMM.
I did that on Friday. Today the server went down twice with the same error, but for the other memory slot. We have now installed the same type of memory there, but with a different part number. I hope that settles it.
Thanks, everyone, for your time.
Hello everybody.
I have the same problem with my server. I bought it one year ago.
Please see the configuration:
Today I had two unexpected restarts, at 05:45 and 07:00. Six hours have passed since then, and the server is running fine.
I installed SEL Viewer and I see two errors like the topic starter's: one with a DIMM and one with CATERR.
Please see the SEL file: https://www.dropbox.com/s/xk3mdm8zdmqlr29/Sel12122013.sel
Unfortunately, I can only get into the server room 8 hours from now (it is closed at night).
Please tell me, what do I need to do in the morning?
I have no spare DIMM modules like this one, but I can buy one (though it will have a different part number). If I can't buy it tomorrow, is it possible to remove (not replace) the first module? Will the server work fine?
Thank you.
I'd suggest you replace DIMM D1 first.
I think it should be OK to temporarily remove the DIMM from D1 slot. Just remember that for each CPU, all blue DIMM slots need to be populated before the black slots.
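The population rule can be checked before pulling a module with a small sketch like the one below. The slot layout used here (blue slots A1 to D1 and black slots A2 to D2 for CPU 1) is a hypothetical example and should be replaced with the layout from the board's documentation; `population_errors` is likewise just an illustrative name.

```python
def population_errors(populated, blue_slots, black_slots):
    """Return (cpu, empty_blue, used_black) for each CPU that has a black
    slot populated while at least one of its blue slots is empty.

    populated: set of slot names that currently hold a DIMM
    blue_slots / black_slots: dict mapping CPU number -> list of slot names
    """
    errors = []
    for cpu, blues in blue_slots.items():
        empty_blue = [s for s in blues if s not in populated]
        used_black = [s for s in black_slots.get(cpu, []) if s in populated]
        if empty_blue and used_black:
            errors.append((cpu, empty_blue, used_black))
    return errors

# Hypothetical layout for one CPU; check the board manual for the real one.
BLUE = {1: ["A1", "B1", "C1", "D1"]}
BLACK = {1: ["A2", "B2", "C2", "D2"]}

# D1 removed, no black slots in use: the rule is still satisfied.
print(population_errors({"A1", "B1", "C1"}, BLUE, BLACK))

# D1 removed while A2 is still populated: the rule is violated.
print(population_errors({"A1", "B1", "C1", "A2"}, BLUE, BLACK))
```

Under this assumed layout, temporarily removing the D1 module is only consistent with the rule if none of that CPU's black slots are in use.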