"Reset to device" Megasr1

SDoug3 · 08-07-2012

Hello,

My server (S5520HC) is reporting in the system event log: "Reset to device \device\raidport0 was issued megasr1." There is also a report of a bad block.

I have updated all of the system's firmware and started a chkdsk. No newer drivers have been released since the server was initially installed. Windows security updates are fully applied. The system runs Windows Server 2008 R2 Core, and the RAID reports fine within the RAID BIOS and is online. The system is configured with three 1 TB drives in RAID 5.

Does anyone have suggestions for additional things to do or check? I would rather not use whiz-bang third-party utilities if at all possible. As far as I can tell, the resets started only recently.

Thank you,

Steve

AP16 · 08-07-2012

LSI hardware and software still do not expose full SMART statistics for individual HDDs in a RAID array. The only way to check current HDD health is to take the array offline, connect the disks to the motherboard individually, and read their SMART reports with an advanced SMART-capable tool such as Sandra or Everest.
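
If the disks are attached individually under Linux, smartctl can read the same SMART attributes. Here is a minimal sketch, assuming the usual smartctl -A table layout (attribute name in the second column, raw value in the tenth). The function name is hypothetical, the sample lines are made up purely to show the filter working, and attribute names can differ between drive models.

```shell
#!/bin/sh
# Sketch: pull the raw counts of two bad-sector attributes out of
# "smartctl -A" output. Field 2 is the attribute name, field 10 the raw value.
# bad_sector_counts is a hypothetical helper name, not part of smartmontools.
bad_sector_counts() {
    awk '$2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" { print $2, $10 }'
}

# Real use on a directly attached disk (needs root):
#   smartctl -A /dev/sda | bad_sector_counts

# Made-up sample lines, only to demonstrate the filter:
bad_sector_counts <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
EOF
```

Non-zero and growing raw values for either attribute would fit the bad-block report in the event log.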

Resets usually happen when an HDD takes too long to complete an operation (for example, when a real bad block turns up on a platter and the drive starts remapping it to the spare area). RAID controllers limit how long they will wait for an HDD to respond, so if the HDD spends too long on error recovery, the controller treats it as a lost physical connection and resets the channel.

Enterprise HDDs have a special feature for dealing with this problem, called SCT Error Recovery Control. It defines the maximum time the HDD may spend on read/write error recovery. Desktop HDDs set this time to 0, i.e. no limit, while enterprise (RAID-edition) HDDs usually limit it to about 7 s. With the limit in place, the HDD reports the incomplete operation to the RAID controller in time, the controller reissues the operation without a link reset, and the HDD completes its error recovery in the background.

An unusually long response time generally means the disk is close to failing, but sometimes the disk is still fine and only needs the SCT limit set. Because RAID-edition and desktop disks are usually based on the same hardware and differ only in firmware settings and test duration, the SCT Error Recovery Control time can often be changed from the default of 0 to something suitable for a RAID setup, provided the drive firmware accepts the command. In Linux the SCT time can be set with the smartctl tool (smartctl -l scterc /dev/sda shows the current value in units of 0.1 s); in other OSes you need a utility from the HDD manufacturer or a Linux Live CD.
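
As a rough sketch of the smartctl usage mentioned above: the script below only prints the commands instead of running them, so it is safe to try anywhere. The helper names, the /dev/sda device, and the 7-second value are illustrative assumptions. The scterc,READ,WRITE form takes both limits in units of 0.1 s, so 7 s is written as 70.

```shell
#!/bin/sh
# Sketch: build smartctl commands for SCT Error Recovery Control (ERC).
# Helper names are hypothetical; the commands are printed, not executed.

# Convert whole seconds to the 0.1 s units that smartctl expects.
to_deciseconds() {
    echo $(( $1 * 10 ))
}

# Print the command that shows the current ERC setting of a device.
scterc_show_cmd() {
    echo "smartctl -l scterc $1"
}

# Print the command that sets the read and write ERC limits of a device.
# The second argument is the limit in seconds (defaults to 7).
scterc_set_cmd() {
    device=$1
    seconds=${2:-7}
    units=$(to_deciseconds "$seconds")
    echo "smartctl -l scterc,${units},${units} ${device}"
}

scterc_show_cmd /dev/sda      # prints: smartctl -l scterc /dev/sda
scterc_set_cmd /dev/sda 7     # prints: smartctl -l scterc,70,70 /dev/sda
```

The printed commands need root when run for real. Some drives reject the command, and on many drives the setting does not survive a power cycle, so it is worth re-checking the value after a reboot.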

Another potential source of disk timeouts is the HDDs' write cache. Check that the HDD write cache is disabled for the array in the RAID BIOS/console settings. In rare cases the RAID firmware is unable to switch off the write cache of the attached desktop HDDs, so again a manufacturer-supplied tool is needed to do this on each individual HDD.
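
If no manufacturer tool is at hand, a Linux Live CD with hdparm is one possible alternative (an assumption on my part, not something the RAID vendor documents). The sketch below only prints the commands for each disk. The helper name and the three device names are hypothetical stand-ins for the members of the array.

```shell
#!/bin/sh
# Sketch: print the hdparm commands that query and disable the on-disk
# write cache for each device given. Nothing is executed against the disks.
# write_cache_cmds is a hypothetical helper name.
write_cache_cmds() {
    for device in "$@"; do
        echo "hdparm -W ${device}"      # query the current write-caching state
        echo "hdparm -W 0 ${device}"    # switch write caching off
    done
}

# Assumed device names for a three-disk array:
write_cache_cmds /dev/sda /dev/sdb /dev/sdc
```

Depending on the drive, the setting may revert after a power cycle, so it is worth querying it again afterwards.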

SDoug3 · 08-07-2012

Mr. JFFulcrum, thank you for the clarification. My drives are normal desktop-grade SATA drives. I figured that going with the internal RAID controller (activated by a chip of sorts on the motherboard) would be sufficient.

Thanks for your insight. I am going to look into the Live CD and the HDD manufacturer's information. I appreciate it.