I have a four-disk RAID 10 that has been working rather badly for a couple of months now. At least once a week a rebuild starts, with no indication of the reason. It takes several hours and reports no errors, but later the problem recurs. The disks are four identical Seagate ST31500341AS drives, with no problems when tested separately. No spin-down is configured; the problem usually occurs during normal use. What seems really bad is that there is no indication whatsoever of why the rebuilds are triggered. Only once, out of the many times the problem occurred, did I find the following in the Windows event log:
Date: 2011-02-08 15:21:18
Disk on port 0: Removed.
And a second later:
Date: 2011-02-08 15:21:19
Disk on port 0: Detected.
Other configuration details: Motherboard Asus P5Q with ICH10R, RAID option ROM version: 18.104.22.1688.
I've seen a lot of similar problems mentioned here, which makes me think the problem is not really in the disks.
Thanks for any suggestions.
Does it happen to the same disk?
Try a new SATA cable on the failed disk.
Do not go by port number; go by serial number. Also check whether the failed drive is on the same power cable as your other disks, then swap it to a working SATA power connector and see if the same disk fails again.
Next time it fails, wipe the drive before it rebuilds and run SeaTools with the Long Drive Self Test and Long Generic tests.
Unfortunately I don't know if the problem always happens on the same disk. Except for that single message in the Windows event log mentioning a port number, there is usually no clue as to why a rebuild started. I just notice the rebuild going on, and when I see it, no disk is marked as failed, and I don't know of a way to tell which disk is being rebuilt. It all happens automatically.
If I look in the Windows event log usually I just see a message that a volume is "degraded", immediately followed by a "rebuilding in progress".
And I can't find anything configurable in the RST software that would give a better log of events or stop the "automatic" rebuilds (I also expected to be able to see, during a four-disk RAID 10 rebuild, which disk is being rebuilt, but I don't see anything like that).
Anyway, I will check the cables again and try swapping the power ones.
Unfortunately, most of the time there is no mention of a specific disk in the logs. I just get a generic:
The device, \Device\Ide\iaStor0, did not respond within the timeout period.
in the System log (here \Device\Ide\iaStor0 refers to the whole controller).
And then I have a "volume degraded" immediately followed by "Rebuilding in progress" in the Application log.
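Since RST itself gives so little detail, one workaround I'd try is exporting the System and Application logs to text and correlating the controller timeout events with the "degraded" events by timestamp, to see whether they always come in pairs. Here is a minimal sketch; the tab-separated line format below is my assumption, so adjust the parsing to whatever your export actually looks like:

```python
from datetime import datetime, timedelta

# Hypothetical exported log lines: "timestamp<TAB>message".
# A real event log export will differ; adjust the split/parse accordingly.
SAMPLE = """\
2011-02-08 15:21:18\tThe device, \\Device\\Ide\\iaStor0, did not respond within the timeout period.
2011-02-08 15:21:19\tVolume degraded
2011-02-08 15:21:20\tRebuilding in progress
"""

def parse(lines):
    """Turn exported log text into a list of (timestamp, message) tuples."""
    events = []
    for line in lines.splitlines():
        ts, msg = line.split("\t", 1)
        events.append((datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), msg))
    return events

def correlate(events, window=timedelta(seconds=5)):
    """Pair each controller timeout with any 'degraded' event shortly after it."""
    timeouts = [e for e in events if "did not respond" in e[1]]
    degraded = [e for e in events if "degraded" in e[1].lower()]
    pairs = []
    for t_ts, _ in timeouts:
        for d_ts, d_msg in degraded:
            if timedelta(0) <= d_ts - t_ts <= window:
                pairs.append((t_ts, d_ts, d_msg))
    return pairs

for t_ts, d_ts, msg in correlate(parse(SAMPLE)):
    print(f"timeout at {t_ts} followed by '{msg}' at {d_ts}")
```

If every "degraded" event is preceded by a controller timeout, that points at drives dropping off the bus (cabling or error recovery) rather than at actual data errors.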
My volumes are listed as "Initialized".
I'm not doing any overclocking.
You might have incompatible disk drives.
I had problems like this before. Your drive goes into some "internal check" (error recovery). That internal check does not finish within a certain time, so your RAID controller sees the drive as offline and triggers the error.
The "RAID" drives you have to pay more for (which is bullsh*t, and I'll explain why shortly) support TLER: Time Limited Error Recovery. It basically limits the time spent on these internal checks so that the drive does not time out.
The BS part is that on some earlier versions of certain Seagate drives you could enable TLER on non-"RAID" disk drives. But they removed that feature in a firmware upgrade, or discontinued those drives, or withdrew the TLER.EXE utility that let you do it. You can still find it on the web, though.
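On drives that still expose the setting, TLER (called SCT Error Recovery Control in smartmontools) can be read with `smartctl -l scterc /dev/sdX`. Below is a small sketch that parses that output and flags a drive whose internal recovery time exceeds a typical controller timeout; the sample output text and the 8-second controller timeout are my assumptions, so check both against your own drive and controller:

```python
import re

# Assumed shape of `smartctl -l scterc` output (values in tenths of a second).
# Verify the exact format against your smartmontools version.
SAMPLE_OUTPUT = """\
SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)
"""

def parse_scterc(text):
    """Return {'Read': seconds, 'Write': seconds} from smartctl scterc output."""
    settings = {}
    for kind, deciseconds in re.findall(r"(Read|Write):\s+(\d+)", text):
        settings[kind] = int(deciseconds) / 10.0
    return settings

def exceeds_controller_timeout(settings, controller_timeout_s=8.0):
    # If the drive's internal error recovery can run longer than the
    # controller is willing to wait, the controller may drop the drive
    # and trigger exactly this kind of spurious rebuild.
    return any(t > controller_timeout_s for t in settings.values())

settings = parse_scterc(SAMPLE_OUTPUT)
print(settings)
print(exceeds_controller_timeout(settings))
```

A drive with ERC disabled (or not supported) can spend far longer than any controller timeout on a bad sector, which matches the symptom described above.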
As I said this might be your problem.
Hope this helps.
I've seen other mentions of this disk "internal check"; I will try to find out more...
Assuming this is the problem, my first reaction is to put the blame on the RAID controller/software, not on the disks. The RAID controller should simply tolerate this (with reduced performance, if necessary); after all, RAID means "redundant array of INEXPENSIVE disks" :-)