
Why is Matrix Manager marking a drive as bad that isn't bad?

idata
Employee
3,793 Views
Problem Description

I just built a new system. I have a pair of 640 GB drives mirrored as the system drive, and four 1 TB drives in RAID 10 for storage. Initially I had the storage volume set up as RAID 5. The processor is an i7 920 on an Asus P6T motherboard, running Windows 7 RC1 with 6 GB of DDR3. There are six usable SATA ports on the board (0 through 5). Ports 0 and 1 are used for the mirrored system volume; ports 2 through 5 are used for the storage volume.

Just after I finished building the machine and installing the OS (with the storage volume configured as RAID 5), the Matrix Storage Manager reported that the drive on SATA port 4 had failed. I removed the drive and replaced it (I had bought one extra drive as a spare), and the machine rebuilt the array. In the meantime, I installed the 'bad' drive in an external hard drive enclosure, formatted it to NTFS, and ran a disk check on it, including a scan for bad sectors. The disk check came back clean; no problems with the supposedly 'bad' hard drive.
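For anyone trying the same kind of check: a format plus a disk check only exercises the surface and won't show the SMART counters a RAID controller typically reacts to. Below is a minimal sketch, assuming the smartmontools package is installed and the suspect drive appears as /dev/sdb (both are assumptions; adjust for your system), that pulls the attributes most often behind unexpected drop-outs.

    import subprocess

    # SMART attributes most relevant to a drive being dropped from an array.
    SUSPECT_ATTRIBUTES = (
        "Reallocated_Sector_Ct",
        "Current_Pending_Sector",
        "Offline_Uncorrectable",
        "Temperature_Celsius",
        "UDMA_CRC_Error_Count",   # cabling or port problems often show up here
    )

    def check_drive(device):
        """Print the SMART attribute rows most relevant to unexpected RAID drop-outs."""
        result = subprocess.run(["smartctl", "-A", device],
                                capture_output=True, text=True, check=False)
        for line in result.stdout.splitlines():
            if any(attr in line for attr in SUSPECT_ATTRIBUTES):
                print(line)

    if __name__ == "__main__":
        check_drive("/dev/sdb")   # hypothetical path for the suspect drive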

A day later I had an issue with the drive at port 4 again. I took it out of the RAID array via the config utility and added it back without physically changing the drive. It rebuilt and ran fine for a while. I doubt very much that two new drives are bad, especially since the storage manager reported that it had added the 'new' drive to the array and that it was functioning normally, even though I hadn't actually put a new drive back in.

The next day I rebuilt the storage volume as RAID 10 for performance reasons. Last night, after about a week of no issues, I removed a drive from a set of two other SATA ports that Asus adds to the board, which serve no useful purpose that I can see. They support some 'special' features that I can't make heads or tails of (I wish they were just two more normal SATA ports); they call them JMstore or something like that. Anyway, I attached my SATA DVD R/W drive to one of those ports and attached the front-panel eSATA port to the second. The DVD drive works, but the system (Windows 7) didn't recognize that the external drive was connected, so I shut down, removed the external drive from the front panel, and attached it to an eSATA port built right into the back panel of the motherboard (on the back of the PC, with the sound output jacks, USB ports, etc.).

After that I restarted the machine and, lo and behold, the Matrix Storage Manager told me that the drive on SATA port 4 had failed again, and that the drive on SATA port 1 of the mirrored system volume was also showing as failed. The machine was not shut down abnormally while I was dealing with the eSATA drive issue. So I shut the system down and restarted it, and when the RAID configuration screen came up I marked those drives (the ones on ports 1 and 4) as not being part of the RAID sets, then added them back to their respective arrays and continued with the boot. The system came back up and rebuilt the drives; that is, it did not say, "those drives are failed and I can't add them to your RAID volumes." As with the first time, I believe the drives were not really bad, and that there is something wrong with either the Matrix Storage Manager or the motherboard.

Questions

As I sit here, I just checked and the Storage Manager is telling me that the drive on SATA port 4 has failed yet again. Why would the storage manager report a drive as bad when it is not? Could it instead be the motherboard's SATA port 4 that is bad, rather than the Matrix Storage Manager? And why?

I am definitely going to talk to the store where I bought it and will likely get it exchanged, but I am interested in any feedback from the people here, and especially from Intel.

Matrix Storage Manager 'Storage Report' follows:

---------------------------------------------------------------------------------------------

System Information

Kit Installed: 8.9.0.1015
Kit Install History: 8.9.0.1015, Uninstall
Shell Version: 8.9.0.1015
OS Name: Microsoft Windows 7 Ultimate
OS Version: 6.1.7100 Build 7100
System Name: TBONE
System Manufacturer: ASUSTeK Computer INC.
System Model: P6T
Processor: Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
BIOS Version/Date: American Megatrends Inc. 0603, 05/19/2009
Language: ENU

Intel(R) Matrix Storage Manager

Intel RAID Controller: Intel(R) ICH8R/ICH9R/ICH10R/DO/PCH SATA RAID Controller
Number of Serial ATA ports: 6
RAID Option ROM Version: 8.0.0.1038
Driver Version: 8.9.0.1015
RAID Plug-In Version: 8.9.0.1015
Language Resource Version of the RAID Plug-In: 8.9.0.1015
Create Volume Wizard Version: 8.9.0.1015
Language Resource Version of the Create Volume Wizard: 8.9.0.1015
Create Volume from Existing Hard Drive Wizard Version: 8.9.0.1015
Language Resource Version of the Create Volume from Existing Hard Drive Wizard: 8.9.0.1015
Modify Volume Wizard Version: 8.9.0.1015
Language Resource Version of the Modify Volume Wizard: 8.9.0.1015
Delete Volume Wizard Version: 8.9.0.1015
Language Resource Version of the Delete Volume Wizard: 8.9.0.1015
ISDI Library Version: 8.9.0.1015
Event Monitor User Notification Tool Version: 8.9.0.1015
Language Resource Version of the Event Monitor User Notification Tool: 8.9.0.1015
Event Monitor Version: 8.9.0.1015

Array_0000

Status: No active migrations
Hard Drive Data Cache Enabled: Yes
Size: 1192.3 GB
Free Space: 0 GB
Number of Hard Drives: 2
Hard Drive Member 1: WDC WD6400AAKS-00A7B0
Hard Drive Member 2: WDC WD6400AAKS-00A7B0
Number of Volumes: 2
Volume Member 1: Win7SystemRAID1
Volume Member 2: temp

Array_0001

Status: No active migrations
Hard Drive Data Cache Enabled: Yes
Size: 3726 GB
Free Space: 0 GB
Number of Hard Drives: 4
Hard Drive Member 1: ST31000528AS
Hard Drive Member 2: ST31000528AS
Hard Drive Member 3: ST31000528AS
Hard Drive Member 4: ST31000528AS
Number of Volumes: 1
Volume Member 1: Storage_RAID10

Win7SystemRAID1

Status: Normal
System Volume: No
Volume Write-Back Cache Enabled: No
RAID Level: RAID 1 (mirroring)
Size: 300 GB
Physical Sector Size: 512 Bytes
Logical Sector Size: 512 Bytes
Number of Hard Drives: 2
Hard Drive Member 1: WDC WD6400AAKS-00A7B0
Hard Drive Member 2: WDC WD6400AAKS-00A7B0
Parent Array: Array_0000

XX...

0 Kudos
10 Replies
idata
Employee
862 Views

Mine does that too, on the same port.

I narrowed it down to 2 possibilities:

a) Port 4 is bananas (b-a-n-a-n-a-s!)

b) It fails because of a SMART event. In my setup, I have two sections of three drives each via RAID hot-plug interfaces. The first disk is in slot 3 (the last slot of group 1), and the other three are in group 2 (FFU-UUU, with F free and U used; the drive order is 1-2-3-4). It's like that because of a recent migration. Anyway, as you can see, the 4th drive is at the bottom. After pulling it out and testing it, it was fine EXCEPT that the SMART log listed a SMART EXCEEDED event with a temperature of 69 C (the max is 60 or so).

The controller might have spat the drive out because of a SMART failure, even if the drive itself is good. Mind you, the first drive it spat out seemed fine too: rebuild, spit, rebuild. After two months or so it actually started developing bad sectors, most likely cooked. So even if it seems fine now, it might still be on its way out. If it's an enterprise drive it will have a SMART log; if not, it might. Run a SMART test.

Also, you might want to touch the drive if it fails again. If you can _just_barely_ keep your finger on the hottest spot, it's probably near maxtemp.

Also, I'd pay Intel good money for the ability to see SMART events. Not necessarily to expose the drive directly, just to pass the events on in a log or something. When the drive gives a clue, write it to a text file: "Drive 4 max temperature exceeded." That would be invaluable when deciding whether to rebuild or replace.
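In the meantime, something along these lines can be cobbled together outside of IMSM with smartmontools. This is only a rough sketch of the kind of watcher I mean, not an Intel feature; the device list, the 60-degree threshold, and the log file name are all made up for illustration:

    import datetime
    import subprocess
    import time

    DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]   # hypothetical device paths
    MAX_TEMP_C = 60                 # assumed drive temperature limit
    LOG_FILE = "smart_events.log"   # arbitrary log file name

    def read_temp(device):
        """Return the drive temperature reported by 'smartctl -A', or None if absent."""
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True, check=False).stdout
        for line in out.splitlines():
            if "Temperature_Celsius" in line:
                return int(line.split()[9])   # raw value column of the attribute row
        return None

    while True:
        for dev in DEVICES:
            temp = read_temp(dev)
            if temp is not None and temp > MAX_TEMP_C:
                with open(LOG_FILE, "a") as log:
                    log.write("%s %s max temperature exceeded: %d C\n"
                              % (datetime.datetime.now().isoformat(), dev, temp))
        time.sleep(60)   # poll once a minute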

idata
Employee
862 Views

Unfortunately, temperature doesn't seem to be the issue here. The SMART reports for all the drives show maximum temperatures of only around 40 or 41 degrees. If they looked way too high, at least that would tell me something.

0 Kudos
idata
Employee
862 Views

I'd still try a hands-on feel of the drives at high activity or on failure.

Also, drives get thrown out when a counter starts increasing, before they actually fail, such as the reallocated sector count and the like. My drive actually failed a while later, so I don't think the controller is simply bananas.

You can try swapping drives. Since it's a stripe over mirrors, it should be safe (and it will be, because the order doesn't matter) to swap drives 3 and 4. If it then spits out drive 3 on port 4, it might be the port after all. Mine never complained about any other port, but the drives it did complain about failed soon after, so it might not be bogus. I have three failed drives here.

0 Kudos
idata
Employee
862 Views

I had four Seagate 1.5 TB 7200.11 series drives in a RAID 10 for months under Vista 64-bit with no issues under IMSM 8.8. I upgraded to Windows 7 RC1 and IMSM 8.9.0.1015, and every day a drive was marked as "failed": sometimes the same one, sometimes a different one; sometimes the same port, sometimes a different one. Each time I marked the drive as "normal" and rebuilt the array, and another one would be marked as "failed" again within a day or so. I downgraded back to Vista and IMSM 8.8, and there have been no more problems; it's been a few weeks now.

The inescapable conclusion: IMSM 8.9.0.1015 does not work properly under Windows 7 RC1, at least for some users.

0 Kudos
idata
Employee
862 Views

I didn't know IMSM could mark drives as failed. I believe that's the controller's job.

0 Kudos
idata
Employee
862 Views

Not sure what you mean by "it's the controller's job". You mean the actual ICH chip? Yes, it does the actual "failed" marking, being a piece of hardware, but it is itself controlled by software, the drivers in the IMSM package.

If you mean that the controller *exclusively* decides whether to mark a drive as failed, I don't believe that's correct. IMSM decides to do the actual marking based somehow on perceived controller status, and I believe IMSM 8.9.0.1015 is making bad decisions under Windows 7 RC1.

0 Kudos
idata
Employee
862 Views

Well, I don't know how it works; if I did, I wouldn't be here with an open question.

However, I'm pretty sure that the IMSM is event-driven and receives events from the controller. If it didn't, it would have to keep polling for errors every so often, which would be bad for hot-plug. So the IMSM needs an event, and that event is hardware-based. It is possible that one version ignores, say, a SMART temperature alert below 50 degrees while another version doesn't; I couldn't say, but it's possible.

The IMSM just kicking drives out for no actual reason and no provocation, I find that hard to believe. I'll point out that, as in my previous post, the key word is "I believe" it does or doesn't do that. With no documentation or specifications it's speculation and I could be dead wrong. I have a few years as a coder behind me, and while that gives me insight into implementation techniques, it does nothing in the sense of sniffing out real life.

Since the darned thing doesn't even have an option to kick out a working drive on user request (while it's OK and fully rebuilt), though I wish it did so I could diagnose the drive while it's still inside, I doubt it simply kicks stuff out. Maybe some event filtering went berserk in W7; this is possible, even likely. Unprovoked, however, I doubt it.

Oh, and I've had bad experiences with non-enterprise drives. Internal error-recovery algorithms delay the drive's response; something as small as a thermal recalibration could get it spat out of the RAID. There's a reason RE (WD) and NS (Seagate) drives are twice the price. Well, were; they're getting cheaper now.

The only thing I'm sure about is this (I'll say it again): all the drives my controller has spat out either were bad or went bad soon after. By soon I mean months. They usually fail faster and faster until I pull them out.

0 Kudos
idata
Employee
862 Views

I don't know anything for sure either, but for the record, both IMSM 8.8 and 8.9.0.1015 had the hourly/daily failure-marking issue under Windows 7 RC1. I bet there's a reason 8.9.0 has not been officially released yet, and that this is part of it. No official IMSM release has yet been declared Windows 7 compatible, and I'm sure that isn't just laziness on Intel's part.

0 Kudos
idata
Employee
862 Views

I think I might have a solution, but if so, it is kind of a crappy one.

On the two arrays, I set "Hard Drive Data Cache Enabled" to 'No'.
I already had "Enable Volume Write-Back Cache" set to 'No' for the volumes. Ever since then, I've not had a problem. I don't like it, though, as it removes some of the performance perks of RAID, but since this setup doesn't have battery backup, it is rather sensible. It has only been about 5 days without an issue, so I won't say it is 'fixed', but considering this was happening almost daily, it is looking pretty good.
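To give a feel for what that tradeoff costs, here is a rough sketch: with caching on, writes are acknowledged from volatile memory; with it off, each write has to reach stable media before the call returns. The snippet below only mimics that at the filesystem level with fsync; it is not how the IMSM driver implements the cache setting, and the file name and sizes are arbitrary.

    import os
    import time

    def timed_writes(path, flush_each, count=200):
        """Write `count` 4 KiB records, optionally forcing each one to stable storage."""
        start = time.perf_counter()
        with open(path, "wb") as f:
            for _ in range(count):
                f.write(b"\0" * 4096)
                if flush_each:
                    f.flush()
                    os.fsync(f.fileno())   # do not return until the data is on disk
        os.remove(path)
        return time.perf_counter() - start

    if __name__ == "__main__":
        print("cached writes :", timed_writes("cache_test.bin", flush_each=False))
        print("flushed writes:", timed_writes("cache_test.bin", flush_each=True))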

I have a 3ware card coming that has its own RISC processor on board as well as 128 MB (expandable) of DDR2 RAM AND battery backup. 😉 The only problem is that now I have to buy one for my other tower PC; I can see it is feeling jealous.

0 Kudos
idata
Employee
862 Views

Your issue sounds the same as what's described in this thread: http://communities.intel.com/thread/5036

Did 8.9 remain stable with the hard drive cache turned off?

gg

0 Kudos