Rapid Storage Technology
Intel® RST, RAID

Does RST provide a means to "make volume happy"?

LBeaz
Beginner

Powering up this morning, RST told me I had a problem with a RAID volume. The RST Console indicated a RAID 5 Volume had failed because two RAID member drives had gone "missing". It also showed two "new" non-RAID "unassigned" drives on the "missing" drives' ports.

I rebooted the box and entered the RST Option ROM and found the same status.

The system shut down cleanly last night, and I do not have Write Caching enabled, so I know the RAID 5 Volume was good and solid at shutdown.

Booting to a USB drive with troubleshooting tools, I have checked out the SMART data on both "missing" drives and there are no errors. I have done several complete read passes of both drives (one pass doubled as a backup copy to two spare drives), and there were no errors during any of the read passes. I also read the partition structures on both drives with a partition reader, and it successfully identified the partitions on both drives with no errors.
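
For anyone wanting to do the same kind of non-destructive check, here is a minimal sketch of a full read pass over a raw disk device (purely illustrative; the device path is a hypothetical placeholder and the disk is only ever opened read-only):

```python
# Minimal read-pass sketch: read an entire raw device front to back to
# confirm every sector is readable. The device path is hypothetical and
# must be pointed at the right disk; the disk is opened read-only.
import sys

DEVICE = "/dev/sdX"          # hypothetical placeholder for the suspect drive
CHUNK = 4 * 1024 * 1024      # read 4 MiB at a time

total = 0
with open(DEVICE, "rb") as disk:
    while True:
        try:
            data = disk.read(CHUNK)
        except OSError as err:
            print(f"Read error at byte offset {total}: {err}")
            sys.exit(1)
        if not data:
            break
        total += len(data)

print(f"Read {total:,} bytes with no errors")
```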

Given that the drives checked out, the only thing I can think of is that the two drives did not come ready in time for the RST driver to see them, so it marked them as missing in the status it keeps for the RAID member drives found during that boot sequence; and of course, with two drives "missing" from a RAID 5 volume, it also marked the RAID 5 Volume as failed.
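
Just to spell out the logic I am assuming here (my own illustration, not Intel's actual code): a RAID 5 volume can keep serving data with one member missing, but not with two, so two "missing" members are enough for the driver to declare the volume failed:

```python
# Illustrative only: my guess at the status logic involved, not RST's code.

def raid5_volume_state(missing_members: int) -> str:
    """Derive a RAID 5 volume state from how many member drives are missing."""
    if missing_members == 0:
        return "NORMAL"
    if missing_members == 1:
        return "DEGRADED"   # parity can still reconstruct the missing member
    return "FAILED"         # the volume can no longer be served

print(raid5_volume_state(2))   # what I saw this morning: FAILED
```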

Since I know there was no data lost/deleted from the two "missing" drives, I believe this is just a simple matter of overriding the status of the two missing drives so I can put them back into the RAID volume essentially "making the volume happy" again.

If I could do this kind of override, then I would be in a position to verify whether the data was any good or not as opposed to just assuming the data is lost and starting over by deleting what is left of the current RAID 5 volume and creating a new RAID 5 volume.

Is this kind of override manipulation of Volume and Drive status available in RST, and if not, why not?

9 Replies
idata
Employee

Hello Lance_Beazley,

Thank you for contacting the Intel community.

RAID 5 is a redundant array, which means it can survive the failure of one of its member disks. If the failure of your RAID 5 affects only one member disk, then you are lucky and can easily get your data back. In the case of multiple disk failures, there is zero chance of recovering the array data.
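
As a purely illustrative sketch of why exactly one failure is survivable (this shows the general RAID 5 parity idea, not RST's internal implementation):

```python
# General RAID 5 parity idea (illustrative, not RST code): parity is the
# XOR of the data blocks, so one missing block can be rebuilt from the
# survivors, but two missing blocks cannot.

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data_a = b"AAAA"
data_b = b"BBBB"
parity = xor_blocks([data_a, data_b])      # stored on the third member

# One member missing: rebuild it from the remaining member plus parity.
assert xor_blocks([parity, data_b]) == data_a

# Two members missing: only one piece survives, so nothing can be rebuilt.
```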

Regards,

Ivan U.

LBeaz
Beginner

Ivan U.

Thank you for the reply. However, it does not answer my question, and I did not indicate that the drives had failed.

What I asked was,

Is the ability to override Volume and Drive status available in RST, and if not, why was the decision made not to include it?

Lance

idata
Employee

This option is only available if you can recover the RAID from IRST; you will need to click on the HDD and see if the option to reset to normal or rebuild to this disk is available. But if a RAID 5 shows as failed, you will need to create the whole RAID volume again.

When a RAID shows as failed, none of these options is available.

Ivan U.

LBeaz
Beginner

And I would argue the higher level point that the RAID volume should not be thought of as FAILED in a situation where drives have gone missing.

First, know that I do understand that when more drives actually fail than a RAID type can recover from, there is no possible recovery of the data.

I asked the question the way I did because (having some background in RAID and disk drives) I was quite certain that in my specific situation the drives had not failed, and that no writes had been done to the volume set other than the "FAILED" status update RST made to the RAID metadata on the remaining drives in the volume set when it determined the two drives had gone missing.

Over the next few days I searched the web and uncovered dozens of other posts, dating as far back as 2006, in which other people claimed RST (or its prior forms) had dropped more than one member drive out of a RAID volume set. They, too, were pretty certain their drives had not actually failed, and all of them were looking for a way to put the drives back into the volume set without going through an entire delete, create, initialize cycle.

I even stumbled across a few posts from Intel support people in which the end users' claims were acknowledged, but the support people indicated that Intel's RST development team had not been able to reproduce the situation, so no fix was in the works because a problem had not been identified.

So, don't think of this situation as a problem. It's not a problem. If members of a RAID set go missing, they are indeed missing. However the drives went missing (a procedural problem, a mistake pulling drives, a bug in software, whatever), they went missing. And if the number that went missing is beyond the ability of the RAID set to continue serving data, then the RAID set can't be accessed.

This should not mean the RAID set is FAILED though.

A RAID set should only be FAILED once the "missing" drives have indeed been determined to have failed and that isn't a feature that RST was written to provide.

There are many, many possibilities for why drives could go missing from a RAID set and most have nothing to do with actual drive failures.

Letting RST assume that missing drives have failed radically reduces RST's ability to provide its primary reason for being: protection of data from a drive failure.

In all design decisions like this, the designer should be taking the side of data. Data is the valuable thing and in this particular case, its side is not being taken.

BTW, it is really easy to reproduce this situation. Simply create a RAID 5 volume, initialize it, put some data on it, shut the system down, remove two of the RAID set's drives, and power the system back up; the RAID 5 Volume is now FAILED because two drives are missing.

I have made use of RST in its various forms over many years. It has been a great product and it still is, but with the amount of data we end users are now keeping, it becomes a huge problem to restore a "failed" RAID from backups simply because of the size of the arrays and the amount of time it takes to pull the data from wherever it has been backed up to. Of course, this is what has to happen when a RAID volume actually has multiple failed drive members.

When the drives have not failed it is a huge waste of time and creates an additional possibility for lost data due to handling and other mistakes when restoring backups.

And just to emphasize how un-failed my RAID 5 volume was/is: I was easily able to "recover" the original volume via a partition tool and a set of steps that several other end users developed through trial and error over the years of posts. The data verification, via binary compares against backup data, finally finished last night with no errors found.
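
For reference, the compare step itself is simple; here is a minimal sketch of the kind of chunked binary compare I mean (the file paths are hypothetical placeholders):

```python
# Minimal chunked binary compare sketch (file paths are hypothetical).
# Reads both files in fixed-size chunks and reports the first mismatch.

CHUNK = 1024 * 1024   # compare 1 MiB at a time

def binary_compare(path_a: str, path_b: str) -> bool:
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(CHUNK)
            b = fb.read(CHUNK)
            if a != b:
                print(f"Mismatch in the chunk starting at byte {offset}")
                return False
            if not a:                  # both files ended at the same point
                return True
            offset += len(a)

if binary_compare("recovered/archive.bin", "backup/archive.bin"):
    print("No differences found")
```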

idata
Employee

Thank you for sharing this information with us.

That is right, you can easily reproduce the issue with a RAID 5 if you remove two of the three installed HDDs, and that is normal behavior: you are breaking the RAID volume and forcing it to fail. The HDDs are OK, but the RAID will be broken.

Why the HDDs disconnect from the RAID is hard to tell; it could be a BIOS issue, a SATA cable issue, a power failure, etc. It is very difficult to tell why they get removed from the RAID.

I'm sorry for this inconvenience. I'm glad that you were able to recover your RAID, and thank you for taking the time to let us know.

Best wishes,

Ivan U.

LBeaz
Beginner

It is indeed difficult to tell why HDDs get disconnected from a RAID Volume.

The key takeaway here is that this does happen, and most of the time it is due to reasons outside of RST's control.

And when the RAID becomes inaccessible because of the disconnected HDDs, its status should be thought of as OFFLINE or INACCESSIBLE as opposed to FAILED.

So how would one go about suggesting an enhancement to the RST Option ROM code so that a user could choose whether or not the upper-level RST Windows driver is told to initialize a newly created RAID Volume?

From RST's perspective, that is the only change that needs to be made to give control back to the user in these "HDDs disconnected from the RAID" cases.

In my recovery case, I had to uninstall the upper-level RST Windows driver to keep it from initializing the RAID Volume after I deleted and recreated it so that the RAID Volume metadata included the missing drives.

If I could tell it not to do this as part of the creation process in the Option ROM code, then I would not have had to uninstall it, and the recovery process would have been completely under my control.

idata
Employee

I understand, and thank you for your feedback. Please bear in mind that IRST is not a fault-tolerant solution; it is a software RAID solution. If you need something more reliable, you will need to get a RAID controller card, which is a hardware RAID solution.

Regards,

Ivan U.

LBeaz
Beginner

I'll give it up here, but I have to say that we aren't talking about fault tolerance here, and I can't actually believe you just took the stand you did.

I'm just trying to point out that if IRST simply viewed the status of a RAID Volume a bit differently and provided a single knob to control initialization behavior, the application would be 100% safer with respect to the data it is trying to protect.

idata
Employee

I'm sorry if I misunderstood; I apologize. I understand what you are saying, and all feedback is very much appreciated.

Regards,

Ivan U.