We have an MFSYS25 chassis (details below) which has a strange but rather critical problem.
Whenever a drive in our RAID setup fails, the entire system locks up. The fans speed up, all of the compute modules restart but do not get anywhere, and we can barely manage the system because the management console hangs. What we end up having to do is completely unplug the system, plug the power back in, and then start the rebuild on the failed drive. Only then can we start the compute modules and get the various servers going again.
Has anyone else experienced this? What can we do to add some resiliency to the system?
Thank you for your time.
Details of our MFSYS25 System are as follows:
- Chassis Management Module: Part Number: D70735-403
- Server Storage Module: Part Number: D70737-404
- Gigabit Ethernet Switch: Part Number: D70739-404
- Six (6) Server Compute Modules: Part Number: D70726-404
- Firmware Versions:
  - Server 1: BMC Firmware: ok, 1.36.6; BMC Boot: ok, 0.10; BIOS: ok, SB5000.86B.10.00.0050.083120090939
  - Server 2: BMC Firmware: ok, 1.36.6; BMC Boot: ok, 0.10; BIOS: ok, ...
What firmware version is your chassis on? Might be worth loading the latest version if you haven't already.
Which drives are you using? Are they on the Intel compatibility list? Are they all the same firmware version?
With RAID in general I have seen some odd things where a single faulty disk can cause entire volumes to go offline or make other disks "disappear". Even faulty backplanes cause similar weirdness. If the same drive is repeatedly going offline, maybe start by replacing that drive or moving it to a different slot to see if the problem follows the drive or slot.
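If it helps, here is a rough sketch of the "does it follow the drive or the slot" check, assuming you keep your own note of each failure as a (slot, drive serial) pair. The log entries and the function name are made up for illustration; nothing here comes from the chassis or its management software.

```python
# Hypothetical failure log, one (slot number, drive serial) entry per failure.
# These values are invented for illustration, not read from any real system.
failures = [
    (3, "SN-A"),
    (3, "SN-A"),
    (5, "SN-A"),  # the same drive was moved to slot 5 and failed again
]


def likely_culprit(events):
    """Guess whether repeated failures track a drive or a slot."""
    slots = {slot for slot, _ in events}
    drives = {serial for _, serial in events}
    if len(drives) == 1 and len(slots) > 1:
        return "drive"  # one drive failing in several slots
    if len(slots) == 1 and len(drives) > 1:
        return "slot"   # several drives failing in one slot
    return "inconclusive"


print(likely_culprit(failures))
```

With only one slot and one drive on record the result stays "inconclusive", which is the point of swapping the drive to a different slot before drawing a conclusion.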
I had the same thing happen 2 weeks ago after one drive failed.
I was still able to connect to the configuration web page, but the drive rebuild sat at 0% for 5 hours. All VMs on internal storage were shut down; however, the VMs on the VTrak were still running. I was unable to bring the storage up again and had to pull all 4 power cables, as you described.
I have had drives fail in the past that rebuilt without problems.
To reduce the risk of a similar failure, I split my internal storage into 2 storage pools, hoping that only the pool with the broken drive will be affected if this happens again.