MFSYS25 - One Drive in Storage Array Failing Causes Entire Chassis to Fail?

idata
Employee

Hello,

We have an MFSYS25 chassis (details below) which has a strange but rather critical problem.

Whenever a drive fails in our RAID setup, the entire system locks up. The fans speed up, all of the compute modules restart but never finish booting, and we can barely manage the system because the management console hangs. We end up having to completely unplug the system, plug the power back in, and then start the rebuild on the failed drive. Only then can we start the compute modules and get the various servers going again.

Has anyone else experienced this? What can we do to add some resiliency to the system?

Thank you for your time.

Details of our MFSYS25 System are as follows:

  1. Chassis Management Module: Part Number: D70735-403
  2. Server Storage Module: Part Number: D70737-404
  3. Gigabit Ethernet Switch: Part Number: D70739-404
  4. Six (6) Server Compute Modules: Part Number: D70726-404
  5. Firmware Versions:
    1. Server 1: BMC Firmware 1.36.6 (ok), BMC Boot 0.10 (ok), BIOS SB5000.86B.10.00.0050.083120090939 (ok)
    2. Server 2: BMC Firmware 1.36.6 (ok), BMC Boot 0.10 (ok), BIOS (ok) ...
2 Replies
idata
Employee

What firmware version is your chassis on? Might be worth loading the latest version if you haven't already.

Which drives are you using? Are they on the Intel compatibility list? Are they all the same firmware version?

With RAID in general I have seen some odd things where a single faulty disk can cause entire volumes to go offline or make other disks "disappear". Faulty backplanes can cause similar weirdness. If the same drive is repeatedly going offline, maybe start by replacing that drive or moving it to a different slot to see whether the problem follows the drive or the slot.
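
If it helps, the "does it follow the drive or the slot" test can be written down as a small check over a hand-kept list of failures. This is only a rough sketch: the `diagnose` function, the event format, and the sample slot numbers and serials are all made up for illustration and do not come from the MFSYS25 or its management console.

```python
# Hypothetical, hand-recorded failure events: (slot number, drive serial).
# The values below are invented for illustration only.
events = [
    (3, "SN-A"),  # drive SN-A dropped out in slot 3
    (5, "SN-A"),  # same drive dropped out again after moving to slot 5
]


def diagnose(events):
    """Return 'drive', 'slot', or 'inconclusive' for a list of (slot, serial) failures."""
    slots = {slot for slot, _ in events}
    serials = {serial for _, serial in events}
    # One drive failing in several slots points at the drive itself.
    if len(serials) == 1 and len(slots) > 1:
        return "drive"
    # Several drives failing in one slot points at the slot or backplane.
    if len(slots) == 1 and len(serials) > 1:
        return "slot"
    # A single event, or a mixed pattern, does not settle it either way.
    return "inconclusive"


print(diagnose(events))
```

A single failure is always reported as inconclusive here, which matches the advice above: you need to move or swap the drive and see at least one more failure before drawing a conclusion.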

OBlau
Beginner

I had the same thing happen 2 weeks ago after one drive failed.

I was still able to connect to the configuration web page, but the drive rebuild sat at 0% for 5 hours. All VMs on internal storage were shut down, although the VMs on the VTrak were still running. I was unable to bring the storage up again and had to pull all 4 power cables, just as you described.

I have had drives fail in the past that rebuilt without problems.

To reduce the risk of a similar failure, I split my internal storage into 2 storage pools, hoping that only the pool containing the failed drive will be affected by a crash.
