Re: Puzzling disk system problems on S3000PT server boards

idata · 04-26-2012

Dear disk gurus…

I have a small research cluster based on S3000PT boards. Recently, I have hit some puzzling problems. A couple of weeks ago, I replaced dead fans in 5 of the 20 machines. About four days later, two of the machines started having very similar disk system problems - puzzling enough that I'm at a loss as to what to do (two machines down out of 20 seriously dents our research effort; and we don't have any funding to upgrade, so anything we can do to fix these would be really worthwhile).

To summarise: both machines get disk errors when booted from their HDD. In both cases, if I boot a Fedora DVD and run disk utility, their SMART status shows millions of read/write errors; and if I run a read-write test, it aborts with read-write errors. I have performed the following additional tests:

. tried both SATA connectors - no change

. replaced SATA cables - no change

. replaced disks - no change

. powered disk from an external power supply - no change

All of this seems to suggest that the problem is the SATA controller.

The puzzling aspects are:

. if I read the SMART status of any of the four disks I have tried on another machine, they show no history of read/write errors; my understanding was that the SMART status was stored on the on-disk controller, so that the same status should show on any machine it was connected to. Is this a misunderstanding?

. I've run the full test suite from ?? on one of the failing machines, with one of the failing disks connected. No errors at all were detected.
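One way to make the SMART comparison between machines concrete is to capture the attribute table on each machine and diff the raw values. The following is only a minimal sketch, assuming smartmontools is installed and that `smartctl -A <device>` prints its usual ten-column attribute table. The function names and the sample text are hypothetical illustrations, not output from the machines described here.

```python
import subprocess

# Made-up sample in the layout `smartctl -A` normally prints (illustration only).
SAMPLE_A = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   006    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""
SAMPLE_B = SAMPLE_A.replace("-       0\n  5", "-       1234\n  5")


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from `smartctl -A` style text."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(parts) >= 10 and parts[0].isdigit():
            try:
                attrs[parts[1]] = int(parts[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip it
    return attrs


def read_smart(device):
    """Run smartctl on a device (needs root and smartmontools)."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_smart_attributes(result.stdout)


def diff_attributes(a, b):
    """Return {name: (value_in_a, value_in_b)} for attributes that differ."""
    return {n: (a[n], b[n]) for n in a if n in b and a[n] != b[n]}


if __name__ == "__main__":
    good = parse_smart_attributes(SAMPLE_A)
    bad = parse_smart_attributes(SAMPLE_B)
    print(diff_attributes(good, bad))
```

On real hardware, the result of `read_smart("/dev/sda")` (the device name is an assumption) could be saved on each machine and the two dictionaries passed to `diff_attributes`.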

Other things I've tried:

. Running the machines with the replaced fans disconnected (in an environment where I could guarantee plenty of external airflow) - originally, I suspected that the replacement fans might be overloading the power supply (though in theory their power demand is slightly lower than the originals; I also wondered whether the new fans might be injecting noise into the power supply, but again, running with them completely disconnected should have fixed this).

. Reflashing the BIOS

. Fully powering down (i.e. removing the onboard battery for half an hour)

I've had no joy with any of these.

If you have any suggestions on further tests I could run, that would be great! If not, please can I get your thoughts on another alternative: if the SATA controller really is smoked, could we resuscitate the machines by installing consumer-grade PCIe SATA controllers? There's a PCIe x8 slot spare on the board, though it will take a bit of jiggery-pokery and a flexible PCIe cable to get it to fit in the blade chassis.

Thanks in advance for any suggestions

Best Wishes

Bob

idata · 04-26-2012

Sorry, I hit the return key before I remembered to go back and fill in the hardware test; I used the 'inquisitor' test suite for this.

Edward_Z_Intel · 04-26-2012

The first thing that came to my mind was vibration. SATA drives, especially desktop drives, are very sensitive to vibration. Enterprise-level drives tolerate vibration better. You may want to check the Tested hardware and operating system list: http://www.intel.com/support/motherboards/server/s3000pt/sb/CS-023447.htm

You can test the drives outside of the chassis. You can also remove the fans temporarily and try using an external fan to cool the system. This may help you to identify the issue.

idata · 04-26-2012

Thanks Edward, I think you are right about vibration - the new fans are certainly acoustically noisier than the originals (which unfortunately are no longer obtainable), and could be mechanically noisier as well. It would also explain why the two blades which failed are physically right next to each other. Since we probably can't avoid the noise, I think the only solution is to mechanically isolate the disks and fans as far as possible (and when we need to replace the disks, to get server quality disks).

I have done some testing with the disks physically isolated from the system. Three of the four are perfectly OK now, except for reporting many bad sectors (I suspect these are not actually bad, but have been marked bad because of previous errors). On these, I can just ignore the bad sectors and use the disks until they fail (it's easy to reinstall the cluster S/W, so the cost of a dying disk is small). The fourth is still reporting read/write errors at a particular location. I guess I have to assume that this disk really is bad (perhaps a head crash because of the vibration), although I'm still puzzled why it didn't show any errors on a surface scan from the hardware test package, and doesn't show errors when it is mounted on a different system.
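To pin down where a disk reports read errors, and to repeat the same check on a different machine, a read-only sequential scan can record the byte offsets that fail. The following is a rough sketch only, assuming a Linux system where the disk appears as a block device; `scan_for_read_errors` and the chunk size are hypothetical choices, and it is not a substitute for a proper surface-scan tool.

```python
import os


def scan_for_read_errors(path, chunk_size=1024 * 1024):
    """Read `path` sequentially (read-only) and return the byte offsets
    of chunks whose read raised an I/O error."""
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for regular files and,
        # on Linux, for block devices as well.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            length = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, length, offset)
            except OSError:
                # Record the failing chunk and move on to the next one.
                bad_offsets.append(offset)
                offset += length
                continue
            if not data:
                break  # unexpected end of data
            offset += len(data)
    finally:
        os.close(fd)
    return bad_offsets


if __name__ == "__main__":
    import sys

    for off in scan_for_read_errors(sys.argv[1]):
        print(f"read error in chunk starting at byte {off}")
```

Run as root with the device path as the only argument (for example `/dev/sdb`, which is an assumed name). Reads may be served from the operating system's cache, so results are most meaningful on a freshly attached disk.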

For the physical isolation, I think the best I can do is install felt washers where the disks and fans are mounted, as there's not much space for any complex mechanical isolation. If you have any other suggestions, they would be really appreciated.

Thanks again Edward. We typically publish about ten papers a year based on this system, which I hope will last for the next 2-3 years. You've saved us at least 10% of the system (more if we were going to see further such failures down the track). So you can consider yourself as making at least a one-paper-a-year contribution to the field of evolutionary and complex systems.

Best Wishes

Bob

idata · 04-26-2012

PS I originally marked this answer as "helpful" because I thought it wasn't completely consistent with what I was seeing. Further testing convinced me that it's the whole answer. But I can't find any way to change the marking to "complete answer". Edward, if it matters to you, and if you know any way to change the marking, please let me know and I'm happy to change it.

Best Wishes

Bob

Edward_Z_Intel · 04-27-2012

Glad to be able to help. No need to worry about the marking.

In terms of physical isolation, maybe the ultimate solution is to use a better system fan. If that's not possible at the moment, adding rubber pads to the fan and disk carrier would help. Again, enterprise-level HDDs tolerate vibration better.

Regards,

Edward