
Intel HW RAID: consistency check error, puncturing bad block - on new HDDs???

idata
Employee

Hi,

I have an Intel server with an Intel chassis and board and an SRCSASL4I Intel HW RAID card, plus 2 backplane expander boxes, so the HW RAID card has 6 + 4 HDDs attached. I always had problems with the badly chosen WDC 2TB Green HDDs, so a few months ago I replaced the disks and had 6 x WD RED 2TB + 4 x Samsung 2TB. I still had some issues with those disks as well. Now I have replaced all the remaining non-WD REDs, so I have a 10 x WD RED 2TB RAID, and I still have problems!

These errors appeared a few times a week:

Consistency Check detected uncorrectable multiple medium errors:

Puncturing bad block: PD Int.Ports 0-3:2:2

Since all the HDDs are brand new, I cannot understand the error message.

First of all: WHICH DISK does 'Int.Ports 0-3:2:2' refer to? In the RAID Web Console I see 'Connector: Int. Ports 0-3' for both the 4-disk and the 6-disk expander.

Also, the messages are:

Consistency Check detected uncorrectable multiple medium errors: ( PD Int.Ports 0-3:2:2 Location 0x1651aa63 VD 0)

Puncturing bad block: PD Int.Ports 0-3:2:2 Location 0x1651aa63

Consistency Check detected uncorrectable multiple medium errors: ( PD Int.Ports 0-3:2:4 Location 0x16513b66 VD 0)

Puncturing bad block: PD Int.Ports 0-3:2:4 Location 0x16513b66

Consistency Check detected uncorrectable multiple medium errors: ( PD Int.Ports 0-3:2:0 Location 0x13c3085 VD 0)

Puncturing bad block: PD Int.Ports 0-3:2:0 Location 0x13c3085

Consistency Check detected uncorrectable multiple medium errors: ( PD Int.Ports 0-3:2:1 Location 0x13c3005 VD 0)

Puncturing bad block: PD Int.Ports 0-3:2:1 Location 0x13c3005

 

So this refers to 4 disks, as if each of them had a bad block? 4 out of 10 new disks? No way!

Please note that for all disks I have 'Media Error Count: 0' and 'Pred Fail Count: 0' in the RAID web console.

 

 

PLEASE help me understand and resolve the problem.

 

Thank You for reading this and hopefully trying to help me!

 

10 Replies
idata
Employee

Dear mr.teecee,

First of all, validate S.M.A.R.T. values.

Boot into Linux (a LiveCD will do) and run the following command on each disk:

smartctl -A /dev/[hdd]

for example:

smartctl -A /dev/sda

smartctl -A /dev/sg0

It can vary depending on the configuration. You can check the drive name assignment in dmesg.
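
If you want to cover all the drives in one go, a simple loop like this should do it (just a sketch - it assumes the disks show up as /dev/sda, /dev/sdb and so on, so adjust the pattern to what you actually see):

for d in /dev/sd?; do echo "=== $d ==="; smartctl -A "$d"; done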

Post the output here and I will check your results.

Greetings,

Saelic Vogel

idata
Employee

Thanks, I'll probably do that this afternoon/evening. Since the RAID is in use, the easiest way is to shut down the server for a while and take the disks out. (I think I cannot use a LiveCD on the server, since the RAID controller only shows 1 volume as a disk...)

Thanks,

Tamas

idata
Employee

It depends. With some Linux drivers it is possible to check each disk independently.

For example: if the drive advertises itself as /dev/sda, it may be possible to refer to each physical disk of its RAID group via the block devices /dev/sgN (N=0,1,2...).

Run the command ls -l /dev | grep -E "sd|sg" and see what's there.
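
If the sgN devices are there, a loop like this is one way to try it (again just a sketch - SATA disks behind a SAS controller usually need the -d sat option, and non-disk sg devices such as the controller or the enclosures will simply report an error you can ignore):

for d in /dev/sg*; do echo "=== $d ==="; smartctl -A -d sat "$d"; done

Some controllers also work with smartctl's megaraid passthrough (for example smartctl -A -d megaraid,5 /dev/sda), but whether that applies to your setup depends on the driver.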

idata
Employee

>> It depends. With some Linux drivers it is possible to check each disk independently.

Ahh, I see.

Do you know if Ubuntu Server 13.04 or 12.04 LTS supports it? (I'm downloading 13.04 64-bit at the moment, so hopefully I can do it in one shutdown without taking the HDDs out.)

Thanks!

idata
Employee

The kernel should support it, but I'm not sure if the Ubuntu LiveCD has the smartmontools package.

I don't remember when I last used a LiveCD, so I can't advise you which one to pick. Anyway, the main points are 1) what you've got in /dev/ and 2) having the smartmontools package installed.
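
If the LiveCD doesn't ship it, you can usually install it in the live session, assuming you have network access:

sudo apt-get update
sudo apt-get install smartmontools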

You can put results here, I will help you.

idata
Employee

Hi,

I've put the test results up here: https://www.dropbox.com/sh/365ef7jgk6byz87/lY6FzHktAL (Dropbox - WD RED SMART test)

I can see only 1 error in WDRED08.txt: 2 RAW READ ERRORS, but all the other HDDs were OK. Or did I miss something?

The SMART data is at the bottom of the txt files. I could dig into them with SentinelHD after getting the disks out one by one...

Thank you very much for helping me with this.

idata
Employee

In WDRED05 the Spin Up Time is reported as 6700 ms, which is a little longer than usual (compare with 4266 in WDRED07). It's also puzzling that the other disks report zero there. Anyway, it is not related to your problem.

Raw Read Error Rate = 2 in WDRED08 definitely indicates that disk replacement is necessary (especially in mission-critical environments). I would start there to solve the issue - replace this disk. There is a good probability that the consistency error is caused by it.
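
For reference, these are the attributes I look at; a quick filter over your txt files pulls them out (the names are the standard smartmontools ones, and some drives may not report all of them):

egrep 'Raw_Read_Error_Rate|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable' WDRED*.txt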

The other disks are fine. Do what I mentioned above.

S.V.

idata
Employee

>> Raw Read Error Rate = 2 in WDRED08 definitely indicates that disk replacement is necessary (especially in mission-critical environments). I would start there to solve the issue - replace this disk. There is a good probability that the consistency error is caused by it.

>> The other disks are fine. Do what I mentioned above.

Thanks. I have 9 of the 10 disks working in RAID5 and 1 as a hot spare, so I'll replace WDRED08 with the hot spare and run a consistency check again.

Also, I'm curious: the 'puncturing bad block' message appears 4 times and refers to 4 different disks according to the error messages: 0-3:2:2, 0-3:2:4, 0-3:2:0, 0-3:2:1

How do I know which disks these are?
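
(I guess I could try to map them with the controller's CLI - if Intel's CmdTool2 / MegaCLI works with this card, listing the physical disks should show the enclosure and slot numbers that match those IDs. This is just my guess, I haven't tried it yet:

CmdTool2 -PDList -aALL

and then look for the 'Enclosure Device ID' and 'Slot Number' lines.)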

The SMART data was good except for my 08 disk, where there are 2 raw read errors. But 2 is 2, not 4 as stated in the error messages. All the others were clean according to SMART :-\

Thanks again, I'll post the result of the consistency check after the replacement!

 

idata
Employee

>> Thanks again, I'll post the result of the consistency check after the replacement!

Haha! It seems you were right: 2 of the 4 bad block messages appeared shortly after the copyback process started.

If the volume doesn't get much load, it should take about 6 hours to complete the copyback and the consistency check.

idata
Employee

>> 2 is 2, not 4 as stated in the error messages

It's not that simple. Based on the SMART records we can assume there are 2 bad blocks for sure, but we must also assume there could be some pending bad blocks that have not yet been reported by the SMART attributes (Current Pending Sector or Raw Read Error Rate). There is also the Write Error Rate parameter, whose value can vary while the disk is powered on and which (based on WD's information) is only updated across power cycles (strange to me - some time ago it was updated during runtime, I still have to verify that).

If you are interested in my opinion, I would say that HW RAID is obsolete. I don't use RAID controllers for data security, because they don't offer a good level of protection. The only thing you get is dispersion of the data, which in reality you can easily lose. We have a brilliant file system called ZFS. It solves a lot of problems and it's much more reliable and error resilient. It is the file system for current storage needs - even M$ copied the idea in their tragic Windows Server 2012 system, as 'Storage Pools'.

You see, in your situation you don't even know whether the data you put on those bad sectors is still OK - there is no integrity control. The consistency check only informs you that there is some inconsistency between the physical drives that must be corrected. But how does the controller know which data is correct? If you have critical data, the best way to protect it is ZFS, not a $10,000 controller.
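
To give you an idea, building a redundant pool with end-to-end checksumming is as simple as this (a minimal sketch with made-up disk names - in practice the disks should be exposed directly to the OS, not hidden behind a hardware RAID volume):

zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg

zpool scrub tank

zpool status -v tank

A scrub is the ZFS counterpart of a consistency check, but because every block is checksummed, ZFS knows which copy is the good one and repairs the bad one, and zpool status -v tells you exactly which files (if any) were affected.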

Anyway, which block devices referred to the physical disks in your situation? Was it /dev/sgN?
