Medium error - very frequent in multiple drives

kthom5
New Contributor

In a very short span (~2 months), I have had two SSD drives fail in similar ways, so I want to post the information I collected and get feedback.

The vendor confirmed that the drives were tested prior to deployment.

Failure signature: some sectors become unreadable.

/tmp/raid/var/log @mw136.sjc# dd if=/dev/sda of=/dev/null bs=512 skip=1007297200 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000221492 s, 2.3 MB/s

/tmp/raid/var/log @mw136.sjc# dd if=/dev/sda of=/dev/null bs=512 skip=1007297210 count=1
dd: reading '/dev/sda': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.28118 s, 0.0 kB/s

/tmp/raid/var/log @mw136.sjc# dd if=/dev/sda of=/dev/null bs=512 skip=1007297300 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000207949 s, 2.5 MB/s
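Something like the loop below can be used to map out how far the unreadable region extends. It is only a rough sketch: the device name and LBA range are placeholders, and iflag=direct is there so repeated reads actually hit the media rather than the page cache.

DEV=/dev/sda          # suspect drive (placeholder)
START=1007297200      # first LBA to probe (placeholder)
COUNT=200             # number of sectors to probe (placeholder)

for ((lba = START; lba < START + COUNT; lba++)); do
    # read one 512-byte sector directly; report the LBA if the read fails
    if ! dd if="$DEV" of=/dev/null bs=512 skip="$lba" count=1 iflag=direct 2>/dev/null; then
        echo "unreadable LBA: $lba"
    fi
done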

So far in my system, three drives have failed within a short span: one is completely dead and the other two show the symptoms above.

I'm guessing that "fstrim" might be causing this. This is just a hunch and I have no conclusive evidence.

In another system with drives from a non-Intel vendor, enabling fstrim caused XFS filesystem panics and system instability (freezes, etc.).
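Before enabling fstrim on a stack like this, it is worth checking whether every layer (drive, MD device) actually advertises discard support; lsblk shows that directly. The device names below are just examples.

# DISC-GRAN / DISC-MAX columns show per-device discard support;
# a zero DISC-MAX means that layer will not pass discards down.
lsblk --discard /dev/sda /dev/md0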

The system with the Intel drives runs CentOS 7.2 with MD RAID-0.

Since MD RAID-0 does not pass down discards by default, I set the "raid0.devices_discard_performance=Y" module parameter.
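For reference, applying the parameter looks roughly like the sketch below. The modprobe.d file name is arbitrary, md0 is a placeholder, and on this kernel raid0 may be built in, in which case only the kernel command line form applies.

# Kernel command line form (what I used):
#   raid0.devices_discard_performance=Y

# If raid0 is built as a module, the same setting can be persisted via modprobe.d:
echo "options raid0 devices_discard_performance=Y" > /etc/modprobe.d/raid0-discard.conf

# After the array is (re)assembled, a non-zero value here means the md device
# accepts DISCARD:
cat /sys/block/md0/queue/discard_max_bytes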

A note in the Linux kernel sources (linux-3.10.0-327.36.3.el7/drivers/md/raid0.c) says the following about this module parameter:

/* Unfortunately, some devices have awful discard performance,
 * especially for small sized requests. This is particularly
 * bad for RAID0 with a small chunk size resulting in a small
 * DISCARD requests hitting the underlaying drives.
 * Only allow DISCARD if the sysadmin confirms that all devices
 * in use can handle small DISCARD requests at reasonable speed,
 * by setting a module parameter.
 */
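Since the note above is mostly about small DISCARDs generated by a small chunk size, one way to see the numbers involved is to compare the array's chunk size with what each member drive advertises for discard. md0 and sda below are placeholders.

# RAID0 chunk size - on this kernel a large discard gets split so that each
# member drive sees requests roughly bounded by the chunk size
mdadm --detail /dev/md0 | grep -i chunk

# What each member drive advertises for discard
cat /sys/block/sda/queue/discard_granularity
cat /sys/block/sda/queue/discard_max_bytes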

Summary:

PS: Refer to the attachments for detailed SMART data and other logs

1) The drive seems to have correct partition alignment

2) SMART data seems to indicate that the drive has 90% remaining life

3) SMART data shows that "91858" LBAs were written to the drive, which is pretty low for the drive to fail.

SMART devstat data shows the following (I'm not sure which of these numbers is reliable):

1 0x018 6 6020013265 Logical Sectors Written
1 0x020 6 29657954 Number of Write Commands
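If attribute 241 on these drives is counted in 32 MiB units (Intel's "Host_Writes_32MiB" attribute - I am not 100% sure of that), the two counters actually agree with each other:

# Devstat: Logical Sectors Written x 512-byte sectors
echo $(( 6020013265 * 512 ))           # ~3.08 TB

# SMART attribute 241, assuming 32 MiB units
echo $(( 91858 * 32 * 1024 * 1024 ))   # ~3.08 TB

Either way, that is only around 3 TB of host writes, which supports the point above that the drive has seen very few writes.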

Questions:

1) Is this SSD model (INTEL SSDSC2BB016T7) affected by the poor small-DISCARD performance described in the kernel note above? Are any recent (post-2010) drives still affected by the issue that module parameter guards against?

2) Based on the attached information, is it possible to tell why certain sectors are unreadable (e.g., failures induced by too many writes, or other issues evident in the logs/SMART data)?

3) If the attached information is insufficient to conclude why the failure happened, what information do you recommend we collect?

4) Are there any known gotchas around fstrim and Intel SSD drives?


kthom5
New Contributor

Just to clarify, the errors described in the original post still persist after a FW upgrade.

kthom5
New Contributor

This is the request log file from the impacted drive

idata
Esteemed Contributor III

Hi Kthommandra,

Thank you so much for all the information provided. We will keep you posted with any news.

Regards,
Nestor C

kthom5
New Contributor

Hi

We have another occurrence of the exact same issue in another server.

This is the third failure in a short span of time.

Kindly escalate the investigation.

Device Model:     INTEL SSDSC2BB016T7
Firmware Version: N2010112
User Capacity:    1,600,321,314,816 bytes [1.60 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
ATA Version is:   ACS-3 (unknown minor revision code: 0x006d)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)

We had re-enabled fstrim on this server with nomerges=0.

We don't know if the issue is related to fstrim, but for now we have disabled fstrim again.
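For reference, toggling these settings looks roughly like the sketch below. sdX and the mount point are placeholders, and periodic trim may be driven by a cron job rather than the systemd timer on CentOS 7.

# Stop periodic trim if it is driven by the systemd timer
systemctl stop fstrim.timer
systemctl disable fstrim.timer

# One-off manual trim of a mounted filesystem (-v reports how much was trimmed)
fstrim -v /data

# nomerges lives in sysfs; 0 re-enables request merging for that device
echo 0 > /sys/block/sdX/queue/nomerges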

I have attached the nlog from this newly failed drive.

Questions:

1) Can you share some information about the TRIM requirements of these drives, especially what kind of trim requests (size, frequency, etc.) can be detrimental to the drive?

2) Is there any kind of re-formatting that we could do and re-use these drives instead of replacing them?

kthom5
New Contributor

Any update on the investigation?

A related question - when a drive is in this state, could we just re-format it and use it with potentially lowered capacity?