Beginner

Medium error - very frequent in multiple drives

In a very short span (~2 months) I have had two SSD drives fail in similar ways, so I want to post some information I collected and get feedback.

Vendor confirmed that the drives were tested prior to deploying.

Failure signature: some of the sectors become unreadable.

/tmp/raid/var/log @mw136.sjc# dd if=/dev/sda of=/dev/null bs=512 skip=1007297200 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000221492 s, 2.3 MB/s
/tmp/raid/var/log @mw136.sjc# dd if=/dev/sda of=/dev/null bs=512 skip=1007297210 count=1
dd: reading '/dev/sda': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.28118 s, 0.0 kB/s
/tmp/raid/var/log @mw136.sjc# dd if=/dev/sda of=/dev/null bs=512 skip=1007297300 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000207949 s, 2.5 MB/s
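For anyone wanting to reproduce this, the spot checks above generalize into a small read-only scan loop. This is just a sketch of the same dd probing (the function name and range are my own, not from any Intel tool):

```shell
# Read-probe each 512-byte LBA in a range, printing any LBA that fails.
# Read-only, so safe to run against a live drive. Usage:
#   scan_lbas /dev/sda 1007297200 200
scan_lbas() {
  local dev=$1 start=$2 n=$3 lba
  for ((lba = start; lba < start + n; lba++)); do
    if ! dd if="$dev" of=/dev/null bs=512 skip="$lba" count=1 2>/dev/null; then
      echo "unreadable LBA: $lba"
    fi
  done
}
```

Running it over a few thousand LBAs around a known-bad sector shows whether the unreadable region is a single sector or a contiguous range.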

So far, three drives have failed in my systems within a short span: one completely dead, and two others with the symptoms shown above.

I'm guessing that fstrim might be causing this. This is just a hunch; I have no conclusive evidence.

In another system, with drives from a non-Intel vendor, enabling fstrim caused XFS filesystem panics and system instability (freezes, etc.).

The system with the Intel drives runs CentOS 7.2 with MD RAID-0.

Since MD RAID-0 disables discard (and hence fstrim) by default, I used the "raid0.devices_discard_performance=Y" module parameter.

A note in the Linux kernel sources (linux-3.10.0.327.36.3.el7/drivers/md/raid0.c) says the following about this module parameter:

/* Unfortunately, some devices have awful discard performance,
 * especially for small sized requests. This is particularly
 * bad for RAID0 with a small chunk size resulting in a small
 * DISCARD requests hitting the underlaying drives.
 * Only allow DISCARD if the sysadmin confirms that all devices
 * in use can handle small DISCARD requests at reasonable speed,
 * by setting a module parameter.
 */
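For reference, here is a sketch of how that opt-in is typically made persistent on CentOS 7. The config file name is my own choice, and whether raid0 is a loadable module or built into the kernel varies by build, so treat both variants as assumptions to verify on your system:

```shell
# If raid0 is a loadable module (the common case on stock CentOS 7), set the
# option so it applies every time the module loads:
echo 'options raid0 devices_discard_performance=Y' > /etc/modprobe.d/raid0-discard.conf

# If raid0 is instead built into the kernel, pass the same setting on the
# kernel command line: raid0.devices_discard_performance=Y

# After a reboot, the active value should be visible in sysfs:
cat /sys/module/raid0/parameters/devices_discard_performance
```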

Summary (refer to the attachments for detailed SMART data and other logs):

1) The drive seems to have correct partition alignment.

2) SMART data seems to indicate that the drive has 90% remaining life.

3) SMART data shows "91858" LBAs written to the drive, which is pretty low for the drive to fail.

SMART devstat data shows the following; I am not sure which information is reliable:

1 0x018 6 6020013265 Logical Sectors Written

1 0x020 6 29657954 Number of Write Commands
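The two counters can actually be cross-checked. Assuming "Logical Sectors Written" counts 512-byte logical sectors, and that the 91858 figure (SMART attribute F1h, Total LBAs Written) is reported in units of 32 MiB as on other Intel DC-series drives (an assumption worth confirming against the product specification), the two values agree:

```shell
# Cross-check the two write counters; unit assumptions are noted inline.
sectors_written=6020013265   # devstat "Logical Sectors Written", 512-byte sectors
attr_f1=91858                # SMART attribute F1h, assumed unit of 32 MiB
echo "devstat: $(( sectors_written * 512 / 1000000000 )) GB"
echo "F1h:     $(( attr_f1 * 32 * 1024 * 1024 / 1000000000 )) GB"
```

Both work out to roughly 3.08 TB written, so the two sources are consistent with each other and with a drive that is nowhere near write-exhausted.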

Questions:

1) Is this SSD model (INTEL SSDSC2BB016T7) susceptible to the problem described in the kernel note above? Are any recent (post-2010) drives impacted by that module parameter?

2) Based on the attached information, is it possible to tell why certain sectors are unreadable (e.g., failures induced by too many writes, or other issues evident in the logs/SMART data)?

3) If the attached information is insufficient to conclude why the failure happened, what information do you recommend collecting?

4) Are there any known gotchas around fstrim and Intel SSD drives?

10 Replies
Community Manager

Hello Kthommandra,

Thanks for bringing this situation to our attention. We'd like to engage additional resources to investigate this issue; please allow us some time and we'll get back to you with an update.

Regards,

Nestor C
Beginner

Thank you for taking a look.

Another observation we made: when /sys/block/XXX/queue/nomerges is set to 2, fstrim is really slow, and it is fast when the value is 0 or 1.

What is the recommended size for trim requests on these drives?
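For context on why nomerges would affect fstrim: the kernel's queue-sysfs documentation defines 0 as all merging enabled, 1 as disabling only the more complex lookup-based merges, and 2 as disabling merging entirely, so at 2 every small discard from fstrim reaches the drive unmerged. A quick sketch for inspecting the current settings (the helper name is my own):

```shell
# Print the nomerges setting for each queue directory given.
# Per Documentation/block/queue-sysfs.txt: 0 = all merging enabled,
# 1 = lookup-based merges disabled, 2 = all merging disabled.
show_nomerges() {
  local q
  for q in "$@"; do
    printf '%s: %s\n' "$q" "$(cat "$q/nomerges")"
  done
}
```

Usage would be something like: show_nomerges /sys/block/sd*/queue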

Community Manager

Hello Kthommandra,

After checking the logs, we would like you to update the firmware on the SSDs, as the version in the logs is not the latest one.

To do that, please download the https://downloadcenter.intel.com/download/26749/Intel-SSD-Data-Center-Tool Intel® SSD Data Center Tool. The command to run the firmware update is: isdct load -intelssd X (X = index of the drive)

The following changes are included in this firmware update:

• Correction to SMART attribute BBh and F1h increment behavior
• Fix to drive behavior when power loss occurs during Secure Erase
• Fixed issue where SCT Extended Status Code, Action Code and Function Code were not being cleared on a COMRESET
• Fix to address occasional Standby Immediate failure
• Legacy ATA commands not relevant in ACS-3 no longer aborted
• Correction to drive behavior when running a SMART self-test using Smartctl* and an ABORT command is received

At the same time, could you please provide us the nLog from the Intel® SSD Data Center Tool?

The command is: isdct dump -nlog -intelssd X

In case you would like to check the guide, https://www.intel.com/content/dam/support/us/en/documents/solid-state-drives/ssd-software/Intel_SSD_... here it is.

We will be waiting for your response.

Regards,

Nestor C
Beginner

I have updated all the drives in the affected system to the latest firmware:

===
Firmware : N2010112
FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.
===

During the FW update I noticed the following errors in dmesg; I think these are fine:

Jun 10 11:14:13 kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jun 10 11:14:13 kernel: ata5.00: failed command: DOWNLOAD MICROCODE
Jun 10 11:14:13 kernel: ata5.00: cmd 92/03:a4:00:00:07/00:00:00:00:00/40 tag 19 pio 83968 out
Jun 10 11:14:13 kernel: ata5.00: res 40/00:60:27:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 10 11:14:14 kernel: ata5.00: status: { DRDY }
Jun 10 11:14:14 kernel: ata5: hard resetting link
Jun 10 11:14:16 kernel: ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jun 10 11:14:16 kernel: ata5.00: configured for UDMA/133
Jun 10 11:14:16 kernel: ata5: EH complete
Jun 10 11:14:16 kernel: ata5.00: Enabling discard_zeroes_data

Beginner

Just to clarify, the errors described in the original post still persist after FW upgrade.

Beginner

This is the requested log file from the impacted drive.

Community Manager

Hi Kthommandra,

Thank you so much for all the information provided. We will keep you posted with any news.

Regards,

Nestor C
Beginner

Hi,

We have another occurrence of the exact same issue on another server. This is the third failure in a short span of time. Kindly escalate the investigation.

Device Model: INTEL SSDSC2BB016T7
Firmware Version: N2010112
User Capacity: 1,600,321,314,816 bytes [1.60 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
ATA Version is: ACS-3 (unknown minor revision code: 0x006d)
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)

We had re-enabled fstrim on this server with nomerges=0. We don't know whether the issue is related to fstrim, but for now we have disabled fstrim again.

I have attached the nlog from this newly failed drive.

Questions:

1) Can you share some information about the TRIM requirements of these drives, especially what kind of trim requests (size/frequency, etc.) can be detrimental to them?

2) Is there any kind of reformatting we could do to reuse these drives instead of replacing them?

Beginner

Any update on the investigation?

A related question: when a drive is in this state, could we just reformat it and use it with potentially lowered capacity?

Community Manager

Hello Kthommandra,

We apologize for the long delay. Please check your private messages inbox.

Please let us know.

Regards,

Nestor C