Beginner

Medium error - very frequent in multiple drives

In a very short span (~2 months) I have had two SSD drives fail in similar ways, so I want to post some information I collected and get feedback.

Vendor confirmed that the drives were tested prior to deploying.

Failure signature: some of the sectors become unreadable.

/tmp/raid/var/log @mw136.sjc# dd if=/dev/sda of=/dev/null bs=512 skip=1007297200 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000221492 s, 2.3 MB/s
/tmp/raid/var/log @mw136.sjc# dd if=/dev/sda of=/dev/null bs=512 skip=1007297210 count=1
dd: reading '/dev/sda': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.28118 s, 0.0 kB/s
/tmp/raid/var/log @mw136.sjc# dd if=/dev/sda of=/dev/null bs=512 skip=1007297300 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000207949 s, 2.5 MB/s
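For anyone wanting to reproduce this, the spot checks above generalize into a small read-only scan loop. This is just a sketch of the same dd probing (the function name and range are my own, not from any Intel tool):

```shell
# Read-probe each 512-byte LBA in a range, printing any LBA that fails.
# Read-only, so safe to run against a live drive. Usage:
#   scan_lbas /dev/sda 1007297200 200
scan_lbas() {
  local dev=$1 start=$2 n=$3 lba
  for ((lba = start; lba < start + n; lba++)); do
    if ! dd if="$dev" of=/dev/null bs=512 skip="$lba" count=1 2>/dev/null; then
      echo "unreadable LBA: $lba"
    fi
  done
}
```

Running it over a few thousand LBAs around a known-bad sector shows whether the unreadable region is a single sector or a contiguous range.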

So far, three drives have failed in my systems within a short span: one completely dead, and two others with the symptoms shown above.

I'm guessing that fstrim might be causing this. This is just a hunch; I have no conclusive evidence.

In another system, with drives from a non-Intel vendor, enabling fstrim caused XFS filesystem panics and system instability (freezes, etc.).

The system with the Intel drives runs CentOS 7.2 with MD RAID-0.

Since MD RAID-0 disables discard (and hence fstrim) by default, I used the "raid0.devices_discard_performance=Y" module parameter.

A note in the Linux kernel sources (linux-3.10.0.327.36.3.el7/drivers/md/raid0.c) says the following about this module parameter:

/* Unfortunately, some devices have awful discard performance,
 * especially for small sized requests. This is particularly
 * bad for RAID0 with a small chunk size resulting in a small
 * DISCARD requests hitting the underlaying drives.
 * Only allow DISCARD if the sysadmin confirms that all devices
 * in use can handle small DISCARD requests at reasonable speed,
 * by setting a module parameter.
 */
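For reference, here is a sketch of how that opt-in is typically made persistent on CentOS 7. The config file name is my own choice, and whether raid0 is a loadable module or built into the kernel varies by build, so treat both variants as assumptions to verify on your system:

```shell
# If raid0 is a loadable module (the common case on stock CentOS 7), set the
# option so it applies every time the module loads:
echo 'options raid0 devices_discard_performance=Y' > /etc/modprobe.d/raid0-discard.conf

# If raid0 is instead built into the kernel, pass the same setting on the
# kernel command line: raid0.devices_discard_performance=Y

# After a reboot, the active value should be visible in sysfs:
cat /sys/module/raid0/parameters/devices_discard_performance
```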

Summary (refer to the attachments for detailed SMART data and other logs):

1) The drive seems to have correct partition alignment.

2) SMART data seems to indicate that the drive has 90% remaining life.

3) SMART data shows "91858" LBAs written to the drive, which is pretty low for the drive to fail.

SMART devstat data shows the following; I am not sure which information is reliable:

1 0x018 6 6020013265 Logical Sectors Written

1 0x020 6 29657954 Number of Write Commands
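The two counters can actually be cross-checked. Assuming "Logical Sectors Written" counts 512-byte logical sectors, and that the 91858 figure (SMART attribute F1h, Total LBAs Written) is reported in units of 32 MiB as on other Intel DC-series drives (an assumption worth confirming against the product specification), the two values agree:

```shell
# Cross-check the two write counters; unit assumptions are noted inline.
sectors_written=6020013265   # devstat "Logical Sectors Written", 512-byte sectors
attr_f1=91858                # SMART attribute F1h, assumed unit of 32 MiB
echo "devstat: $(( sectors_written * 512 / 1000000000 )) GB"
echo "F1h:     $(( attr_f1 * 32 * 1024 * 1024 / 1000000000 )) GB"
```

Both work out to roughly 3.08 TB written, so the two sources are consistent with each other and with a drive that is nowhere near write-exhausted.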

Questions:

1) Is this SSD model (INTEL SSDSC2BB016T7) susceptible to the problem described in the kernel note above? Are any recent (post-2010) drives impacted by that module parameter?

2) Based on the attached information, is it possible to tell why certain sectors are unreadable (e.g., failures induced by too many writes, or other issues evident in the logs/SMART data)?

3) If the attached information is insufficient to conclude why the failure happened, what information do you recommend collecting?

4) Are there any known gotchas around fstrim and Intel SSD drives?

10 Replies
Community Manager

Hello Kthommandra,

Thanks for bringing this situation to our attention. We'd like to engage additional resources to investigate this issue; please allow us some time and we'll get back to you with an update.

Regards,

Nestor C
Beginner

Thank you for taking a look.

Another observation we made: when /sys/block/XXX/queue/nomerges is set to 2, fstrim is really slow, and it is fast when the value is 0 or 1.

What is the recommended size for trim requests on these drives?
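For context on why nomerges would affect fstrim: the kernel's queue-sysfs documentation defines 0 as all merging enabled, 1 as disabling only the more complex lookup-based merges, and 2 as disabling merging entirely, so at 2 every small discard from fstrim reaches the drive unmerged. A quick sketch for inspecting the current settings (the helper name is my own):

```shell
# Print the nomerges setting for each queue directory given.
# Per Documentation/block/queue-sysfs.txt: 0 = all merging enabled,
# 1 = lookup-based merges disabled, 2 = all merging disabled.
show_nomerges() {
  local q
  for q in "$@"; do
    printf '%s: %s\n' "$q" "$(cat "$q/nomerges")"
  done
}
```

Usage would be something like: show_nomerges /sys/block/sd*/queue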

Community Manager

Hello Kthommandra,

After checking the logs, we would like you to update the firmware on the SSDs, as the version in the logs is not the latest one.

To do that, please download the https://downloadcenter.intel.com/download/26749/Intel-SSD-Data-Center-Tool Intel® SSD Data Center Tool. The command to run the firmware update is: isdct load -intelssd X (X = index of the drive)

The following changes are included in this firmware update:

• Correction to SMART attribute BBh and F1h increment behavior
• Fix to drive behavior when power loss occurs during Secure Erase
• Fixed issue where SCT Extended Status Code, Action Code and Function Code were not being cleared on a COMRESET
• Fix to address occasional Standby Immediate failure
• Legacy ATA commands not relevant in ACS-3 no longer aborted
• Correction to drive behavior when running a SMART self-test using Smartctl* and an ABORT command is received

At the same time, could you please provide us the nLog from the Intel® SSD Data Center Tool?

The command is: isdct dump -nlog -intelssd X

In case you would like to check the guide, https://www.intel.com/content/dam/support/us/en/documents/solid-state-drives/ssd-software/Intel_SSD_... here it is.

We will be waiting for your response.

Regards,

Nestor C
Beginner

I have updated all the drives in the affected system to the latest firmware:

===
Firmware : N2010112
FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.
===

During the FW update I noticed the following errors in dmesg; I think these are fine:

Jun 10 11:14:13 kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jun 10 11:14:13 kernel: ata5.00: failed command: DOWNLOAD MICROCODE
Jun 10 11:14:13 kernel: ata5.00: cmd 92/03:a4:00:00:07/00:00:00:00:00/40 tag 19 pio 83968 out
Jun 10 11:14:13 kernel: ata5.00: res 40/00:60:27:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 10 11:14:14 kernel: ata5.00: status: { DRDY }
Jun 10 11:14:14 kernel: ata5: hard resetting link
Jun 10 11:14:16 kernel: ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jun 10 11:14:16 kernel: ata5.00: configured for UDMA/133
Jun 10 11:14:16 kernel: ata5: EH complete
Jun 10 11:14:16 kernel: ata5.00: Enabling discard_zeroes_data

Beginner

Just to clarify, the errors described in the original post still persist after FW upgrade.

Beginner

This is the requested log file from the impacted drive.

Community Manager

Hi Kthommandra,

Thank you so much for all the information provided. We will keep you posted with any news.

Regards,

Nestor C
Beginner

Hi,

We have another occurrence of the exact same issue on another server. This is the third failure in a short span of time. Kindly escalate the investigation.

Device Model: INTEL SSDSC2BB016T7
Firmware Version: N2010112
User Capacity: 1,600,321,314,816 bytes [1.60 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
ATA Version is: ACS-3 (unknown minor revision code: 0x006d)
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)

We had re-enabled fstrim on this server with nomerges=0. We don't know whether the issue is related to fstrim, but for now we have disabled fstrim again.

I have attached the nlog from this newly failed drive.

Questions:

1) Can you share some information about the TRIM requirements of these drives, especially what kind of trim requests (size/frequency, etc.) can be detrimental to them?

2) Is there any kind of reformatting we could do to reuse these drives instead of replacing them?

Beginner

Any update on the investigation?

A related question: when a drive is in this state, could we just reformat it and use it with potentially lowered capacity?

Community Manager

Hello Kthommandra,

We apologize for the long delay. Please check your private messages inbox.

Please let us know.

Regards,

Nestor C