Solidigm

ASmit32 · 07-31-2015

Hi,

I have a new Linux machine with two DC S3610 1.6TB SSDs. It's running Debian jessie, so the kernel is 3.16 (3.16.0-4-amd64). About a month after installation, these errors started appearing:

Jul 30 16:30:59 snaps kernel: [186914.249429] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen

Jul 30 16:30:59 snaps kernel: [186914.250465] ata1.00: failed command: WRITE FPDMA QUEUED

Jul 30 16:30:59 snaps kernel: [186914.251505] ata1.00: cmd 61/08:00:39:db:8e/00:00:09:00:00/40 tag 0 ncq 4096 out

Jul 30 16:30:59 snaps kernel: [186914.251505] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Jul 30 16:30:59 snaps kernel: [186914.253613] ata1.00: status: { DRDY }

Jul 30 16:30:59 snaps kernel: [186914.254781] ata1.00: failed command: WRITE FPDMA QUEUED

Jul 30 16:30:59 snaps kernel: [186914.255810] ata1.00: cmd 61/08:08:71:fc:4e/00:00:66:00:00/40 tag 1 ncq 4096 out

Jul 30 16:30:59 snaps kernel: [186914.255810] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Jul 30 16:30:59 snaps kernel: [186914.257940] ata1.00: status: { DRDY }

Jul 30 16:30:59 snaps kernel: [186914.259086] ata1: hard resetting link

Jul 30 16:31:00 snaps kernel: [186914.577366] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Jul 30 16:31:00 snaps kernel: [186914.578307] ata1.00: configured for UDMA/133

Jul 30 16:31:00 snaps kernel: [186914.578310] ata1.00: device reported invalid CHS sector 0

Jul 30 16:31:00 snaps kernel: [186914.578311] ata1.00: device reported invalid CHS sector 0

Jul 30 16:31:00 snaps kernel: [186914.578316] ata1: EH complete

The error is always the same, and the only thing on ata1.00 is one of the SSDs. I switched the two SSDs around and the problem followed the same SSD.

I can't force the error to happen on demand, it just seems to happen every other day or so, though not at the same time of day. All IO is held up briefly while the link is reset. The drive passes a SMART long self-test.
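Since the resets cannot be triggered on demand, one way to document how often they occur (for example, to attach to a support ticket) is to tally the "failed command" lines from the kernel log. The following is only a minimal sketch: the function name `count_failed_commands` is made up for illustration, and the regular expression assumes the traditional syslog line layout shown in the excerpt above.

```python
import re
from collections import Counter

# Matches syslog lines such as:
#   Jul 30 16:30:59 snaps kernel: [186914.250465] ata1.00: failed command: WRITE FPDMA QUEUED
# Assumption: "Mon DD HH:MM:SS host kernel:" prefix, as in the log excerpt above.
FAILED_CMD = re.compile(
    r"^(?P<day>\w{3}\s+\d+)\s+\S+\s+\S+\s+kernel:.*?"
    r"(?P<dev>ata\d+\.\d+): failed command: (?P<cmd>.+?)\s*$"
)


def count_failed_commands(lines):
    """Count failed ATA commands per (day, device, command name)."""
    counts = Counter()
    for line in lines:
        m = FAILED_CMD.match(line)
        if m:
            counts[(m["day"], m["dev"], m["cmd"])] += 1
    return counts
```

It can be fed any iterable of log lines, for instance an open file handle on the kernel log (the path depends on the syslog configuration), and the resulting counter shows at a glance which port and which command type the timeouts cluster on.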

So is this drive faulty? If not, what can I try to fix this? If so, is there an easy way to prove it for RMA purposes?

Jul 27 05:59:30 snaps kernel: [ 33.054376] ata1.00: ATA-9: INTEL SSDSC2BX016T4, G2010110, max UDMA/133

Jul 27 05:59:30 snaps kernel: [ 33.054474] ata1.00: 3125627568 sectors, multi 1: LBA48 NCQ (depth 31/32)

Jul 27 05:59:30 snaps kernel: [ 33.054567] ata2.00: ATA-9: INTEL SSDSC2BX016T4, G2010110, max UDMA/133

Jul 27 05:59:30 snaps kernel: [ 33.054657] ata2.00: 3125627568 sectors, multi 1: LBA48 NCQ (depth 31/32)

$ sudo smartctl -i /dev/sda

smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)

=== START OF INFORMATION SECTION ===

Device Model: INTEL SSDSC2BX016T4

Serial Number: BTHC511604V41P6PGN

LU WWN Device Id: 5 5cd2e4 04b7b1bfa

Firmware Version: G2010110

User Capacity: 1,600,321,314,816 bytes [1.60 TB]

Sector Sizes: 512 bytes logical, 4096 bytes physical

Rotation Rate: Solid State Device

Form Factor: 2.5 inches

Device is: Not in smartctl database [for details use: -P showall]

ATA Version is: ACS-2 T13/2015-D revision 3

SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)

Local Time is: Fri Jul 31 11:04:09 2015 UTC

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

$ sudo smartctl -i /dev/sdb

smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)

=== START OF INFORMATION SECTION ===

Device Model: INTEL SSDSC2BX016T4

Serial Number: BTHC511604SD1P6PGN

LU WWN Device Id: 5 5cd2e4 04b7b1ba2

Firmware Version: G2010110

User Capacity: 1,600,321,314,816 bytes [1.60 TB]

Sector Sizes: 512 bytes logical, 4096 bytes physical

Rotation Rate: Solid State Device

Form Factor: 2.5 inches

Device is: Not in smartctl database [for details use: -P showall]

ATA Version is: ACS-2 T13/2015-D revision 3

SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)

Local Time is: Fri Jul 31 11:04:35 2015 UTC

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

Message was edited by: Andy Smith. I am now seeing the same problems with the other SSD, so this is not restricted to a single drive.

jbenavides · 08-12-2015

Hello dchepishev,

We are looking into this issue and an update will be provided once we have more information.

Please keep us informed in case the issue reappears.

DNeri · 08-18-2015

We're seeing this same issue on 5 identical servers with Supermicro X10SLM+-LN4F motherboards in Supermicro 813MT-350CB 1U chassis, with one S3610 SSD in each machine, connected via the hot-swap backplane on the chassis to the onboard Intel C224 chipset 6 Gbps SATA ports.

One of the 5 machines has one additional spare S3610 - this has not shown any failures/resets, but there's no I/O being performed on it.

These S3610 SSDs were installed in the beginning of July, and the first command failure/bus reset occurred within a couple of days. It doesn't occur every day, and at the most a couple of times per day (currently we only have logs for 4-5 weeks back). I/O load is not high.

The machines also have DC S3500 series SSDs, which have been working flawlessly for the last year.

OS: Debian 7 (Wheezy), 64-bit

BIOS: AMI BIOS, version 1.1a.

There is currently no SMART status monitoring running on these machines.

Excerpt from Linux kernel log output:

Aug 14 11:07:06 hotel kernel: [3273761.737966] ata2.00: exception Emask 0x0 SAct 0x30000000 SErr 0x0 action 0x6 frozen

Aug 14 11:07:06 hotel kernel: [3273761.738054] ata2.00: failed command: WRITE FPDMA QUEUED

Aug 14 11:07:06 hotel kernel: [3273761.738103] ata2.00: cmd 61/10:e0:c0:70:05/00:00:10:00:00/40 tag 28 ncq 8192 out

Aug 14 11:07:06 hotel kernel: [3273761.738105] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)

Aug 14 11:07:06 hotel kernel: [3273761.738238] ata2.00: status: { DRDY }

Aug 14 11:07:06 hotel kernel: [3273761.738281] ata2.00: failed command: WRITE FPDMA QUEUED

Aug 14 11:07:06 hotel kernel: [3273761.738334] ata2.00: cmd 61/10:e8:c0:70:25/00:00:13:00:00/40 tag 29 ncq 8192 out

Aug 14 11:07:06 hotel kernel: [3273761.738336] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Aug 14 11:07:06 hotel kernel: [3273761.738467] ata2.00: status: { DRDY }

Aug 14 11:07:06 hotel kernel: [3273761.738512] ata2: hard resetting link

Aug 14 11:07:07 hotel kernel: [3273762.057688] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Aug 14 11:07:07 hotel kernel: [3273762.058814] ata2.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded

Aug 14 11:07:07 hotel kernel: [3273762.058825] ata2.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out

Aug 14 11:07:07 hotel kernel: [3273762.058833] ata2.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out

Aug 14 11:07:07 hotel kernel: [3273762.060141] ata2.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded

Aug 14 11:07:07 hotel kernel: [3273762.060149] ata2.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out

Aug 14 11:07:07 hotel kernel: [3273762.060155] ata2.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out

Aug 14 11:07:07 hotel kernel: [3273762.060500] ata2.00: configured for UDMA/133

Aug 14 11:07:07 hotel kernel: [3273762.060510] ata2.00: device reported invalid CHS sector 0

Aug 14 11:07:07 hotel kernel: [3273762.060515] ata2.00: device reported invalid CHS sector 0

Aug 14 11:07:07 hotel kernel: [3273762.060528] ata2: EH complete
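For anyone comparing these logs across machines, the `cmd` line can be decoded to see which tag, transfer size and LBA each timed-out command carried. The sketch below assumes the field order libata uses when printing the taskfile (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and the NCQ convention that the sector count travels in the feature fields and the tag in the upper bits of nsect. The function name `decode_ncq_cmd` is made up for illustration, and the special case where a count of 0 means 65536 sectors is not handled.

```python
import re

# Finds the taskfile printed after "cmd " in a libata error line, e.g.
#   ata2.00: cmd 61/10:e0:c0:70:05/00:00:10:00:00/40 tag 28 ncq 8192 out
CMD_LINE = re.compile(
    r"cmd (?P<tf>[0-9a-f]{2}/(?:[0-9a-f]{2}:){4}[0-9a-f]{2}"
    r"/(?:[0-9a-f]{2}:){4}[0-9a-f]{2}/[0-9a-f]{2})"
)


def decode_ncq_cmd(line):
    """Decode opcode, NCQ tag, sector count and LBA from a libata 'cmd' line."""
    m = CMD_LINE.search(line)
    if m is None:
        raise ValueError("no libata 'cmd' taskfile found in line")
    command, low, high, _device = m["tf"].split("/")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    return {
        "opcode": int(command, 16),              # 0x61 = WRITE FPDMA QUEUED
        "tag": nsect >> 3,                       # NCQ tag lives in bits 7:3
        "sectors": (hob_feature << 8) | feature,  # count of 512-byte sectors
        "lba": (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
               | (lbah << 16) | (lbam << 8) | lbal,
    }
```

Applied to the first failed command above, the decoded tag (28) and size (16 sectors, i.e. 8192 bytes) agree with the "tag 28 ncq 8192" suffix the kernel prints, which is a useful sanity check on the assumed field order.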

SMART info & attributes (from one of the machines):

root@hotel:~# smartctl -iA /dev/sdb

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)

=== START OF INFORMATION SECTION ===

Device Model: INTEL SSDSC2BX400G4

Serial Number: BTHC514101W7400VGN

LU WWN Device Id: 5 5cd2e4 04b7ca92d

Firmware Version: G2010110

User Capacity: 400,088,457,216 bytes [400 GB]

Sector Sizes: 512 bytes logical, 4096 bytes physical

Device is: Not in smartctl database [for details use: -P showall]

ATA Version is: 8

ATA Standard is: ACS-2 revision 3

Local Time is: Tue Aug 18 15:28:28 2015 UTC

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART Attributes Data Structure revision number: 1

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE

5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0

9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 1028

12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 2

170 Unknown_Attribute 0x0033 100 100 010 Pre-fail Always - 0

171 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0

172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0

174 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 1

175 Program_Fail_Count_Chip 0x0033 100 100 010 Pre-fail Always - 21563578034

183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0

184 End-to-End_Error 0x0033 100 100 090 Pre-fail Always - 0

187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0

190 Airflow_Temperature_Cel 0x0022 079 077 000 Old_age Always - 21 (Min/Max 20/24)

192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 1

194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 21

197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0

199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0

225 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 3207

226 Load-in_Time 0x0032 100 100 000 Old_age Always - 0

227 Torq-amp_Count 0x0032 100 100 000 Old_age Always - 13

228 Power-off_Retract_Count 0x0032 100 100 000 Old_age Always - 61697

232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0

233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 0

234 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0

241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 3207

242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 506

PCI info for SATA controller (from lspci):

00:1f.2 SATA controller: Intel Corporation Lynx Point 6-port SATA Controller 1 [AHCI mode] (rev 05) (prog-if 01 [AHCI 1.0])

Subsystem: Super Micro Computer Inc Device 0806

Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+

Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR-

Latency: 0

Interrupt: pin B routed to IRQ 51

Region 0: I/O ports at f050 [size=8]

Region 1: I/O ports at f040 [size=4]

Region 2: I/O ports at f030 [size=8]

Region 3: I/O ports at f020 [size=4]

Region 4: I/O ports at f000 [size=32]...

AKoro1 · 08-18-2015

In the meantime, you can disable NCQ via libata with libata.force=X:noncq for the specific link. Reducing the queue depth to 1 did not help in my case; you probably need to eliminate the possibility of NCQ tags being issued altogether. Hopefully the new firmware with the fix will be released publicly this week and this ugly hack can be thrown out.
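To make the workaround concrete: the X in libata.force=X:noncq is the port (optionally port.device) number that libata prints in the log, so ata1.00 becomes 1.00. The sketch below builds the parameter for a list of affected devices and checks a probe line for active queueing. Both function names are hypothetical helpers, and the check only recognises the "NCQ (depth N/M)" form seen in the logs earlier in this thread; anything else is treated as NCQ not in use.

```python
import re


def noncq_param(ata_ids):
    """Build a libata.force kernel parameter disabling NCQ for the given IDs.

    ata_ids are strings as printed by libata, e.g. "ata1.00" or "ata2".
    """
    values = []
    for ata_id in ata_ids:
        m = re.fullmatch(r"ata(\d+)(?:\.(\d+))?", ata_id)
        if m is None:
            raise ValueError(f"not a libata ID: {ata_id!r}")
        port, device = m.groups()
        ident = port if device is None else f"{port}.{device}"
        values.append(f"{ident}:noncq")
    return "libata.force=" + ",".join(values)


def ncq_in_use(probe_line):
    """Return True if a libata probe line reports an active NCQ queue depth."""
    return re.search(r"NCQ \(depth \d+(?:/\d+)?\)", probe_line) is not None
```

For the two drives in the first post, `noncq_param(["ata1.00", "ata2.00"])` yields `libata.force=1.00:noncq,2.00:noncq`, which would go on the kernel command line (on Debian, typically via `GRUB_CMDLINE_LINUX` in `/etc/default/grub` followed by `update-grub`). After a reboot, the probe line for each drive should no longer report a queue depth.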

ASmit32 · 08-20-2015

Hi Andrey,

Have you had any indication that a fix for this is coming in a forthcoming firmware release, then?

So far I've had no response to my requests for updates on this, and I need to decide whether to wait or to return the drives for a refund and buy something else.

AKoro1 · 08-20-2015

Hi Andy,

A week ago, in a phone conversation, a support engineer gave the end of the current week as the approximate firmware release date, though that could obviously be delayed a little. The corresponding ticket is still open, as I asked them to hold it until the issue is completely resolved. I am relatively fine with the "workaround" for now because our hot caches are unlikely to ever generate more than 2K IOPS per caching device, so I changed my mind and will keep using the buggy devices as they are rather than filing an RMA. Over the next couple of months the tiering scheme in our datacenter is going to change, and a single-queued SSD will no longer be an option for the tier-1 "IOPS dampeners". Please share your RMA experience if you decide not to wait for a firmware release.


S3610 SSDs have failed "READ/WRITE FPDMA QUEUED" ATA commands, frozen, then link reset