S3610 SSDs have failed "READ/WRITE FPDMA QUEUED" ATA commands, frozen, then link reset

ASmit32
New Contributor II

Hi,

I have a new Linux machine with two DC S3610 1.6TB SSDs. It's Debian jessie, so kernel 3.16. About a month after installation, these errors started appearing:

Jul 30 16:30:59 snaps kernel: [186914.249429] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen

Jul 30 16:30:59 snaps kernel: [186914.250465] ata1.00: failed command: WRITE FPDMA QUEUED

Jul 30 16:30:59 snaps kernel: [186914.251505] ata1.00: cmd 61/08:00:39:db:8e/00:00:09:00:00/40 tag 0 ncq 4096 out

Jul 30 16:30:59 snaps kernel: [186914.251505] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Jul 30 16:30:59 snaps kernel: [186914.253613] ata1.00: status: { DRDY }

Jul 30 16:30:59 snaps kernel: [186914.254781] ata1.00: failed command: WRITE FPDMA QUEUED

Jul 30 16:30:59 snaps kernel: [186914.255810] ata1.00: cmd 61/08:08:71:fc:4e/00:00:66:00:00/40 tag 1 ncq 4096 out

Jul 30 16:30:59 snaps kernel: [186914.255810] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Jul 30 16:30:59 snaps kernel: [186914.257940] ata1.00: status: { DRDY }

Jul 30 16:30:59 snaps kernel: [186914.259086] ata1: hard resetting link

Jul 30 16:31:00 snaps kernel: [186914.577366] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Jul 30 16:31:00 snaps kernel: [186914.578307] ata1.00: configured for UDMA/133

Jul 30 16:31:00 snaps kernel: [186914.578310] ata1.00: device reported invalid CHS sector 0

Jul 30 16:31:00 snaps kernel: [186914.578311] ata1.00: device reported invalid CHS sector 0

Jul 30 16:31:00 snaps kernel: [186914.578316] ata1: EH complete

The error is always the same, and the only thing on ata1.00 is one of the SSDs. I switched the two SSDs around and the problem followed the same SSD.
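For anyone who wants to see which sectors the timed-out commands were addressing, here is a minimal Python sketch that decodes the `cmd` line of the log above. It assumes the field order that libata's error handler prints (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and the NCQ convention that the sector count sits in the feature registers and the tag in the upper bits of the sector-count register. The function name `decode_ncq_cmd` is made up for this example.

```python
import re

# Field order printed by libata's error handler for the "cmd" line:
# command/feature:nsect:lbal:lbam:lbah/
#   hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
CMD_RE = re.compile(
    r"cmd\s+([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/"
    r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/([0-9a-f]{2})"
)

# The two NCQ opcodes that appear in this thread's title.
NCQ_OPCODES = {0x60: "READ FPDMA QUEUED", 0x61: "WRITE FPDMA QUEUED"}


def decode_ncq_cmd(line):
    """Decode an NCQ 'cmd' taskfile from a libata error line (hypothetical helper)."""
    m = CMD_RE.search(line)
    if m is None:
        raise ValueError("no libata cmd taskfile found in line")
    command = int(m.group(1), 16)
    if command not in NCQ_OPCODES:
        raise ValueError("not an NCQ read/write command: 0x%02x" % command)
    feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in m.group(2).split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in m.group(3).split(":")
    )
    # 48-bit LBA: the "hob" (high order byte) registers hold bits 24..47.
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    # For NCQ the sector count lives in the feature registers and the
    # tag in bits 3..7 of the sector-count register.
    # (A count of 0 would mean 65536 sectors; not handled in this sketch.)
    sectors = hob_feature << 8 | feature
    return {"command": NCQ_OPCODES[command], "tag": nsect >> 3,
            "lba": lba, "sectors": sectors}


if __name__ == "__main__":
    sample = "ata1.00: cmd 61/08:00:39:db:8e/00:00:09:00:00/40 tag 0 ncq 4096 out"
    print(decode_ncq_cmd(sample))
```

On the two failed commands above this gives 8 sectors each (8 x 512 bytes = 4096, matching the "ncq 4096 out" in the log) and tags 0 and 1, so the decoding agrees with what the kernel printed. The LBAs are far apart, which at least suggests the timeouts are not tied to one region of the drive.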

I can't force the error to happen on demand; it just seems to happen every other day or so, though not at the same time of day. All IO is held up briefly while the link is reset. The drive passes a SMART long self-test.
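To put numbers on "every other day or so", something like the following Python sketch could tally the failures per day and per device from the kernel log. It assumes the syslog line format shown above (month, day, time, host, `kernel:`, then the libata message); the function name `tally_failures` and the log file location are assumptions, not something from this thread.

```python
import re
from collections import Counter

# Matches lines such as:
# Jul 30 16:30:59 snaps kernel: [186914.250465] ata1.00: failed command: WRITE FPDMA QUEUED
FAILED_RE = re.compile(
    r"^(\w{3}\s+\d+)\s+[\d:]+\s+\S+\s+kernel:.*?\b(ata\d+\.\d+): failed command: (.+)$"
)


def tally_failures(lines):
    """Count 'failed command' entries per (day, device, command). Hypothetical helper."""
    counts = Counter()
    for line in lines:
        m = FAILED_RE.match(line.strip())
        if m:
            day, device, command = m.groups()
            counts[(day, device, command)] += 1
    return counts
```

It could be fed from whatever file the kernel messages end up in, for example `tally_failures(open("/var/log/kern.log"))` if that is where your syslog daemon writes them (an assumption; check your own configuration). A table of dates and counts per `ataN.00` device is also easier to hand to a support agent than raw log excerpts.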

So is this drive faulty? If not, what can I try to fix this? If so, is there an easy way to prove it for RMA purposes?

Jul 27 05:59:30 snaps kernel: [ 33.054376] ata1.00: ATA-9: INTEL SSDSC2BX016T4, G2010110, max UDMA/133

Jul 27 05:59:30 snaps kernel: [ 33.054474] ata1.00: 3125627568 sectors, multi 1: LBA48 NCQ (depth 31/32)

Jul 27 05:59:30 snaps kernel: [ 33.054567] ata2.00: ATA-9: INTEL SSDSC2BX016T4, G2010110, max UDMA/133

Jul 27 05:59:30 snaps kernel: [ 33.054657] ata2.00: 3125627568 sectors, multi 1: LBA48 NCQ (depth 31/32)

$ sudo smartctl -i /dev/sda

smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)

Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===

Device Model: INTEL SSDSC2BX016T4

Serial Number: BTHC511604V41P6PGN

LU WWN Device Id: 5 5cd2e4 04b7b1bfa

Firmware Version: G2010110

User Capacity: 1,600,321,314,816 bytes [1.60 TB]

Sector Sizes: 512 bytes logical, 4096 bytes physical

Rotation Rate: Solid State Device

Form Factor: 2.5 inches

Device is: Not in smartctl database [for details use: -P showall]

ATA Version is: ACS-2 T13/2015-D revision 3

SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)

Local Time is: Fri Jul 31 11:04:09 2015 UTC

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

$ sudo smartctl -i /dev/sdb

smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)

Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===

Device Model: INTEL SSDSC2BX016T4

Serial Number: BTHC511604SD1P6PGN

LU WWN Device Id: 5 5cd2e4 04b7b1ba2

Firmware Version: G2010110

User Capacity: 1,600,321,314,816 bytes [1.60 TB]

Sector Sizes: 512 bytes logical, 4096 bytes physical

Rotation Rate: Solid State Device

Form Factor: 2.5 inches

Device is: Not in smartctl database [for details use: -P showall]

ATA Version is: ACS-2 T13/2015-D revision 3

SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)

Local Time is: Fri Jul 31 11:04:35 2015 UTC

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

Message was edited by: Andy Smith. Now seeing the same problems with the other SSD, so this is not restricted to a single drive.

45 REPLIES

jbenavides
Valued Contributor II

Hello andreykorolyov,

If the new SSDs are not working as expected in your system and you would like to exchange them, we advise you to contact the place of purchase, especially if you obtained them as samples or for testing purposes.

Please note that for warranty issues, you should Contact Support (http://www.intel.com/p/en_US/support/contactsupport) to engage a support agent in your nearest support center.

Thank you, Jonathan. I will contact the SC tomorrow.

Would the Intel engineering team be interested in investigating the issue further? I can easily reproduce the problem with a 20-minute FIO test run on any SSD from the set we bought. The firmware updater says that the running version is the latest, so I suppose the problem is bound to the specific SSD hardware. Again, the issue affects at least ten disks from our batch, and I strongly believe that the rest are affected as well, so I'd like to help fix this issue instead of only returning the drives. For now it looks like both the C602 and C220 chipsets are affected, and I can confirm that both the SATA and SAS downlinks expose the issue on the C602.

ASmit32
New Contributor II

I've since seen the same problems with the other drive in the pair, so I now find it hard to believe that this is a single faulty drive. I've edited the post title to reflect this.

I do not now know how to proceed. I need to know if the problem is a bug in the Linux kernel, in the SATA chipset or in the drives themselves.

It seems I can make the problem go away by disabling NCQ, but this reduces the performance of the drive to around 25% of max IOPS, so it is not a long-term solution.
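For others who want to try the same workaround: one common way to turn NCQ off at runtime on Linux is to set the device's queue depth to 1 through sysfs (`/sys/block/<disk>/device/queue_depth`); booting with `libata.force=noncq` is another. This is a minimal Python sketch of the sysfs route, not necessarily the method used above. The helper name `set_queue_depth` and its `sysfs_root` parameter are made up for the example, and it needs root to write to the real `/sys`.

```python
from pathlib import Path


def set_queue_depth(disk, depth, sysfs_root="/sys"):
    """Write a new queue depth for `disk` and return the previous value.

    A depth of 1 effectively disables NCQ for a libata device.
    `sysfs_root` is only a parameter so the function can be tried
    against a scratch directory (hypothetical helper, run as root).
    """
    if not 1 <= depth <= 32:
        raise ValueError("queue depth must be between 1 and 32")
    path = Path(sysfs_root) / "block" / disk / "device" / "queue_depth"
    previous = int(path.read_text().strip())
    path.write_text("%d\n" % depth)
    return previous
```

Calling `set_queue_depth("sda", 1)` would disable NCQ on `sda` and hand back the old depth (31 for these drives, going by the "NCQ (depth 31/32)" line in the boot log), so it can be restored later with `set_queue_depth("sda", 31)`. The setting does not survive a reboot.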

This server has an Intel C220 SATA chipset:

00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05) (prog-if 01 [AHCI 1.0])

Subsystem: Super Micro Computer Inc Device 086d

Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+

Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR-

Latency: 0

Interrupt: pin A routed to IRQ 164

Region 0: I/O ports at f070 [size=8]

Region 1: I/O ports at f060 [size=4]

Region 2: I/O ports at f050 [size=8]

Region 3: I/O ports at f040 [size=4]

Region 4: I/O ports at f020 [size=32]

Region 5: Memory at fb312000 (32-bit, non-prefetchable) [size=2K]

Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-

Address: fee002b8 Data: 0000

Capabilities: [70] Power Management version 3

Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)

Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-

Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004

Kernel driver in use: ahci

00: 86 80 02 8c 07 04 b0 02 05 01 06 01 00 00 00 00

10: 71 f0 00 00 61 f0 00 00 51 f0 00 00 41 f0 00 00

20: 21 f0 00 00 00 20 31 fb 00 00 00 00 d9 15 6d 08

30: 00 00 00 00 80 00 00 00 00 00 00 00 0b 01 00 00

Are you aware of any problems with C220 chipset and S3610 drives?

jbenavides
Valued Contributor II

We are very interested in this issue and we'll need to do more research about it. We will contact you via Private Message individually with further details and to request any additional information.

Hi guys,

Please post here when you have some progress on the subject.

I am having a similar problem with S3710 800G drives, connected to an LSI MegaRAID SAS 9271-4i via a Supermicro expander backplane. The problems appear at almost zero load.

I have 4x800G S3710 in RAID10 array.

On two of the ports I was getting errors like this (errors are from LSI storage manager):

Aug 2 05:07:06 h19 MR_MONITOR[3772]: Controller ID: 0 PD Reset: PD # 012= -:-:3, Critical # 012= 3, Path =# 012 0x5003048000F3BE0F# 012Event ID:268

Aug 2 05:07:07 h19 MR_MONITOR[3772]: Controller ID: 0 Command timeout on PD: PD # 012= -:-:3No addtional sense information, CDB =0x48 0xd0 0xc0 0x00 0x00 0x00 0x00 0x00 0x08 0x00, Sense = , Path =# 012 0x5003048000F3BE0F# 012Event ID:267

Aug 2 05:07:07 h19 MR_MONITOR[3772]: Controller ID: 0 Command timeout on PD: PD # 012= -:-:3No addtional sense information, CDB =0x58 0xd0 0xc0 0x00 0x00 0x00 0x00 0x00 0x08 0x00, Sense = , Path =# 012 0x5003048000F3BE0F# 012Event ID:267

Aug 2 05:07:07 h19 MR_MONITOR[3772]: Controller ID: 0 Unexpected sense: PD # 012= -:-:3Power on, reset, or bus device reset occurred, CDB =0x2a 0x00 0x00 0xc0 0xd0 0x58 0x00 0x00 0x08 0x00, Sense =0x70 0x00 0x06 0x00 0x00 0x00 0x00 0x0a 0x00 0x00 0x00 0x00 0x29 0x00 0x00 0x00 0x00 0x00

We contacted our vendor and they recommended flashing the firmware of the SSDs. However, just to make sure that everything is OK with the backplane, we swapped the bays of all four disks (port0 with port2, port1 with port3), and the problem has somehow disappeared, at least for the last 3-4 days.
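The last MR_MONITOR entry above (the "Unexpected sense" one) can be read by hand. A small Python sketch, assuming the CDB is a standard SCSI WRITE(10) (opcode 0x2a) and the sense buffer is fixed-format sense data (response code 0x70); all function names here are made up for the example:

```python
# Sense keys as defined by the SCSI Primary Commands standard.
SENSE_KEYS = {
    0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
    0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION", 0x7: "DATA PROTECT", 0xB: "ABORTED COMMAND",
}


def parse_hex_bytes(text):
    """Turn '0x2a 0x00 ...' as printed in the log into a bytes object."""
    return bytes(int(tok, 16) for tok in text.split())


def decode_write10(cdb):
    """Return (lba, blocks) for a WRITE(10) CDB."""
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7..8: transfer length
    return lba, blocks


def decode_fixed_sense(sense):
    """Return (sense key name, ASC, ASCQ) for fixed-format sense data."""
    if len(sense) < 14 or (sense[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = sense[2] & 0x0F
    return SENSE_KEYS.get(key, "key 0x%x" % key), sense[12], sense[13]


if __name__ == "__main__":
    cdb = parse_hex_bytes("0x2a 0x00 0x00 0xc0 0xd0 0x58 0x00 0x00 0x08 0x00")
    sense = parse_hex_bytes(
        "0x70 0x00 0x06 0x00 0x00 0x00 0x00 0x0a 0x00 0x00 "
        "0x00 0x00 0x29 0x00 0x00 0x00 0x00 0x00"
    )
    print(decode_write10(cdb))
    print(decode_fixed_sense(sense))
```

That entry decodes to an 8-block write at LBA 0xc0d058 answered with UNIT ATTENTION, ASC/ASCQ 29h/00h, which is the "Power on, reset, or bus device reset occurred" text the controller already printed. So that line reports the reset that followed the timeouts and is not a separate fault. The two "Command timeout" CDBs are left alone here since they do not look like plain WRITE(10) commands.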