I have six Intel P4510 NVMe drives in a RAID 6 array. Under seemingly random circumstances, I am getting kernel messages such as:
Dec 18 01:34:26 dimebox kernel: nvme nvme0: I/O 55 QID 52 timeout, reset controller
The issue seems to be triggered more frequently during periods of high I/O, with many reads and simultaneous writes. The machine has not failed outright, but while the controller is being reset, all I/O operations stall.
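As a stopgap while the root cause is unclear, the kernel's per-command NVMe timeout can be raised so that slow completions are less likely to escalate into a controller reset. A sketch, assuming the nvme_core module on this kernel exposes its io_timeout parameter under /sys (the value 255 is an arbitrary example, not a recommendation):

```shell
# Current NVMe I/O timeout in seconds (the default is typically 30):
cat /sys/module/nvme_core/parameters/io_timeout

# Raise it at runtime (root required; affects newly issued commands):
echo 255 > /sys/module/nvme_core/parameters/io_timeout

# To persist across reboots, append to GRUB_CMDLINE_LINUX in
# /etc/default/grub and regenerate grub.cfg:
#   nvme_core.io_timeout=255
```

This does not fix whatever is stalling the drive, but it can keep a slow command from triggering a full reset that stalls all I/O.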
The pertinent operating system information is:
[root@dimebox ~]# cat /etc/redhat-release ; uname -a
CentOS Linux release 7.6.1810 (Core)
Linux dimebox.stata.com 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 29 14:49:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@dimebox ~]# df -hl | grep dev
/dev/md127 7.0T 1.4T 5.3T 21% /
devtmpfs 63G 0 63G 0% /dev
tmpfs 63G 4.0K 63G 1% /dev/shm
/dev/md125 249M 12M 238M 5% /boot/efi
[root@dimebox ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1]
md125 : active raid1 nvme5n1p3 nvme2n1p3 nvme4n1p3 nvme3n1p3 nvme0n1p3 nvme1n1p3
254912 blocks super 1.0 [6/6] [UUUUUU]
bitmap: 0/1 pages [0KB], 65536KB chunk
md126 : active raid6 nvme3n1p2 nvme1n1p2 nvme5n1p2 nvme4n1p2 nvme0n1p2 nvme2n1p2
16822272 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
md127 : active raid6 nvme1n1p1 nvme3n1p1 nvme5n1p1 nvme4n1p1 nvme0n1p1 nvme2n1p1
7516188672 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
bitmap: 8/14 pages [32KB], 65536KB chunk
unused devices: <none>
As you can see, I also overprovisioned the drives, leaving approximately 70 GB unallocated on each:
[root@dimebox ~]# parted /dev/nvme0n1 unit MB print
Model: NVMe Device (nvme)
Disk /dev/nvme0n1: 2000399MB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Number Start End Size File system Name Flags
1 1.05MB 1924281MB 1924280MB raid
2 1924281MB 1928592MB 4312MB raid
3 1928592MB 1928853MB 261MB fat16 raid
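For reference, the unallocated span implied by the parted listing can be computed directly from the disk size and the end of the last partition; a quick sketch using the numbers above:

```shell
disk_mb=2000399      # total disk size from the parted output
last_end_mb=1928853  # end of partition 3
free_mb=$((disk_mb - last_end_mb))
free_pct=$(awk -v f="$free_mb" -v d="$disk_mb" 'BEGIN { printf "%.1f", 100 * f / d }')
echo "${free_mb} MB (${free_pct}%) left unallocated"
```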
In the attached nvme.txt, you can see the output of isdct show -a -intelssd; nvme2.txt contains the kernel's ring buffer from the last boot, filtered on nvme entries. What I find most interesting is that not every "timeout, aborting" entry triggers a reset of the controller. I am also not certain whether those timeout entries are noticeable to the user, but the "timeout, reset controller" events certainly are.
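To separate the two classes of events in the ring buffer, the nvme lines can be counted by type; a sketch using sample lines inline (on the real host, feed `dmesg | grep nvme` through the same filters instead):

```shell
# Two representative kernel-log lines: one escalated reset, one plain abort.
log='Dec 18 01:34:26 dimebox kernel: nvme nvme0: I/O 55 QID 52 timeout, reset controller
Dec 18 01:31:02 dimebox kernel: nvme nvme0: I/O 12 QID 7 timeout, aborting'

# Count how many timeouts escalated to a controller reset vs. plain aborts:
resets=$(printf '%s\n' "$log" | grep -c 'timeout, reset controller')
aborts=$(printf '%s\n' "$log" | grep -c 'timeout, aborting')
echo "resets=$resets aborts=$aborts"
```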
Does Intel have any idea what could be triggering these events and, more importantly, how to avoid them?
As a follow-up, I also have another system, almost identical to the first but without the RAID 6 configuration (the RAID 6 system uses ext4; this single-drive system uses XFS). On it I can reproduce the issue even more easily. I have a directory with 913,876 files:
[root@dimebuild home]# ls
here2 here22 here2.tar.gz here3 here4 here.tar.gz
[root@dimebuild home]# find . | wc -l
I attempt to make an archive using parallel gzip:
tar cf - ./here2 ./here3 ./here4 ./here.tar.gz here22 | pigz > here2.tar.gz
The controller resets quite predictably. See the attached dmi.txt (dmidecode output) and lspci.txt (lspci -v -v -v -v output).
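For anyone trying to reproduce without this exact directory tree, a sketch that generates a comparable many-small-files workload (the file count and sizes here are scaled-down assumptions; the original tree had roughly 914k files):

```shell
# Generate a tree of many small files (1000 files of 4 KiB each).
mkdir -p repro
for i in $(seq 1 1000); do
    head -c 4096 /dev/urandom > "repro/file_$i"
done

# Then archive it through parallel gzip, as in the failing case:
#   tar cf - ./repro | pigz > repro.tar.gz
echo "created $(ls repro | wc -l) files"
```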
Just to update: this appears to be a function of random read I/O. To verify, I ran:
cat /dev/zero > myfile
cat /dev/nvme0n1 > myfile
Neither of the above produced any timeouts or controller resets, and neither is a random-read workload. In my tests, I created 300 GB files with both commands without issue.
tar cf file.tar ./<some directories with 2.2 million files>
With this command, I can get the timeout/controller reset to occur multiple times within a short interval (1 to 10 minutes).
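The random-read hypothesis could also be checked with fio, which removes filesystem metadata traversal from the picture. A sketch only; the job parameters are assumptions, it requires fio and root, and the device path must match your system (it is opened read-only, but double-check it regardless):

```shell
# Pure 4 KiB random-read load against one drive for 5 minutes.
fio --name=randread --filename=/dev/nvme0n1 --readonly \
    --rw=randread --bs=4k --iodepth=64 \
    --ioengine=libaio --direct=1 \
    --runtime=300 --time_based
```

If this alone triggers the "timeout, reset controller" messages, it would confirm that random reads, not tar or the filesystem, are the trigger.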
As it turns out, the server was incorrectly configured. Once the proper amount of RAM was installed, the I/O timeout errors went away. Please feel free to close this case. Thank you.
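For anyone hitting the same symptom, the installed memory can be verified per slot with `free -h` and `dmidecode -t memory`. A sketch that totals the DIMM sizes from dmidecode-style output; the sample lines are illustrative (on the real host, pipe `dmidecode -t memory` through the same awk):

```shell
# Sample "Size:" lines as emitted by dmidecode -t memory, one per slot.
sample='  Size: 16384 MB
  Size: No Module Installed
  Size: 16384 MB
  Size: 16384 MB'

# Sum only the populated slots; empty slots say "No Module Installed".
total_mb=$(printf '%s\n' "$sample" | awk '/Size: [0-9]+ MB/ { t += $2 } END { print t+0 }')
echo "total installed: ${total_mb} MB"
```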