Pete_H_
Beginner
2,023 Views

Intel SSD DC P4510 Series, reset controller

I have six P4510 drives in a RAID 6 array. Under seemingly random circumstances, I get kernel messages such as:

 

Dec 18 01:34:26 dimebox kernel: nvme nvme0: I/O 55 QID 52 timeout, reset controller

 

The issue seems to be triggered more often during periods of heavy I/O, with many reads and simultaneous writes. The machine has yet to fail, but whenever the controller is reset, all I/O operations stall.

 

The pertinent operating system information is:

 

[root@dimebox ~]# cat /etc/redhat-release ; uname -a
CentOS Linux release 7.6.1810 (Core)
Linux dimebox.stata.com 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 29 14:49:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

 

[root@dimebox ~]# df -hl | grep dev
/dev/md127      7.0T  1.4T  5.3T  21% /
devtmpfs         63G     0   63G   0% /dev
tmpfs            63G  4.0K   63G   1% /dev/shm
/dev/md125      249M   12M  238M   5% /boot/efi

 

[root@dimebox ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1]
md125 : active raid1 nvme5n1p3[5] nvme2n1p3[2] nvme4n1p3[4] nvme3n1p3[3] nvme0n1p3[0] nvme1n1p3[1]
      254912 blocks super 1.0 [6/6] [UUUUUU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md126 : active raid6 nvme3n1p2[3] nvme1n1p2[1] nvme5n1p2[5] nvme4n1p2[4] nvme0n1p2[0] nvme2n1p2[2]
      16822272 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]

md127 : active raid6 nvme1n1p1[1] nvme3n1p1[3] nvme5n1p1[5] nvme4n1p1[4] nvme0n1p1[0] nvme2n1p1[2]
      7516188672 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
      bitmap: 8/14 pages [32KB], 65536KB chunk

unused devices: <none>
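Because a controller reset stalls I/O on a member drive, it may be worth confirming after each event that no array has dropped a device. The following is a minimal sketch, assuming the /proc/mdstat layout shown above, where the first number in [6/6] is the configured device count and the second is the number currently active. The function name md_array_status is made up for illustration.

```shell
# Print "<array> <active>/<configured> <ok|degraded>" for each md array.
# Reads /proc/mdstat-style text on stdin.
md_array_status() {
    awk '
        # Remember the array name from lines such as "md127 : active raid6 ..."
        /^md[0-9]+ :/ { name = $1 }

        # The status line carries a bracketed pair such as [6/6]
        match($0, /\[[0-9]+\/[0-9]+\]/) {
            pair = substr($0, RSTART + 1, RLENGTH - 2)
            split(pair, n, "/")          # n[1] = configured, n[2] = active
            state = (n[1] == n[2]) ? "ok" : "degraded"
            print name, n[2] "/" n[1], state
        }
    '
}
```

Usage would be `md_array_status < /proc/mdstat`; any line ending in "degraded" means an array is running with fewer devices than configured.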

 

As you can see below, I also overprovisioned the drives, leaving approximately 70GB unpartitioned on each:

 

[root@dimebox ~]# parted /dev/nvme0n1 unit MB print
Model: NVMe Device (nvme)
Disk /dev/nvme0n1: 2000399MB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start      End        Size       File system  Name  Flags
 1      1.05MB     1924281MB  1924280MB                     raid
 2      1924281MB  1928592MB  4312MB                        raid
 3      1928592MB  1928853MB  261MB      fat16              raid
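The unpartitioned tail can be read off the parted listing: the disk is 2000399MB and the last partition ends at 1928853MB, which leaves 71546MB (about 71.5GB, roughly 3.6% of the drive). A small sketch of that arithmetic, assuming parted was run with `unit MB` as above; the function name unallocated_mb is hypothetical, and the 1.05MB gap before the first partition is ignored.

```shell
# Print the space (in MB) between the end of the last partition and the
# end of the disk. Reads "parted <dev> unit MB print" output on stdin.
unallocated_mb() {
    awk '
        # "Disk /dev/nvme0n1: 2000399MB" -> total size in field 3
        /^Disk \/dev\// { size = $3; sub(/MB$/, "", size) }

        # Partition rows start with a number; field 3 is the End column
        $1 ~ /^[0-9]+$/ && $3 ~ /MB$/ {
            end = $3; sub(/MB$/, "", end)
            if (end + 0 > last) last = end + 0
        }

        END { print size - last }
    '
}
```

For example, `parted /dev/nvme0n1 unit MB print | unallocated_mb` on the drive above would print 71546.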

 

In the attached nvme.txt, you can see the output of isdct show -a -intelssd. nvme2.txt contains the kernel's ring buffer filtered on nvme entries from the last boot. What I find most interesting is that not every "timeout, aborting" entry triggers a reset of the controller. I am also not certain whether the "timeout, aborting" events are noticeable to users, but the "timeout, reset controller" events certainly are.
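To see how often an abort escalates to a reset, the two kinds of entry can be tallied per controller. This is a rough sketch, assuming the "timeout, aborting" lines share the layout of the "timeout, reset controller" line quoted earlier; summarize_nvme_timeouts is a made-up name.

```shell
# Count "timeout, aborting" and "timeout, reset controller" entries per
# NVMe controller. Reads kernel log lines on stdin and prints
# "<controller> <abort|reset> <count>", sorted.
summarize_nvme_timeouts() {
    awk '
        /nvme nvme[0-9]+: I\/O [0-9]+ QID [0-9]+ timeout, / {
            # Locate the controller name, e.g. "nvme0:", after the "nvme" tag
            for (i = 1; i <= NF; i++) {
                if ($i == "nvme" && $(i + 1) ~ /^nvme[0-9]+:$/) {
                    ctrl = $(i + 1); sub(/:$/, "", ctrl); break
                }
            }
            action = ($0 ~ /timeout, reset controller/) ? "reset" : "abort"
            count[ctrl " " action]++
        }
        END { for (k in count) print k, count[k] }
    ' | sort
}
```

It could be fed from `dmesg | summarize_nvme_timeouts` or from a saved log file. Comparing the abort and reset counts for each controller shows whether one drive accounts for most of the escalations.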

 

Does Intel have any idea what could be triggering these events and, more importantly, how to avoid them?

13 Replies
JosafathB_Intel
Moderator
319 Views

Hello Pete H.

Thank you for contacting Intel® Technical Support. As we understand it, you need assistance with your Intel® SSD DC P4510 Series. If that is correct, we would appreciate it if you could provide the following information:

• A copy of the SMART logs of your SSD, collected with Intel® SSD DCT using the following command:

isdct show [-all|-a]

For further information on how to use this command, please see page 25, section 2.1.3 of the Intel® Solid State Drive Data Center Tool User Guide (https://www.intel.com/content/dam/support/us/en/documents/memory-and-storage/Intel_SSD_DCT_3_0_x_Use...).

• The SSU logs:

1- Go to https://downloadcenter.intel.com/download/26735/ and download the software.
2- When the download finishes, open it.
3- Attach the file obtained to your reply.

We look forward to your reply. Have a nice day.

Best regards,
Josh B.
Intel® Customer Support Technician
Under Contract to Intel Corporation
Pete_H_
Beginner
319 Views

Josh,

 

In my first message I attached nvme.txt, which should contain the information you are looking for.

 

-Pete

Pete_H_
Beginner
319 Views

As a follow-up, I also have another system, almost identical to the first but with a single drive instead of a RAID 6 array. The RAID 6 system uses EXT4; the single-drive system uses XFS. On the single-drive system I can reproduce the issue even more easily. I have a directory with 913876 files:

 

[root@dimebuild home]# ls
here2 here22 here2.tar.gz here3 here4 here.tar.gz
[root@dimebuild home]# find . | wc -l
913876

 

I attempt to make an archive using parallel gzip:

 

tar cf - ./here2 ./here3 ./here4 ./here.tar.gz here22 | pigz > here2.tar.gz

 

The controller resets quite predictably. See the attached dmi.txt (dmidecode output) and lspci.txt (lspci -v -v -v -v output).
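One way to put a number on "quite predictably" is to count how many reset messages appear while a reproducer runs. The sketch below assumes kernel messages are written to a log file (on a default CentOS 7 install that is typically /var/log/messages, but this is an assumption about the setup); resets_during is a hypothetical helper name.

```shell
# Run a command and print how many "timeout, reset controller" lines were
# appended to the given log file while it ran.
# Usage: resets_during LOGFILE COMMAND [ARGS...]
resets_during() {
    log=$1
    shift
    # grep -c prints 0 but exits non-zero when nothing matches
    before=$(grep -c 'timeout, reset controller' "$log" || true)
    "$@" > /dev/null
    after=$(grep -c 'timeout, reset controller' "$log" || true)
    echo $((after - before))
}
```

A pipeline has to be wrapped in a shell, for example `resets_during /var/log/messages sh -c 'tar cf - ./here2 | pigz > here2.tar.gz'`. The count will be off if the log is rotated during the run.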

 

-Pete

 

Pete_H_
Beginner
319 Views

lspci.txt attached here.

Pete_H_
Beginner
319 Views

SSU logs attached.

JosafathB_Intel
Moderator
319 Views

Hello Pete H.

Thank you for your reply. We are reviewing the information you shared with us and will try to reproduce the issue you are reporting in our lab. We will contact you as soon as we have an update or if further information is required. Have a nice day.

Best regards,
Josh B.
Intel® Customer Support Technician
Under Contract to Intel Corporation
Pete_H_
Beginner
319 Views

Josh,

 

Just to update, this appears to be a function of random read I/O. To verify:

 

cat /dev/zero > myfile

cat /dev/nvme0n1 > myfile

 

Neither of the above commands produced any timeouts or controller resets, and neither performs random reads. In my tests, I created 300GB files with both commands without issue. In contrast, with:

 

tar cf file.tar ./<some directories with 2.2 million files>

 

I can get the timeout/controller reset to occur multiple times within a short interval (1 to 10 minutes).
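To separate the random-read pattern from tar itself, the same tree can be read in shuffled order with nothing written out. This is only a rough stand-in for tar's access pattern, and a purpose-built tool such as fio would give a more controlled load. The function name random_read_tree is made up, and file names containing newlines are not handled.

```shell
# Read every regular file under a directory in random order and print the
# total number of bytes read.
# Usage: random_read_tree DIR
random_read_tree() {
    dir=$1
    find "$dir" -type f |
        # Prefix each path with a random key, sort on it, then drop the key
        awk 'BEGIN { srand() } { printf "%.9f\t%s\n", rand(), $0 }' |
        sort -n |
        cut -f2- |
        while IFS= read -r f; do cat "$f"; done |
        wc -c |
        tr -d ' '
}
```

For the reads to reach the drive rather than the page cache, the cache would need to be dropped first (as root, `sync; echo 3 > /proc/sys/vm/drop_caches`), and the kernel log checked afterwards for new timeout entries.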

 

-Pete

 

JosafathB_Intel
Moderator
319 Views

Hello Pete H.

Thank you for your reply. Based on the logs you shared, your system is a Super Micro* model H11DSU-iN. With that in mind, we noticed the following:

• According to the “System HDD / SSD [H11DSU-iN]” list of compatible hardware on your original equipment manufacturer (OEM) website, your “INTEL SSDPE2KX020T8” is not listed as tested, validated or compatible with your server system. https://www.supermicro.com/support/resources/HDD/systemHDD.cfm?ProductID=85751&forMB=true

• You are running “CentOS Linux release 7.6.1810 (Core)”, whereas your server system is tested and validated to work with CentOS 7.3, as stated in the OS compatibility list on your OEM website. https://www.supermicro.com/Aplus/support/resources/OS/OS_Comp_EPYC7000.cfm

• We tried to reproduce your issue in our lab using CentOS 7, but we did not experience it.

• We advise you to open a ticket in parallel with your OEM, Super Micro*, to check whether any known hardware issue or limitation could be related to what you are experiencing with your current configuration.

We look forward to your reply. Have a nice day.

Best regards,
Josh B.
Intel® Customer Support Technician
Under Contract to Intel Corporation
Pete_H_
Beginner
319 Views

Josh,

 

Thank you for the detective work. Apart from the OS installation, the system was assembled entirely by SuperMicro, so I will kick this back to them.

 

-Pete

JosafathB_Intel
Moderator
319 Views

Hello Pete H,

Thank you for your reply. We look forward to hearing what your OEM recommends. If you have further questions, please don’t hesitate to contact us. We will be more than happy to help you in any way we can.

Best regards,
Josh B.
Intel® Customer Support Technician
Under Contract to Intel Corporation
JosafathB_Intel
Moderator
319 Views

Hello Pete H,

Thank you for contacting Intel® Technical Support. We were reviewing your community post and would like to know whether you need further assistance or whether we can close this case. We look forward to your reply.

Best regards,
Josh B.
Intel® Customer Support Technician
Under Contract to Intel Corporation
Pete_H_
Beginner
319 Views

Josh,

 

As it turns out, the server was incorrectly configured. Once the proper amount of RAM was installed, the IRQ timeout errors went away. Please feel free to close this case, thank you.
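The thread does not say exactly what was wrong with the memory configuration, so the following is only a sketch for anyone checking a similar system. It counts populated versus total DIMM slots in `dmidecode -t memory` output, assuming each Memory Device block carries a "Size:" line that reads "No Module Installed" for an empty slot. The name dimm_population is hypothetical.

```shell
# Print "<populated>/<total>" DIMM slots.
# Reads "dmidecode -t memory" output on stdin.
dimm_population() {
    awk '
        # Only lines that begin with "Size:" after indentation are counted,
        # so lines such as "Volatile Size:" are skipped
        /^[ \t]*Size:/ {
            total++
            if ($0 !~ /No Module Installed/) populated++
        }
        END { printf "%d/%d\n", populated, total }
    '
}
```

As root, `dmidecode -t memory | dimm_population` gives a quick figure to compare against the board vendor's recommended population.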

 

-Pete

 

JosafathB_Intel
Moderator
319 Views

Hello Pete H,

Thank you for your reply. It has been a pleasure to assist you through this process. With your consent, this case is now closed. If you need further assistance, please do not hesitate to contact us again.

Best regards,
Josh B.
Intel Customer Support Technician
Under Contract to Intel Corporation