Intel® Optane™ Solid State Drives

P5800X \ PHAL135400VK400BGN - errors with Optane - Urgent

AvivGraupen
Employee

Hi All,

How are you? I hope this is the right place to post.

Below is an issue we are seeing: NVMe errors with Optane drives such as PHAL135400VK400BGN, FW rev L0310100. (As far as I know, the latest firmware for this P5800X is L0310200.)

Has anyone seen a similar issue who can advise?

 

  1. In the kernel log, we see the following just before the system hangs and reboots.

As you can see, the driver lost contact (timeout) with the NVMe controller.

 

4:23:07 nblab37 kernel: [ 9472.111963] nvme nvme10: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff

Aug 11 14:23:54 nblab37 kernel: [ 9506.768941] watchdog: BUG: soft lockup - CPU#32 stuck for 22s! [kworker/32:0:65627]

Aug 11 14:23:54 nblab37 kernel: [ 9515.090600] nvme nvme9: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff

Aug 11 14:23:54 nblab37 kernel: [ 9518.848748] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache binfmt_misc nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm input_leds joydev rndis_host cdc_ether usbnet mii isst_if_mbox_pci nbimpu(O) isst_if_mmio isst_if_common mei_me mei ioatdma wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter mac_hid sch_fq_codel sunrpc msr ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor raid6_pq raid1 raid0 multipath linear hid_generic usbhid hid ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul sysfillrect crc32_pclmul sysimgblt fb_sys_fops ghash_clmulni_intel drm aesni_intel ixgbe glue_helper crypto_simd cryptd nvme mdio dca nvme_core ahci i2c_i801 libahci [last unloaded: diag_slim_drv]

Aug 11 14:23:59 nblab37 kernel: [ 9521.264707] CPU: 32 PID: 65627 Comm: kworker/32:0 Tainted: G           OE     5.4.0 #3

Aug 11 14:23:59 nblab37 kernel: [ 9521.264707] Hardware name: Supermicro SYS-220U-TNR/X12DPU-6, BIOS 1.1 08/12/2021

Aug 11 14:23:59 nblab37 kernel: [ 9521.264715] Workqueue: events psi_avgs_work

Aug 11 14:23:59 nblab37 kernel: [ 9521.264722] RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20

Aug 11 14:23:59 nblab37 kernel: [ 9521.264724] Code: ff 7f 5b 44 89 f0 41 5c 41 5d 41 5e 41 5f 5d c3 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 49 89 f8 b8 00

Aug 11 14:23:59 nblab37 kernel: [ 9521.264725] RSP: 0018:ffffaba18d33cc80 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13

Aug 11 14:23:59 nblab37 kernel: [ 9521.264726] RAX: 0000000000000001 RBX: ffff9941f143eec0 RCX: 0000000000000000

Aug 11 14:23:59 nblab37 kernel: [ 9521.264727] RDX: ffff9941f143eec8 RSI: 0000000000000246 RDI: 0000000000000246

Aug 11 14:23:59 nblab37 kernel: [ 9521.264727] RBP: ffffaba18d33cc80 R08: ffff9941f5ecc660 R09: 000000000002aa00

Aug 11 14:23:59 nblab37 kernel: [ 9521.264728] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001

Aug 11 14:23:59 nblab37 kernel: [ 9521.264728] R13: 0000000000000246 R14: 0000000000000001 R15: 0000000000000001

Aug 11 14:23:59 nblab37 kernel: [ 9522.070031] FS:  0000000000000000(0000) GS:ffff9961fec00000(0000) knlGS:0000000000000000

Aug 11 14:23:59 nblab37 kernel: [ 9522.070033] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

Aug 11 14:23:59 nblab37 kernel: [ 9522.070035] CR2: 000000059669e000 CR3: 0000001e55e84001 CR4: 0000000000760ee0

Aug 11 14:23:59 nblab37 kernel: [ 9522.070036] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

Aug 11 14:23:59 nblab37 kernel: [ 9522.070037] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Aug 11 14:23:59 nblab37 kernel: [ 9522.070037] PKRU: 55555554

Aug 11 14:23:59 nblab37 kernel: [ 9522.070038] Call Trace:

Aug 11 14:23:59 nblab37 kernel: [ 9522.070040]  <IRQ>

Aug 11 14:23:59 nblab37 kernel: [ 9522.070045]  __wake_up_common_lock+0x8a/0xc0

Aug 11 14:23:59 nblab37 kernel: [ 9522.070047]  __wake_up_sync_key+0x1e/0x30

Aug 11 14:23:59 nblab37 kernel: [ 9522.875353]  sock_def_readable+0x40/0x70

Aug 11 14:23:59 nblab37 kernel: [ 9522.875357]  __netlink_sendskb+0x42/0x50

Aug 11 14:23:59 nblab37 kernel: [ 9522.875360]  netlink_broadcast_filtered+0x332/0x3e0

Aug 11 14:23:59 nblab37 kernel: [ 9522.875361]  nlmsg_notify+0xc9/0xe0

Aug 11 14:23:59 nblab37 kernel: [ 9522.875364]  ? smp_irq_move_cleanup_interrupt+0xcb/0xd2

Aug 11 14:23:59 nblab37 kernel: [ 9522.875368]  rtnl_notify+0x34/0x40

Aug 11 14:23:59 nblab37 kernel: [ 9522.875371]  __neigh_notify+0x86/0xd0

Aug 11 14:23:59 nblab37 kernel: [ 9522.875373]  ? neigh_periodic_work+0x220/0x220

Aug 11 14:23:59 nblab37 kernel: [ 9522.875375]  neigh_timer_handler+0xaa/0x280

Aug 11 14:23:59 nblab37 kernel: [ 9522.875377]  call_timer_fn+0x32/0x130

Aug 11 14:23:59 nblab37 kernel: [ 9522.875378]  __run_timers.part.0+0x180/0x280

Aug 11 14:23:59 nblab37 kernel: [ 9523.680682]  run_timer_softirq+0x2a/0x50

Aug 11 14:23:59 nblab37 kernel: [ 9523.680685]  __do_softirq+0xd1/0x2c1

Aug 11 14:23:59 nblab37 kernel: [ 9523.680689]  irq_exit+0xae/0xb0

Aug 11 14:23:59 nblab37 kernel: [ 9523.680690]  smp_apic_timer_interrupt+0x7b/0x140

Aug 11 14:23:59 nblab37 kernel: [ 9523.680692]  apic_timer_interrupt+0xf/0x20

Aug 11 14:23:59 nblab37 kernel: [ 9523.680693]  </IRQ>

Aug 11 14:23:59 nblab37 kernel: [ 9523.680694] RIP: 0010:mutex_lock+0x0/0x40

Aug 11 14:23:59 nblab37 kernel: [ 9523.680695] Code: a6 4d 62 ff 66 0f 1f 44 00 00 0f 1f 44 00 00 55 be 02 00 00 00 48 89 e5 e8 fd fa ff ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 <0f> 1f 44 00 00 55 48 89 e5 41 54 49 89 fc e8 8d de ff ff 31 c0 65

Aug 11 14:23:59 nblab37 kernel: [ 9523.680696] RSP: 0018:ffffaba1a2a5be28 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13

Aug 11 14:23:59 nblab37 kernel: [ 9523.680698] RAX: 0000000000000000 RBX: ffff9941f67e9428 RCX: ffff9940570e9430

Aug 11 14:23:59 nblab37 kernel: [ 9523.680698] RDX: 0000000000000001 RSI: ffff9941fec210b0 RDI: ffff9941f67e93c8

Aug 11 14:23:59 nblab37 kernel: [ 9523.680699] RBP: ffffaba1a2a5be60 R08: 000073746e657665 R09: 8080808080808080

Aug 11 14:23:59 nblab37 kernel: [ 9523.680700] R10: ffff99605a2cef6c R11: 0000000000000018 R12: ffff9941f67e9428

Aug 11 14:23:59 nblab37 kernel: [ 9523.680700] R13: ffff9941f67e93c8 R14: 0000000000000000 R15: ffff99605a2cef00

Aug 11 14:23:59 nblab37 kernel: [ 9523.680703]  ? psi_avgs_work+0x32/0xd0

Aug 11 14:23:59 nblab37 kernel: [ 9523.680705]  process_one_work+0x1eb/0x3b0

Aug 11 14:23:59 nblab37 kernel: [ 9523.680707]  worker_thread+0x4d/0x400

Aug 11 14:23:59 nblab37 kernel: [ 9524.486013]  kthread+0x104/0x140

Aug 11 14:23:59 nblab37 kernel: [ 9524.486015]  ? process_one_work+0x3b0/0x3b0

Aug 11 14:23:59 nblab37 kernel: [ 9524.486016]  ? kthread_park+0x90/0x90

Aug 11 14:23:59 nblab37 kernel: [ 9524.486017]  ret_from_fork+0x1f/0x40

 

  2. There are also errors in the NVMe error log (shown below):

 

nvme error-log /dev/nvme10

Error Log Entries for device:nvme10 entries:64
.................
Entry[ 0]
.................
error_count  : 1
sqid         : 129
cmdid        : 0xffff
status_field : 0xc00c(INTERNAL: The command was not completed successfully due to an internal error)
parm_err_loc : 0xffff
lba          : 0
nsid         : 0xffffffff
vs           : 0
cs           : 0
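As a cross-check of the `status_field` value, here is a small sketch that splits the 16-bit word into its sub-fields. It assumes the layout I understand the NVMe specification to use for this log entry (bit 0 is the phase tag, the status field proper sits in bits 15:1); the function name `decode_status_field` is only illustrative, so please verify the bit positions against the spec before relying on it.

```python
def decode_status_field(value):
    """Split a 16-bit NVMe status word into its sub-fields.

    Assumed layout: bit 0 = phase tag, bits 8:1 = Status Code (SC),
    bits 11:9 = Status Code Type (SCT), bits 13:12 = Command Retry Delay,
    bit 14 = More, bit 15 = Do Not Retry (DNR).
    """
    return {
        "phase": value & 0x1,
        "sc": (value >> 1) & 0xFF,
        "sct": (value >> 9) & 0x7,
        "crd": (value >> 12) & 0x3,
        "more": (value >> 14) & 0x1,
        "dnr": (value >> 15) & 0x1,
    }


# Value taken from Entry[0] of the error log above.
fields = decode_status_field(0xC00C)
print(fields)
```

Under that layout, 0xc00c gives SCT 0 (generic command status) with SC 0x06, which matches the "INTERNAL" text that nvme-cli printed, and the Do Not Retry bit is set.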

 

  3. The SSD model of the Xiphos is PHAL135400VK400BGN, FW rev L0310100:

 

 

nblab62:~> sudo nvme list

Node             SN                   Model                                    Namespace Usage                      Format           FW Rev

---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------

/dev/nvme0n1     PHAL135400VK400BGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB      4 KiB +  0 B   L0310100

/dev/nvme10n1    PHAL11110060400AGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB      4 KiB +  0 B   L0310100

/dev/nvme11n1    PHAL135400VT400BGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB      4 KiB +  0 B   L0310100

/dev/nvme12n1    PHAL135401LT400BGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB      4 KiB +  0 B   L0310100

/dev/nvme13n1    81N0A0C5T5M8         KCD6XLUL960G                             1         571.93  GB / 960.20  GB    512   B +  0 B   0106

/dev/nvme14n1    81N0A0BQT5M8         KCD6XLUL960G                             1         167.47  GB / 960.20  GB    512   B +  0 B   0106

/dev/nvme15n1    S546NE0R600370       SAMSUNG MZWLJ7T6HALA-00007               1           7.68  TB /   7.68  TB    512   B +  0 B   EPK9AB5Q

/dev/nvme16n1    PHAL135400Q7400BGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB      4 KiB +  0 B   L0310100

/dev/nvme17n1    PHAL135400PS400BGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB      4 KiB +  0 B   L0310100

/dev/nvme18n1    PHAL135400N3400BGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB      4 KiB +  0 B   L0310100

/dev/nvme1n1     PHAL135400VZ400BGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB      4 KiB +  0 B   L0310100

/dev/nvme2n1     PHAL135400S5400BGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB      4 KiB +  0 B   L0310100

/dev/nvme3n1     PHAL135401NB400BGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB      4 KiB +  0 B   L0310100

/dev/nvme4n1     PHAL135401P8400BGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB      4 KiB +  0 B   L0310100

/dev/nvme5n1     PHAL135400KS400BGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB      4 KiB +  0 B   L0310100

/dev/nvme6n1     PHAL135400MD400BGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB      4 KiB +  0 B   L0310100

/dev/nvme7n1     PHAL135400RF400BGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB      4 KiB +  0 B   L0310100

/dev/nvme8n1     PHAL135400NQ400BGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB      4 KiB +  0 B   L0310100

/dev/nvme9n1     PHAL135400U7400BGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB      4 KiB +  0 B   L0310100
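With this many drives, it can help to list which nodes are still on a given firmware revision. The sketch below parses the plain-text `nvme list` output shown above by taking the first, second, and last whitespace-separated tokens of each device row; the function names are hypothetical and the sample holds only three of the rows above.

```python
def parse_nvme_list(text):
    """Extract node, serial number and firmware revision from `nvme list` text."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        # Skip the header, the dashed separator and blank lines.
        if len(tokens) < 3 or not tokens[0].startswith("/dev/"):
            continue
        rows.append({
            "node": tokens[0],      # first column
            "serial": tokens[1],    # second column (SN)
            "fw_rev": tokens[-1],   # last column (FW Rev)
        })
    return rows


def drives_on_firmware(rows, fw_rev):
    """Return the device nodes whose FW Rev column equals fw_rev."""
    return [row["node"] for row in rows if row["fw_rev"] == fw_rev]


# Three rows copied from the output above.
SAMPLE = """\
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
/dev/nvme0n1     PHAL135400VK400BGN   INTEL SSDPF21Q400GB                      1         400.09  GB / 400.09  GB      4 KiB +  0 B   L0310100
/dev/nvme13n1    81N0A0C5T5M8         KCD6XLUL960G                             1         571.93  GB / 960.20  GB    512   B +  0 B   0106
/dev/nvme15n1    S546NE0R600370       SAMSUNG MZWLJ7T6HALA-00007               1           7.68  TB /   7.68  TB    512   B +  0 B   EPK9AB5Q
"""

rows = parse_nvme_list(SAMPLE)
print(drives_on_firmware(rows, "L0310100"))
```

Splitting on whitespace avoids depending on exact column widths, at the cost of not recovering the multi-word Model column.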

 

Thanks,
Aviv G.

BrusC_Intel
Employee

Hello, AvivGraupen.


Thank you for posting on the Intel Community Support Forum.


We received your ticket regarding this particular error with the Optane SSD, and I will be reviewing this with you.


There are no details available about this exact problem, but we always recommend making sure the drive is running the latest firmware version.


1. Can you update the firmware using the Intel Memory and Storage Tool (Intel MAS) CLI and check whether the error persists?


Commands:

  • "intelmas show -intelssd": Displays all connected Intel drives and the index of each.
  • "intelmas load -intelSSD X": Updates the firmware of the drive; replace X with the drive index.

 

2. If the error persists, please follow the instructions in this article to generate the "SMART", "Health", and "Show All" reports:

How to Get the SMART Attributes of an Intel® SSD using the Intel® Memory and Storage Tool GUI and CLI

  • "intelmas show -smart -intelssd X": Just replace the X with the correct drive index.
  • "intelmas show -a -intelssd X": Displays all the drive details.
  • "intelmas show -nvmelog SmartHealthInfo -intelssd X"
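When several drives are involved, the three report commands above can be generated per drive index rather than typed by hand. The following is a minimal sketch that only builds and prints the command strings (it does not run them); the function name `report_commands` and the example indexes 0 and 3 are placeholders, and the real indexes should come from "intelmas show -intelssd".

```python
# Command templates taken verbatim from the list above.
REPORT_TEMPLATES = [
    "intelmas show -smart -intelssd {index}",
    "intelmas show -a -intelssd {index}",
    "intelmas show -nvmelog SmartHealthInfo -intelssd {index}",
]


def report_commands(indexes):
    """Build the report commands for each drive index, grouped by drive."""
    return [
        template.format(index=index)
        for index in indexes
        for template in REPORT_TEMPLATES
    ]


# Hypothetical drive indexes, for illustration only.
for command in report_commands([0, 3]):
    print(command)
```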


3. Has this been tested on other systems, or on other system versions or distributions?


4. How exactly is the drive connected to the system?


I will follow up on August 19th in case additional time is required.


Regards,


Bruce C.

Intel Customer Support Technician


BrusC_Intel
Employee

Hello, AvivGraupen.


I wanted to follow up on your thread in case you had any questions regarding my previous message.


I will follow up again on August 24th in case additional time is required.


Regards,


Bruce C.

Intel Customer Support Technician


AvivGraupen
Employee

Hi, 
Thank you. For now, we have received from the Optane SSD engineering team a new firmware (L0310200) for the end customer to install, along with a new BIOS (1.4) for the Supermicro server. We will update with the results once we have them.


Thanks,
Aviv G. 

BrusC_Intel
Employee

Hello, AvivGraupen.


Thank you for letting me know.


I hope you encounter no problems with the new firmware provided.


I will keep the thread open and will follow up on August 25th just in case.


Regards,


Bruce C.

Intel Customer Support Technician


BrusC_Intel
Employee

Hello, AvivGraupen.


This post is to follow up on the status of your thread and check if everything is working fine.


I will follow up again on August 30th to provide additional time.


Regards,


Bruce C.

Intel Customer Support Technician


AvivGraupen
Employee

Hi All, 

 

Last Sunday NB (the end customer) tested firmware L0310200 on the Optane SSD P5800X (without updating the Supermicro server BIOS). So far they have reported no kernel crash, so it seems that updating the firmware solved the issue.

If any issues appear, I will update again.

Thank you.
Aviv G. 

BrusC_Intel
Employee

Hello, AvivGraupen.


Good day,


Thank you for letting us know.


I'm glad to hear that everything has been working fine so far.


I will follow up on August 31st just in case.


Regards,


Bruce C.

Intel Customer Support Technician


BrusC_Intel
Employee

Hello, AvivGraupen.


Good day,


This post is just a quick follow-up on the status of your thread.


I will follow up again on September 5th to provide additional time in case you want to keep the thread open.


Regards,


Bruce C.

Intel Customer Support Technician


AvivGraupen
Employee

Hi All, 

 

No kernel crash has been reported by the end customer, so it seems that updating the firmware (to L0310200) solved the issue.

If that changes, I will let you know.

Thank you.
Aviv G

BrusC_Intel
Employee

Hello, AvivGraupen.


Thank you for letting us know. I'm glad to hear that no more issues showed up.


Since that is the case, this thread will now be closed and no longer monitored by Intel support. If you require any assistance from Intel in the future, please open a new thread and reference this one, or contact us using any of the available support methods:

- https://www.intel.com/content/www/us/en/support/contact-intel.html


Regards,


Bruce C.

Intel Customer Support Technician

