
ICE kernel module starvation with DPDK and RT kernel

fedecive
New Contributor I

Dear Community,

I am working with an application that requires DPDK processes pinned to 2 cores isolated from the OS.
On my server, I have an Intel E810 card with 2 SFP+ interfaces used for the application. The OS is Ubuntu 20.04 with kernel version 6.2.16 patched for RT. The ICE driver version is 1.12.7 and the FW version is 4.40.

The application also requires SR-IOV virtualization on the interface enp1s0f0np0, with two different VFs connected, and uses the interface enp1s0f1np1 without SR-IOV.

After around one hour of running, the kernel complains about the following issue:

 

[Thu Jun 20 10:48:46 2024] INFO: task kworker/11:1:307 blocked for more than 606 seconds.
[Thu Jun 20 10:48:46 2024]       Tainted: G        W  O       6.2.16-rt3 #3
[Thu Jun 20 10:48:46 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jun 20 10:48:46 2024] task:kworker/11:1    state:D stack:0     pid:307   ppid:2      flags:0x00004000
[Thu Jun 20 10:48:46 2024] Workqueue: ice ice_service_task [ice]
[Thu Jun 20 10:48:46 2024] Call Trace:
[Thu Jun 20 10:48:46 2024]  <TASK>
[Thu Jun 20 10:48:46 2024]  __schedule+0x3bb/0x1650
[Thu Jun 20 10:48:46 2024]  ? preempt_schedule_thunk+0x1a/0x20
[Thu Jun 20 10:48:46 2024]  schedule+0x6f/0x120
[Thu Jun 20 10:48:46 2024]  synchronize_irq+0x7c/0xb0
[Thu Jun 20 10:48:46 2024]  ? __pfx_autoremove_wake_function+0x10/0x10
[Thu Jun 20 10:48:46 2024]  ice_vsi_dis_irq+0x17e/0x1a0 [ice]
[Thu Jun 20 10:48:46 2024]  ice_down+0x55/0x2e0 [ice]
[Thu Jun 20 10:48:46 2024]  ice_vsi_close+0xb8/0xc0 [ice]
[Thu Jun 20 10:48:46 2024]  ice_dis_vsi+0x4c/0x80 [ice]
[Thu Jun 20 10:48:46 2024]  ice_pf_dis_all_vsi.constprop.0+0x35/0xf0 [ice]
[Thu Jun 20 10:48:46 2024]  ice_prepare_for_reset+0x1c3/0x420 [ice]
[Thu Jun 20 10:48:46 2024]  ice_do_reset+0x35/0x150 [ice]
[Thu Jun 20 10:48:46 2024]  ice_service_task+0x489/0x19c0 [ice]
[Thu Jun 20 10:48:46 2024]  ? __schedule+0x3c3/0x1650
[Thu Jun 20 10:48:46 2024]  ? psi_avgs_work+0x65/0xd0
[Thu Jun 20 10:48:46 2024]  process_one_work+0x21c/0x490
[Thu Jun 20 10:48:46 2024]  worker_thread+0x54/0x3e0
[Thu Jun 20 10:48:46 2024]  ? __pfx_worker_thread+0x10/0x10
[Thu Jun 20 10:48:46 2024]  kthread+0x11c/0x140
[Thu Jun 20 10:48:46 2024]  ? __pfx_kthread+0x10/0x10
[Thu Jun 20 10:48:46 2024]  ret_from_fork+0x29/0x50
[Thu Jun 20 10:48:46 2024]  </TASK>
[Thu Jun 20 10:48:46 2024] INFO: task kworker/16:1:311 blocked for more than 606 seconds.
[Thu Jun 20 10:48:46 2024]       Tainted: G        W  O       6.2.16-rt3 #3
[Thu Jun 20 10:48:46 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jun 20 10:48:46 2024] task:kworker/16:1    state:D stack:0     pid:311   ppid:2      flags:0x00004000
[Thu Jun 20 10:48:46 2024] Workqueue: ipv6_addrconf addrconf_verify_work
[Thu Jun 20 10:48:46 2024] Call Trace:
[Thu Jun 20 10:48:46 2024]  <TASK>
[Thu Jun 20 10:48:46 2024]  __schedule+0x3bb/0x1650
[Thu Jun 20 10:48:46 2024]  schedule+0x6f/0x120
[Thu Jun 20 10:48:46 2024]  rt_mutex_slowlock_block.constprop.0+0x3a/0x190
[Thu Jun 20 10:48:46 2024]  __rt_mutex_slowlock.constprop.0+0x83/0x210
[Thu Jun 20 10:48:46 2024]  mutex_lock+0x91/0xb0
[Thu Jun 20 10:48:46 2024]  rtnl_lock+0x19/0x20
[Thu Jun 20 10:48:46 2024]  addrconf_verify_work+0x16/0x40
[Thu Jun 20 10:48:46 2024]  process_one_work+0x21c/0x490
[Thu Jun 20 10:48:46 2024]  worker_thread+0x54/0x3e0
[Thu Jun 20 10:48:46 2024]  ? __pfx_worker_thread+0x10/0x10
[Thu Jun 20 10:48:46 2024]  kthread+0x11c/0x140
[Thu Jun 20 10:48:46 2024]  ? __pfx_kthread+0x10/0x10
[Thu Jun 20 10:48:46 2024]  ret_from_fork+0x29/0x50
[Thu Jun 20 10:48:46 2024]  </TASK>
[Thu Jun 20 10:48:46 2024] INFO: task kworker/23:1:317 blocked for more than 606 seconds.
[Thu Jun 20 10:48:46 2024]       Tainted: G        W  O       6.2.16-rt3 #3
[Thu Jun 20 10:48:46 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jun 20 10:48:46 2024] task:kworker/23:1    state:D stack:0     pid:317   ppid:2      flags:0x00004000
[Thu Jun 20 10:48:46 2024] Workqueue: ipv6_addrconf addrconf_verify_work
[Thu Jun 20 10:48:46 2024] Call Trace:
[Thu Jun 20 10:48:46 2024]  <TASK>
[Thu Jun 20 10:48:46 2024]  __schedule+0x3bb/0x1650
[Thu Jun 20 10:48:46 2024]  ? update_load_avg+0x84/0x840
[Thu Jun 20 10:48:46 2024]  schedule+0x6f/0x120
[Thu Jun 20 10:48:46 2024]  rt_mutex_slowlock_block.constprop.0+0x3a/0x190
[Thu Jun 20 10:48:46 2024]  __rt_mutex_slowlock.constprop.0+0x83/0x210
[Thu Jun 20 10:48:46 2024]  mutex_lock+0x91/0xb0
[Thu Jun 20 10:48:46 2024]  rtnl_lock+0x19/0x20
[Thu Jun 20 10:48:46 2024]  addrconf_verify_work+0x16/0x40
[Thu Jun 20 10:48:46 2024]  process_one_work+0x21c/0x490
[Thu Jun 20 10:48:46 2024]  worker_thread+0x54/0x3e0
[Thu Jun 20 10:48:46 2024]  ? __pfx_worker_thread+0x10/0x10
[Thu Jun 20 10:48:46 2024]  kthread+0x11c/0x140
[Thu Jun 20 10:48:46 2024]  ? __pfx_kthread+0x10/0x10
[Thu Jun 20 10:48:46 2024]  ret_from_fork+0x29/0x50
[Thu Jun 20 10:48:46 2024]  </TASK>
[Thu Jun 20 10:48:46 2024] INFO: task ptp4l:1677 blocked for more than 606 seconds.
[Thu Jun 20 10:48:46 2024]       Tainted: G        W  O       6.2.16-rt3 #3
[Thu Jun 20 10:48:46 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jun 20 10:48:46 2024] task:ptp4l           state:D stack:0     pid:1677  ppid:1676   flags:0x00000002
[Thu Jun 20 10:48:46 2024] Call Trace:
[Thu Jun 20 10:48:46 2024]  <TASK>
[Thu Jun 20 10:48:46 2024]  __schedule+0x3bb/0x1650
[Thu Jun 20 10:48:46 2024]  ? debug_smp_processor_id+0x1b/0x30
[Thu Jun 20 10:48:46 2024]  ? migrate_enable+0xe7/0x170
[Thu Jun 20 10:48:46 2024]  schedule+0x6f/0x120
[Thu Jun 20 10:48:46 2024]  rt_mutex_slowlock_block.constprop.0+0x3a/0x190
[Thu Jun 20 10:48:46 2024]  __rt_mutex_slowlock.constprop.0+0x83/0x210
[Thu Jun 20 10:48:46 2024]  mutex_lock+0x91/0xb0
[Thu Jun 20 10:48:46 2024]  ? debug_smp_processor_id+0x1b/0x30
[Thu Jun 20 10:48:46 2024]  rtnl_lock+0x19/0x20
[Thu Jun 20 10:48:46 2024]  packet_release+0x12d/0x440
[Thu Jun 20 10:48:46 2024]  ? debug_smp_processor_id+0x1b/0x30
[Thu Jun 20 10:48:46 2024]  __sock_release+0x3f/0xc0
[Thu Jun 20 10:48:46 2024]  sock_close+0x1c/0x30
[Thu Jun 20 10:48:46 2024]  __fput+0x93/0x270
[Thu Jun 20 10:48:46 2024]  ____fput+0x12/0x20
[Thu Jun 20 10:48:46 2024]  task_work_run+0x62/0x90
[Thu Jun 20 10:48:46 2024]  exit_to_user_mode_prepare+0x1fc/0x210
[Thu Jun 20 10:48:46 2024]  syscall_exit_to_user_mode+0x20/0x50
[Thu Jun 20 10:48:46 2024]  do_syscall_64+0x6d/0x90
[Thu Jun 20 10:48:46 2024]  ? syscall_exit_to_user_mode+0x3f/0x50
[Thu Jun 20 10:48:46 2024]  ? do_syscall_64+0x6d/0x90
[Thu Jun 20 10:48:46 2024]  ? do_syscall_64+0x6d/0x90
[Thu Jun 20 10:48:46 2024]  ? __ct_user_enter+0xc1/0x1a0
[Thu Jun 20 10:48:46 2024]  ? syscall_exit_to_user_mode+0x3f/0x50
[Thu Jun 20 10:48:46 2024]  ? do_syscall_64+0x6d/0x90
[Thu Jun 20 10:48:46 2024]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[Thu Jun 20 10:48:46 2024] RIP: 0033:0x7f679ed38f67
[Thu Jun 20 10:48:46 2024] RSP: 002b:00007ffdf0adc6f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[Thu Jun 20 10:48:46 2024] RAX: 0000000000000000 RBX: 000055d943853df4 RCX: 00007f679ed38f67
[Thu Jun 20 10:48:46 2024] RDX: 000055d941e0f534 RSI: 000055d943853df4 RDI: 000000000000000e
[Thu Jun 20 10:48:46 2024] RBP: 000055d943853dfc R08: 0000000000000001 R09: 0000000000000000
[Thu Jun 20 10:48:46 2024] R10: 0000000000000000 R11: 0000000000000246 R12: 000055d943853e1c
[Thu Jun 20 10:48:46 2024] R13: 000055d943853df4 R14: 000055d941e109a8 R15: 00007ffdf0adc7f0
[Thu Jun 20 10:48:46 2024]  </TASK>
[Thu Jun 20 10:48:46 2024] INFO: task kworker/11:0:3943 blocked for more than 606 seconds.
[Thu Jun 20 10:48:46 2024]       Tainted: G        W  O       6.2.16-rt3 #3
[Thu Jun 20 10:48:46 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jun 20 10:48:46 2024] task:kworker/11:0    state:D stack:0     pid:3943  ppid:2      flags:0x00004000
[Thu Jun 20 10:48:46 2024] Workqueue: events linkwatch_event
[Thu Jun 20 10:48:46 2024] Call Trace:
[Thu Jun 20 10:48:46 2024]  <TASK>
[Thu Jun 20 10:48:46 2024]  __schedule+0x3bb/0x1650
[Thu Jun 20 10:48:46 2024]  ? raw_spin_rq_unlock+0x1f/0x70
[Thu Jun 20 10:48:46 2024]  ? rt_mutex_setprio+0x18d/0x4a0
[Thu Jun 20 10:48:46 2024]  schedule+0x6f/0x120
[Thu Jun 20 10:48:46 2024]  rt_mutex_slowlock_block.constprop.0+0x3a/0x190
[Thu Jun 20 10:48:46 2024]  __rt_mutex_slowlock.constprop.0+0x83/0x210
[Thu Jun 20 10:48:46 2024]  mutex_lock+0x91/0xb0
[Thu Jun 20 10:48:46 2024]  rtnl_lock+0x19/0x20
[Thu Jun 20 10:48:46 2024]  linkwatch_event+0x12/0x40
[Thu Jun 20 10:48:46 2024]  process_one_work+0x21c/0x490
[Thu Jun 20 10:48:46 2024]  worker_thread+0x54/0x3e0
[Thu Jun 20 10:48:46 2024]  ? __pfx_worker_thread+0x10/0x10
[Thu Jun 20 10:48:46 2024]  kthread+0x11c/0x140
[Thu Jun 20 10:48:46 2024]  ? __pfx_kthread+0x10/0x10
[Thu Jun 20 10:48:46 2024]  ret_from_fork+0x29/0x50
[Thu Jun 20 10:48:46 2024]  </TASK>
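A long hung-task dump like the one above is easier to read when it is reduced to "which task is waiting on what". The following is a minimal Python sketch, not a tool from Intel or the kernel tree; the function name `summarize_hung_tasks` and the set of frames treated as generic are assumptions made for illustration. It skips the scheduler and rt_mutex frames and reports the first remaining frame of each trace.

```python
import re
from collections import defaultdict

# Frames that only show a task going to sleep or entering a generic mutex path.
# They are skipped so the summary names the call that actually asked to wait.
# (This list is an assumption chosen for the traces in this thread.)
GENERIC_FRAMES = {
    "__schedule", "schedule",
    "rt_mutex_slowlock_block", "__rt_mutex_slowlock", "mutex_lock",
}

TASK_RE = re.compile(r"INFO: task (\S+?):(\d+) blocked for more than (\d+) seconds")
FRAME_RE = re.compile(r"\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_hung_tasks(log_text):
    """Map each wait point to the tasks whose trace first reaches it."""
    waits = defaultdict(list)
    current_task = None
    for line in log_text.splitlines():
        task = TASK_RE.search(line)
        if task:
            current_task = task.group(1)
            continue
        frame = FRAME_RE.search(line)
        if current_task is None or frame is None:
            continue
        if frame.group(1):
            continue  # frames prefixed with "?" are unreliable stack guesses
        name = frame.group(2).split(".")[0]  # drop suffixes such as .constprop.0
        if name in GENERIC_FRAMES:
            continue
        waits[name].append(current_task)
        current_task = None  # record one wait point per report
    return dict(waits)
```

Applied to the log above, this groups the ice_service_task worker under `synchronize_irq` and the other four tasks under `rtnl_lock`. One possible reading, which nothing in this thread confirms, is that the ice reset path holds the RTNL lock while it waits for an interrupt handler to finish, and everything else queues behind it.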

 

From the kernel log, it seems that the ice module is being starved.
A few minutes later, I also see another problem:

 

[Thu Jun 20 11:22:31 2024] ice 0000:01:00.1 enp1s0f1np1: tx_timeout: VSI_num: 14, Q 23, NTC: 0x5, HW_HEAD: 0x5, NTU: 0x6, INT: 0x0
[Thu Jun 20 11:22:31 2024] ice 0000:01:00.1 enp1s0f1np1: tx_timeout recovery level 3, txqueue 23
[Thu Jun 20 11:22:37 2024] ice 0000:01:00.1 enp1s0f1np1: tx_timeout: VSI_num: 14, Q 23, NTC: 0x5, HW_HEAD: 0x5, NTU: 0x6, INT: 0x0
[Thu Jun 20 11:22:37 2024] ice 0000:01:00.1 enp1s0f1np1: tx_timeout recovery level 4, txqueue 23
[Thu Jun 20 11:22:37 2024] ice 0000:01:00.1 enp1s0f1np1: tx_timeout recovery unsuccessful, device is in unrecoverable state.
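The ring indices in the `tx_timeout` lines can be compared numerically. Below is a small Python sketch that parses such a line and computes distances between indices. The helper names are hypothetical, and so is the reading of the fields: NTC as the driver's next-to-clean index, NTU as its next-to-use index, HW_HEAD as the hardware head, and INT as an interrupt register value are assumptions, not something stated in this thread. `ring_size` must be supplied by the caller, since the log does not print it.

```python
import re

TX_TIMEOUT_RE = re.compile(
    r"(?P<iface>\S+): tx_timeout: VSI_num: (?P<vsi>\d+), Q (?P<queue>\d+), "
    r"NTC: 0x(?P<ntc>[0-9a-f]+), HW_HEAD: 0x(?P<hw_head>[0-9a-f]+), "
    r"NTU: 0x(?P<ntu>[0-9a-f]+), INT: 0x(?P<intr>[0-9a-f]+)"
)


def parse_tx_timeout(line):
    """Return the fields of an ice tx_timeout line as a dict, or None if absent."""
    m = TX_TIMEOUT_RE.search(line)
    if m is None:
        return None
    fields = {"iface": m["iface"], "vsi": int(m["vsi"]), "queue": int(m["queue"])}
    for key in ("ntc", "hw_head", "ntu", "intr"):
        fields[key] = int(m[key], 16)  # the log prints these values in hex
    return fields


def ring_gap(ahead, behind, ring_size):
    """Distance from `behind` to `ahead` on a circular ring of `ring_size` slots."""
    return (ahead - behind) % ring_size
```

For the lines above, `ring_gap(hw_head, ntc, ring_size)` is 0 and `ring_gap(ntu, hw_head, ring_size)` is 1 for any ring larger than a few descriptors. Under the assumed field meanings, that would mean one descriptor was posted and the hardware head never moved past it.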

 

After 256 attempts, the ICE module fails with the following error and resets:

 

[Thu Jun 20 11:59:18 2024] ice 0000:01:00.1: Rebuild failed, unload and reload driver

 

The reset does not work properly: the interface enp1s0f1np1 is no longer detected by the kernel. Indeed, running ethtool on it returns an error:

 

~$ sudo ethtool enp1s0f1np1

netlink error: failed to retrieve link settings
netlink error: Input/output error
netlink error: failed to retrieve link settings
netlink error: Input/output error
Settings for enp1s0f1np1:
	Supports Wake-on: d
	Wake-on: d
        Current message level: 0x00000007 (7)
                               drv probe link
	Link detected: no

 


What is going on here? Do you have any idea how to resolve the issue?

Thank you

23 Replies
IntelSupport
Community Manager

Hello fedecive,


Greetings of the day!


Thank you for choosing Intel support. Before we proceed further, could you please elaborate on the issue that you are facing? 

Also, please confirm if the E810 is a controller or an adapter, and provide us with the system details.


Kindly refer to the link below to access details about the controller's overview: https://edc.intel.com/content/www/us/en/design/products/ethernet/config-guide-e810-dpdk/dpdk-overview/


Regards,

Amina

Intel Server Support


IntelSupport
Community Manager

Hello fedecive,  

 

Greetings for the day!  

 

I hope this message finds you well.  

 

We are following up to find out if you were able to find the information we provided. Please reply to confirm, so we can continue helping on a resolution. Looking forward to receiving your reply.  

 

Regards,  

Amina


fedecive
New Contributor I

Hi Amina,

thanks for the reply and sorry for the delay. I did not receive any notification about your message.

 

The NIC is an adapter (PN: E810-XXVDA4TGG1) with GPS, Galileo, and GLONASS support, so it can also serve as a PTP grandmaster. The OS is Ubuntu 22.04 and the DPDK version is 20.11.9. I am using the board to interface with an ORAN radio, thus DPDK supports the XRAN library.

Please let me know if you need further information.

 

Thank you.

IntelSupport
Community Manager

Hello Fedecive,

Greetings of the day!

Thank you for your response. Kindly refer to the link below and let us know if this helps you. If you need more details, we will check internally.

 https://edc.intel.com/content/www/us/en/design/products/ethernet/config-guide-e810-dpdk/dpdk-overview/

Regards,

Amina


fedecive
New Contributor I

Hi Amina,

 

thanks for the response.

 

The known issues section does not report a problem similar to the one I am experiencing. Of course, I have completed all the steps in the DPDK Installation and Configuration section. The issue I posted appears after a few dozen minutes and seems to start with starvation of the ICE kernel module. After that, the ICE kernel module clearly malfunctions, which prevents the Ethernet port from being detected again.

As indicated in sections 3.5, 3.6 and 3.7 of the guide, the cores are isolated and a tickless kernel is used to improve performance. How can the kernel module be starved if the kernel is separated from DPDK processing? In user space, I do not have any process that could starve the ICE module.

Could you please provide me with more details on this?

Thank you.
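One way to sanity-check the isolation described above is to compare the CPUs named on the kernel command line with the CPUs on which the blocked kworkers ran (11, 16 and 23 in the first log). The Python sketch below assumes isolation was configured with the usual `isolcpus=`, `nohz_full=` and `rcu_nocbs=` boot parameters; the thread does not show the actual command line, so the example string in the usage note is hypothetical. It handles only plain CPU lists such as `2-3,11`, not the stride or `all` forms.

```python
def parse_cpu_list(spec):
    """Expand a simple CPU list such as '2-3,11' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def isolated_cpus(cmdline):
    """Collect the CPU sets given by isolcpus=, nohz_full= and rcu_nocbs=."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in ("isolcpus", "nohz_full", "rcu_nocbs"):
            # isolcpus may start with flags (e.g. managed_irq,domain); keep
            # only the fields that begin with a digit.
            fields = [f for f in value.split(",") if f and f[0].isdigit()]
            found[key] = parse_cpu_list(",".join(fields))
    return found


def workers_on_isolated(cmdline, worker_cpus):
    """Return the worker CPUs that also appear in any isolation parameter."""
    isolated = set().union(*isolated_cpus(cmdline).values())
    return isolated & set(worker_cpus)
```

On the affected machine, the input would be the contents of `/proc/cmdline`. For a hypothetical command line containing `isolcpus=managed_irq,domain,10-11 nohz_full=10-11`, `workers_on_isolated(cmdline, {11, 16, 23})` returns `{11}`, which would indicate that a kernel worker was scheduled on a core meant to be reserved for DPDK.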

Azeem_Intel
Employee

Hello fedecive,


Greetings for the day!


Thank you for your response. Please allow us some time while we check internally.



Best regards,

Azeem_Intel


IntelSupport
Community Manager

Hello fedecive,


Greetings for the day!


Thank you for allowing us some time to check internally. Before we proceed further, I would like to confirm a few things with you.


What is the frequency of the event occurrence? (e.g., once an hour, daily, always?)

How many systems are affected by this issue?

Is your network adapter built directly onto your PC's motherboard?


Regards,

Amina



fedecive
New Contributor I

Hi Amina,

 

What is the frequency of the event occurrence? (e.g., once an hour, daily, always?)

So far, I would say once an hour.

 

How many systems are affected by this issue?

I have only one setup deployed.

 

Is your network adapter built directly onto your PC's motherboard?

Yes, it is. The motherboard is an Asus Prime X670-P and the Intel E810 is directly connected to a PCIe slot.

 

Thank you.

IntelSupport
Community Manager

Hello fedecive,


Greetings for the day!


Thank you for your response, kindly allow us some time while we check internally.


Regards,

Amina



IntelSupport
Community Manager

Hello Fedecive,


Greetings for the day!


Thank you for your patience. After checking with our team internally, we recommend that you try a compatible driver/FW/DPDK combination. If possible, we suggest updating to the latest compatible versions available. You can find a compatibility table in the E810 Feature Support Matrix. Below is the link for your reference:


https://www.intel.com/content/www/us/en/content-details/630155/intel-ethernet-controller-e810-feature-support-matrix.html


Regards,

Amina



IntelSupport
Community Manager

Hello Fedecive,


Thank you for contacting Intel.


This is the second follow-up regarding the reported issue. We're eager to ensure a swift resolution and would appreciate any updates or additional information you can provide.


Please feel free to respond to this email at your earliest convenience.


Regards,

Amina 


fedecive
New Contributor I

Hi Amina,

 

we are trying to update DPDK to version 23.07 (probably the best match given our setup and the compatibility matrix you shared). I will give you an update in the coming days.

 

Thank you.

IntelSupport
Community Manager

Hello Fedecive,


Greetings from Intel!


Thank you for the update, we will wait for your response.


Regards,

Amina



fedecive
New Contributor I

Hi Amina,

 

I updated DPDK to version 23.07, the closest to our setup.
The issue persists, and the time before it occurs has decreased: I now see the problem after a few minutes.

Here is the kernel log:

[Wed Jul  3 10:00:29 2024] ------------[ cut here ]------------
[Wed Jul  3 10:00:29 2024] NETDEV WATCHDOG: enp1s0f0np0 (ice): transmit queue 23 timed out
[Wed Jul  3 10:00:29 2024] WARNING: CPU: 12 PID: 143 at net/sched/sch_generic.c:525 dev_watchdog+0x245/0x250
[Wed Jul  3 10:00:29 2024] Modules linked in: sctp ip6_udp_tunnel udp_tunnel msr iavf(O) intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd binfmt_misc kvm nls_iso8859_1 snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi eeepc_wmi asus_wmi platform_profile ledtrig_audio i40e rapl sparse_keymap input_leds snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec ib_uverbs wmi_bmof snd_hda_core snd_hwdep snd_pcm joydev k10temp snd_timer snd ccp soundcore ib_core mac_hid dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdgpu hid_generic iommu_v2 drm_buddy gpu_sched i2c_algo_bit drm_ttm_helper ttm drm_display_helper cec rc_core usbhid drm_kms_helper hid syscopyarea sysfillrect sysimgblt ice(O) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 r8169 drm aesni_intel
[Wed Jul  3 10:00:29 2024]  crypto_simd cryptd nvme ahci nvme_core i2c_piix4 xhci_pci realtek libahci xhci_pci_renesas gnss video wmi gpio_amdpt gpio_generic
[Wed Jul  3 10:00:29 2024] CPU: 12 PID: 143 Comm: ktimers/12 Tainted: G           O       6.2.16-rt3 #3
[Wed Jul  3 10:00:29 2024] Hardware name: ASUS System Product Name/PRIME X670-P, BIOS 1222 02/24/2023
[Wed Jul  3 10:00:29 2024] RIP: 0010:dev_watchdog+0x245/0x250
[Wed Jul  3 10:00:29 2024] Code: ff e9 02 ff ff ff 4c 89 e7 c6 05 e6 40 44 01 01 e8 70 3a f8 ff 44 89 f1 4c 89 e6 48 c7 c7 68 01 60 95 48 89 c2 e8 7b a6 35 ff <0f> 0b e9 f3 fe ff ff 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90
[Wed Jul  3 10:00:29 2024] RSP: 0018:ffffbcefc0657d48 EFLAGS: 00010286
[Wed Jul  3 10:00:29 2024] RAX: 0000000000000000 RBX: ffff92dee2f354e8 RCX: 0000000000000000
[Wed Jul  3 10:00:29 2024] RDX: 0000000000000001 RSI: ffffffff954e9d61 RDI: 00000000ffffffff
[Wed Jul  3 10:00:29 2024] RBP: ffffbcefc0657d70 R08: 000000000000003f R09: 0000000000000000
[Wed Jul  3 10:00:29 2024] R10: 0000000000ffff0a R11: 0000000000000001 R12: ffff92dee2f35000
[Wed Jul  3 10:00:29 2024] R13: ffff92dee2f35420 R14: 0000000000000017 R15: 0000000000000000
[Wed Jul  3 10:00:29 2024] FS:  0000000000000000(0000) GS:ffff92e587700000(0000) knlGS:0000000000000000
[Wed Jul  3 10:00:29 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Jul  3 10:00:29 2024] CR2: 00007fc7478af010 CR3: 00000008495f8000 CR4: 0000000000750ee0
[Wed Jul  3 10:00:29 2024] PKRU: 55555554
[Wed Jul  3 10:00:29 2024] Call Trace:
[Wed Jul  3 10:00:29 2024]  <TASK>
[Wed Jul  3 10:00:29 2024]  ? __pfx_dev_watchdog+0x10/0x10
[Wed Jul  3 10:00:29 2024]  call_timer_fn+0x29/0x1b0
[Wed Jul  3 10:00:29 2024]  ? __pfx_dev_watchdog+0x10/0x10
[Wed Jul  3 10:00:29 2024]  __run_timers.part.0+0x24d/0x370
[Wed Jul  3 10:00:29 2024]  run_timer_softirq+0x43/0xb0
[Wed Jul  3 10:00:29 2024]  __do_softirq+0xf6/0x36f
[Wed Jul  3 10:00:29 2024]  run_timersd+0x67/0xb0
[Wed Jul  3 10:00:29 2024]  smpboot_thread_fn+0x1cf/0x2c0
[Wed Jul  3 10:00:29 2024]  ? __pfx_smpboot_thread_fn+0x10/0x10
[Wed Jul  3 10:00:29 2024]  kthread+0x11c/0x140
[Wed Jul  3 10:00:29 2024]  ? __pfx_kthread+0x10/0x10
[Wed Jul  3 10:00:29 2024]  ret_from_fork+0x29/0x50
[Wed Jul  3 10:00:29 2024]  </TASK>
[Wed Jul  3 10:00:29 2024] ---[ end trace 0000000000000000 ]---
[Wed Jul  3 10:00:29 2024] ice 0000:01:00.0 enp1s0f0np0: tx_timeout: VSI_num: 12, Q 23, NTC: 0x7b4, HW_HEAD: 0x7b7, NTU: 0x7b8, INT: 0x0
[Wed Jul  3 10:00:29 2024] ice 0000:01:00.0 enp1s0f0np0: tx_timeout recovery level 1, txqueue 23
[Wed Jul  3 10:13:11 2024] INFO: task kworker/19:1:314 blocked for more than 606 seconds.
[Wed Jul  3 10:13:11 2024]       Tainted: G        W  O       6.2.16-rt3 #3
[Wed Jul  3 10:13:11 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Jul  3 10:13:11 2024] task:kworker/19:1    state:D stack:0     pid:314   ppid:2      flags:0x00004000
[Wed Jul  3 10:13:11 2024] Workqueue: ipv6_addrconf addrconf_verify_work
[Wed Jul  3 10:13:11 2024] Call Trace:
[Wed Jul  3 10:13:11 2024]  <TASK>
[Wed Jul  3 10:13:11 2024]  __schedule+0x3bb/0x1650
[Wed Jul  3 10:13:11 2024]  schedule+0x6f/0x120
[Wed Jul  3 10:13:11 2024]  rt_mutex_slowlock_block.constprop.0+0x3a/0x190
[Wed Jul  3 10:13:11 2024]  __rt_mutex_slowlock.constprop.0+0x83/0x210
[Wed Jul  3 10:13:11 2024]  mutex_lock+0x91/0xb0
[Wed Jul  3 10:13:11 2024]  rtnl_lock+0x19/0x20
[Wed Jul  3 10:13:11 2024]  addrconf_verify_work+0x16/0x40
[Wed Jul  3 10:13:11 2024]  process_one_work+0x21c/0x490
[Wed Jul  3 10:13:11 2024]  worker_thread+0x54/0x3e0
[Wed Jul  3 10:13:11 2024]  ? __pfx_worker_thread+0x10/0x10
[Wed Jul  3 10:13:11 2024]  kthread+0x11c/0x140
[Wed Jul  3 10:13:11 2024]  ? __pfx_kthread+0x10/0x10
[Wed Jul  3 10:13:11 2024]  ret_from_fork+0x29/0x50
[Wed Jul  3 10:13:11 2024]  </TASK>
[Wed Jul  3 10:13:11 2024] INFO: task kworker/12:2:3601 blocked for more than 606 seconds.
[Wed Jul  3 10:13:11 2024]       Tainted: G        W  O       6.2.16-rt3 #3
[Wed Jul  3 10:13:11 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Jul  3 10:13:11 2024] task:kworker/12:2    state:D stack:0     pid:3601  ppid:2      flags:0x00004000
[Wed Jul  3 10:13:11 2024] Workqueue: events linkwatch_event
[Wed Jul  3 10:13:11 2024] Call Trace:
[Wed Jul  3 10:13:11 2024]  <TASK>
[Wed Jul  3 10:13:11 2024]  __schedule+0x3bb/0x1650
[Wed Jul  3 10:13:11 2024]  ? raw_spin_rq_unlock+0x1f/0x70
[Wed Jul  3 10:13:11 2024]  ? rt_mutex_setprio+0x18d/0x4a0
[Wed Jul  3 10:13:11 2024]  schedule+0x6f/0x120
[Wed Jul  3 10:13:11 2024]  rt_mutex_slowlock_block.constprop.0+0x3a/0x190
[Wed Jul  3 10:13:11 2024]  __rt_mutex_slowlock.constprop.0+0x83/0x210
[Wed Jul  3 10:13:11 2024]  mutex_lock+0x91/0xb0
[Wed Jul  3 10:13:11 2024]  rtnl_lock+0x19/0x20
[Wed Jul  3 10:13:11 2024]  linkwatch_event+0x12/0x40
[Wed Jul  3 10:13:11 2024]  process_one_work+0x21c/0x490
[Wed Jul  3 10:13:11 2024]  worker_thread+0x54/0x3e0
[Wed Jul  3 10:13:11 2024]  ? __pfx_worker_thread+0x10/0x10
[Wed Jul  3 10:13:11 2024]  kthread+0x11c/0x140
[Wed Jul  3 10:13:11 2024]  ? __pfx_kthread+0x10/0x10
[Wed Jul  3 10:13:11 2024]  ret_from_fork+0x29/0x50
[Wed Jul  3 10:13:11 2024]  </TASK>
[Wed Jul  3 10:13:11 2024] INFO: task ptp4l:128865 blocked for more than 606 seconds.
[Wed Jul  3 10:13:11 2024]       Tainted: G        W  O       6.2.16-rt3 #3
[Wed Jul  3 10:13:11 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Jul  3 10:13:11 2024] task:ptp4l           state:D stack:0     pid:128865 ppid:128864 flags:0x00000002
[Wed Jul  3 10:13:11 2024] Call Trace:
[Wed Jul  3 10:13:11 2024]  <TASK>
[Wed Jul  3 10:13:11 2024]  __schedule+0x3bb/0x1650
[Wed Jul  3 10:13:11 2024]  schedule+0x6f/0x120
[Wed Jul  3 10:13:11 2024]  rt_mutex_slowlock_block.constprop.0+0x3a/0x190
[Wed Jul  3 10:13:11 2024]  __rt_mutex_slowlock.constprop.0+0x83/0x210
[Wed Jul  3 10:13:11 2024]  mutex_lock+0x91/0xb0
[Wed Jul  3 10:13:11 2024]  rtnl_lock+0x19/0x20
[Wed Jul  3 10:13:11 2024]  packet_setsockopt+0xb58/0x1200
[Wed Jul  3 10:13:11 2024]  ? __ct_user_enter+0xc1/0x1a0
[Wed Jul  3 10:13:11 2024]  __sys_setsockopt+0xcf/0x1d0
[Wed Jul  3 10:13:11 2024]  __x64_sys_setsockopt+0x23/0x30
[Wed Jul  3 10:13:11 2024]  do_syscall_64+0x5d/0x90
[Wed Jul  3 10:13:11 2024]  ? syscall_exit_to_user_mode+0x3f/0x50
[Wed Jul  3 10:13:11 2024]  ? do_syscall_64+0x6d/0x90
[Wed Jul  3 10:13:11 2024]  ? do_syscall_64+0x6d/0x90
[Wed Jul  3 10:13:11 2024]  ? do_syscall_64+0x6d/0x90
[Wed Jul  3 10:13:11 2024]  ? do_syscall_64+0x6d/0x90
[Wed Jul  3 10:13:11 2024]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[Wed Jul  3 10:13:11 2024] RIP: 0033:0x7f381278face
[Wed Jul  3 10:13:11 2024] RSP: 002b:00007ffeeec68d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
[Wed Jul  3 10:13:11 2024] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f381278face
[Wed Jul  3 10:13:11 2024] RDX: 0000000000000001 RSI: 0000000000000107 RDI: 000000000000000e
[Wed Jul  3 10:13:11 2024] RBP: 0000000000000001 R08: 0000000000000010 R09: 0030706e30663073
[Wed Jul  3 10:13:11 2024] R10: 00007ffeeec68d30 R11: 0000000000000246 R12: 000000000000000e
[Wed Jul  3 10:13:11 2024] R13: 00007ffeeec68d30 R14: 00007ffeeec68ddc R15: 00007ffeeec68dd6
[Wed Jul  3 10:13:11 2024]  </TASK>
[Wed Jul  3 10:13:11 2024] INFO: task kworker/12:1:264428 blocked for more than 606 seconds.
[Wed Jul  3 10:13:11 2024]       Tainted: G        W  O       6.2.16-rt3 #3
[Wed Jul  3 10:13:11 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Jul  3 10:13:11 2024] task:kworker/12:1    state:D stack:0     pid:264428 ppid:2      flags:0x00004000
[Wed Jul  3 10:13:11 2024] Workqueue: ice ice_service_task [ice]
[Wed Jul  3 10:13:11 2024] Call Trace:
[Wed Jul  3 10:13:11 2024]  <TASK>
[Wed Jul  3 10:13:11 2024]  __schedule+0x3bb/0x1650
[Wed Jul  3 10:13:11 2024]  ? preempt_schedule_thunk+0x1a/0x20
[Wed Jul  3 10:13:11 2024]  schedule+0x6f/0x120
[Wed Jul  3 10:13:11 2024]  synchronize_irq+0x7c/0xb0
[Wed Jul  3 10:13:11 2024]  ? __pfx_autoremove_wake_function+0x10/0x10
[Wed Jul  3 10:13:11 2024]  ice_vsi_dis_irq+0x17e/0x1a0 [ice]
[Wed Jul  3 10:13:11 2024]  ice_down+0x55/0x2e0 [ice]
[Wed Jul  3 10:13:11 2024]  ice_vsi_close+0xb8/0xc0 [ice]
[Wed Jul  3 10:13:11 2024]  ice_dis_vsi+0x4c/0x80 [ice]
[Wed Jul  3 10:13:11 2024]  ice_pf_dis_all_vsi.constprop.0+0x35/0xf0 [ice]
[Wed Jul  3 10:13:11 2024]  ice_prepare_for_reset+0x1c3/0x420 [ice]
[Wed Jul  3 10:13:11 2024]  ice_do_reset+0x35/0x150 [ice]
[Wed Jul  3 10:13:11 2024]  ice_service_task+0x489/0x19c0 [ice]
[Wed Jul  3 10:13:11 2024]  ? __schedule+0x3c3/0x1650
[Wed Jul  3 10:13:11 2024]  ? queue_delayed_work_on+0x45/0x50
[Wed Jul  3 10:13:11 2024]  process_one_work+0x21c/0x490
[Wed Jul  3 10:13:11 2024]  worker_thread+0x54/0x3e0
[Wed Jul  3 10:13:11 2024]  ? __pfx_worker_thread+0x10/0x10
[Wed Jul  3 10:13:11 2024]  kthread+0x11c/0x140
[Wed Jul  3 10:13:11 2024]  ? __pfx_kthread+0x10/0x10
[Wed Jul  3 10:13:11 2024]  ret_from_fork+0x29/0x50
[Wed Jul  3 10:13:11 2024]  </TASK>
[Wed Jul  3 10:18:35 2024] ice 0000:01:00.0: PTP reset successful
[Wed Jul  3 10:18:35 2024] ice 0000:01:00.0: GNSS init successful
[Wed Jul  3 10:18:35 2024] ice 0000:01:00.0: VSI rebuilt. VSI index 0, type ICE_VSI_PF
[Wed Jul  3 10:18:35 2024] ice 0000:01:00.0: VSI rebuilt. VSI index 1, type ICE_VSI_CTRL
[Wed Jul  3 10:18:35 2024] ice 0000:01:00.0: VF 0 reset check timeout
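The spacing between the events in this log can be measured directly from the `dmesg -T` style timestamps. The Python sketch below is an illustration only; `parse_stamp` and `seconds_between` are hypothetical helper names, and the weekday and month abbreviations are parsed with the default C locale.

```python
import re
from datetime import datetime

# Matches the bracketed human-readable timestamp at the start of each line.
STAMP_RE = re.compile(r"^\[(\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\]")


def parse_stamp(line):
    """Return the timestamp of a log line as a datetime, or None if absent."""
    m = STAMP_RE.match(line)
    if m is None:
        return None
    text = " ".join(m.group(1).split())  # collapse padding before one-digit days
    return datetime.strptime(text, "%a %b %d %H:%M:%S %Y")


def seconds_between(earlier_line, later_line):
    """Elapsed seconds between the timestamps of two log lines."""
    return (parse_stamp(later_line) - parse_stamp(earlier_line)).total_seconds()
```

For the log above, the hung-task reports come 762 seconds after the first `tx_timeout` line, and the rebuild messages come 1086 seconds after it. Since each report says the task had already been blocked for more than 606 seconds, the blocking began no later than 10:03:05, within about two and a half minutes of the watchdog firing.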

Do you have any hints on how to resolve this?

 

Thank you.

IntelSupport
Community Manager

Hello fedecive,


Greetings for the day!


Thank you for your response. Please allow us some time while we check internally.


Best regards,

Amina



fedecive
New Contributor I

Hi Amina,

 

after searching the web, I found a patch for a problem that is very similar to mine. Here is the link to the patch: https://lore.kernel.org/lkml/20240611193837.4ffb2401@kernel.org/T/

 

It appears to involve concurrent access to the ring buffers, which causes a system crash very similar to my issue.

 

Another update concerns the ring buffer size: we increased it to the maximum allowed by the NIC (i.e., 8160). With this change, the system seems more stable, probably because it receives fewer interrupts thanks to the jumbo frames.
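For reference, a ring-size change of this kind is typically applied to a kernel-owned interface with `ethtool -G`. The Python sketch below only builds and validates the command; it assumes the change was made with ethtool on both the RX and TX rings, which the post does not specify, and `ring_resize_command` is a hypothetical helper name. The limit of 8160 is the value quoted above.

```python
import shlex

MAX_RING = 8160  # maximum ring size reported in this thread for the NIC


def ring_resize_command(iface, rx, tx, max_ring=MAX_RING):
    """Build the ethtool argument list that sets the RX and TX ring sizes."""
    for name, value in (("rx", rx), ("tx", tx)):
        if not 0 < value <= max_ring:
            raise ValueError(f"{name} ring size {value} is outside 1..{max_ring}")
    return ["ethtool", "-G", iface, "rx", str(rx), "tx", str(tx)]


if __name__ == "__main__":
    # Print the command instead of running it; applying it needs root.
    print(shlex.join(ring_resize_command("enp1s0f1np1", 8160, 8160)))
```

The current and maximum values can be read back with `ethtool -g` on the same interface before and after the change.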

 

Could you please give me some feedback on this?

 

Thank you.

fedecive
New Contributor I

Hi Amina,

 

any news on this thread? I provided all the information you asked for, but I have not received any hints from you.

 

Could you please let me know if Intel is still working on this topic?

 

Thank you.

Irwan_Intel
Moderator

Hello fedecive,


Apologies for the delay. Our engineering team is still working on this issue. We request the following additional information:

1. On which interface is DPDK running: PF or VF?

2. Does this issue occur only on the RT kernel?

3. Also, would it be possible to try the latest combination of our FW, driver and DPDK?


This could be related to a PTP timing issue with the RT kernel, and the errors are also coming from the ice driver.

Please check the kernel logs for the first occurrence of the crash. It would help to determine whether PTP is really causing it or whether the crash happened before PTP initialization.


Regards,

Irwan_Intel


fedecive
New Contributor I

Hi Irwan,

thank you for your reply. I will try to answer your questions:

  1. Here is the summary of the interface used for DPDK:

 

enp1s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9600 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 50:7c:6f:57:6a:54 brd ff:ff:ff:ff:ff:ff
vf 0 link/ether 00:11:22:33:44:66 brd ff:ff:ff:ff:ff:ff, vlan 100, spoof checking off, link-state auto, trust on
vf 1 link/ether 00:11:22:33:44:66 brd ff:ff:ff:ff:ff:ff, vlan 100, spoof checking off, link-state auto, trust on
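The VF lines in this `ip link` output can be checked mechanically, for example to list each VF's MAC and VLAN and to flag VFs that share a MAC address. The Python sketch below is illustrative; `parse_vfs` and `shared_macs` are hypothetical helper names. Whether the shared address is intentional in this setup is not stated in the thread.

```python
import re
from collections import defaultdict

VF_RE = re.compile(r"^\s*vf (\d+)\s+link/ether ([0-9a-f]{2}(?::[0-9a-f]{2}){5})")
VLAN_RE = re.compile(r"\bvlan (\d+)")


def parse_vfs(ip_link_text):
    """Extract VF index, MAC and VLAN from the 'vf N ...' lines of ip link output."""
    vfs = []
    for line in ip_link_text.splitlines():
        m = VF_RE.match(line)
        if m is None:
            continue  # not a VF line (e.g. the PF's own link/ether line)
        vlan = VLAN_RE.search(line)
        vfs.append({
            "vf": int(m.group(1)),
            "mac": m.group(2),
            "vlan": int(vlan.group(1)) if vlan else None,
        })
    return vfs


def shared_macs(vfs):
    """Return {mac: [vf indices]} for every MAC used by more than one VF."""
    by_mac = defaultdict(list)
    for vf in vfs:
        by_mac[vf["mac"]].append(vf["vf"])
    return {mac: ids for mac, ids in by_mac.items() if len(ids) > 1}
```

For the output above, `shared_macs(parse_vfs(text))` reports that VF 0 and VF 1 both use 00:11:22:33:44:66 on VLAN 100.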

 

Hope this helps.

IntelSupport
Community Manager

Hello fedecive,


Greetings for the day!


Thank you for your response. Please keep us updated with the outcome once you have completed the task, so we can proceed with our next plan of action.


Best regards,

Amina

