Intel® QuickAssist Technology (Intel® QAT)

IPSEC Tunnel XFRM VTI + QAT LKCF rcu_sched stall

SmithConnect
Beginner

Greetings, 

 

I am experiencing an issue when using QAT in conjunction with IKEv2 IPsec site-to-site tunnels built with strongSwan. The issue is reproducible after a large amount of traffic is sent over the tunnel, typically after running a test such as "iperf -c x.x.x.x -P 25 -t 60". The system ultimately begins reporting a soft lockup on a CPU and must be restarted to restore functionality.

I can confirm this issue only occurs when intel_qat is loaded on the system, as disabling the module removes the issue entirely (along with the speed benefits of using QAT).
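For anyone trying to reproduce the comparison, one way to take QAT out of the picture for an A/B run (module names per the lsmod output further down; not necessarily the exact steps I used, and the blacklist file name is just an example) is:

# Bring the tunnels down first, then unload the QAT VF and core modules
modprobe -r qat_c62xvf intel_qat
# Or keep the modules from loading at boot:
printf 'blacklist qat_c62xvf\nblacklist intel_qat\n' > /etc/modprobe.d/disable-qat.conf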

 

(attached screenshot: lockup.png)

 

 

 

Jan 25 23:41:26 debian kernel: [13816.626750] vmxnet3 0000:13:00.0 eth1: NETDEV WATCHDOG: CPU: 1: transmit queue 2 timed out 12347449 ms
Jan 25 23:41:26 debian kernel: [13816.629901] vmxnet3 0000:13:00.0 eth1: tx hang
Jan 25 23:41:26 debian kernel: [13816.632882] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 0-.... } 12343656 jiffies s: 545 root: 0x1/.
Jan 25 23:41:26 debian kernel: [13816.635987] rcu: blocking rcu_node structures (internal RCU debug):
Jan 25 23:41:26 debian kernel: [13816.639006] Sending NMI from CPU 1 to CPUs 0:
Jan 25 23:41:26 debian kernel: [13816.639036] NMI backtrace for cpu 0
Jan 25 23:41:26 debian kernel: [13816.639040] CPU: 0 PID: 56 Comm: kworker/0:1H Tainted: G           O L     6.6.69-vyos #1
Jan 25 23:41:26 debian kernel: [13816.639046] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.18227214.B64.2106252220 06/25/2021
Jan 25 23:41:26 debian kernel: [13816.639050] Workqueue: adf_vf_resp_wq_ adf_response_handler_wq [intel_qat]
Jan 25 23:41:26 debian kernel: [13816.639162] RIP: 0010:native_queued_spin_lock_slowpath+0x2b/0x260
Jan 25 23:41:26 debian kernel: [13816.639173] Code: 55 41 54 55 53 48 89 fb 66 90 ba 01 00 00 00 8b 03 85 c0 75 13 f0 0f b1 13 85 c0 75 f2 5b 5d 41 5c 41 5d c3 cc cc cc cc f3 90 <eb> e3 81 fe 00 01 00 00 74 4a 81 fe ff 00 00 00 77 77 f0 0f ba 2b
Jan 25 23:41:26 debian kernel: [13816.639177] RSP: 0018:ffffb44d80003ae8 EFLAGS: 00000202
Jan 25 23:41:26 debian kernel: [13816.639182] RAX: 0000000000000001 RBX: ffff8ce4b44af34c RCX: ffff8ce4b44af348
Jan 25 23:41:26 debian kernel: [13816.639186] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8ce4b44af34c
Jan 25 23:41:26 debian kernel: [13816.639189] RBP: ffffb44d80003b78 R08: 0000000069d34ec2 R09: 000000000000000a
Jan 25 23:41:26 debian kernel: [13816.639193] R10: 0000000000000004 R11: ffff8ce4848f33c8 R12: 0000000000000002
Jan 25 23:41:26 debian kernel: [13816.639197] R13: 000000000000000a R14: ffff8ce4b44af34c R15: ffff8ce4b44af300
Jan 25 23:41:26 debian kernel: [13816.639201] FS:  0000000000000000(0000) GS:ffff8ce4bbc00000(0000) knlGS:0000000000000000
Jan 25 23:41:26 debian kernel: [13816.639205] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 25 23:41:26 debian kernel: [13816.639209] CR2: 0000559076aec3c0 CR3: 0000000105c50003 CR4: 00000000003706f0
Jan 25 23:41:26 debian kernel: [13816.639218] Call Trace:
Jan 25 23:41:26 debian kernel: [13816.639221]  <NMI>
Jan 25 23:41:26 debian kernel: [13816.639238]  ? nmi_cpu_backtrace+0x95/0x110
Jan 25 23:41:26 debian kernel: [13816.639245]  ? nmi_cpu_backtrace_handler+0x8/0x10
Jan 25 23:41:26 debian kernel: [13816.639253]  ? nmi_handle+0x4e/0x120
Jan 25 23:41:26 debian kernel: [13816.639261]  ? default_do_nmi+0x44/0x250
Jan 25 23:41:26 debian kernel: [13816.639267]  ? exc_nmi+0xfe/0x130
Jan 25 23:41:26 debian kernel: [13816.639273]  ? end_repeat_nmi+0x16/0x67
Jan 25 23:41:26 debian kernel: [13816.639286]  ? native_queued_spin_lock_slowpath+0x2b/0x260
Jan 25 23:41:26 debian kernel: [13816.639294]  ? native_queued_spin_lock_slowpath+0x2b/0x260
Jan 25 23:41:26 debian kernel: [13816.639302]  ? native_queued_spin_lock_slowpath+0x2b/0x260
Jan 25 23:41:26 debian kernel: [13816.639310]  </NMI>
Jan 25 23:41:26 debian kernel: [13816.639311]  <IRQ>
Jan 25 23:41:26 debian kernel: [13816.639314]  _raw_spin_lock+0x19/0x20
Jan 25 23:41:26 debian kernel: [13816.639320]  xfrm_input+0x1dc/0x11f0
Jan 25 23:41:26 debian kernel: [13816.639333]  xfrm6_rcv_encap+0xec/0x1e0
Jan 25 23:41:26 debian kernel: [13816.639344]  ? __pfx_xfrm6_udp_encap_rcv+0x10/0x10
Jan 25 23:41:26 debian kernel: [13816.639352]  udpv6_queue_rcv_one_skb+0x259/0x520
Jan 25 23:41:26 debian kernel: [13816.639361]  udp6_unicast_rcv_skb+0x40/0xa0
Jan 25 23:41:26 debian kernel: [13816.639369]  ip6_protocol_deliver_rcu+0x181/0x480
Jan 25 23:41:26 debian kernel: [13816.639377]  ip6_input_finish+0x35/0x60
Jan 25 23:41:26 debian kernel: [13816.639384]  ip6_sublist_rcv_finish+0x54/0x90
Jan 25 23:41:26 debian kernel: [13816.639392]  ip6_sublist_rcv+0x236/0x2d0
Jan 25 23:41:26 debian kernel: [13816.639399]  ? __pfx_ip6_rcv_finish+0x10/0x10
Jan 25 23:41:26 debian kernel: [13816.639407]  ipv6_list_rcv+0x136/0x160
Jan 25 23:41:26 debian kernel: [13816.639416]  __netif_receive_skb_list_core+0x1f1/0x2c0
Jan 25 23:41:26 debian kernel: [13816.639429]  netif_receive_skb_list_internal+0x1a7/0x2d0
Jan 25 23:41:26 debian kernel: [13816.639437]  napi_complete_done+0x69/0x1a0
Jan 25 23:41:26 debian kernel: [13816.639443]  vmxnet3_poll_rx_only+0x7b/0xa0 [vmxnet3]
Jan 25 23:41:26 debian kernel: [13816.639480]  __napi_poll+0x23/0x1a0
Jan 25 23:41:26 debian kernel: [13816.639485]  net_rx_action+0x141/0x2c0
Jan 25 23:41:26 debian kernel: [13816.639490]  ? __napi_schedule+0xa7/0xb0
Jan 25 23:41:26 debian kernel: [13816.639498]  handle_softirqs+0xd2/0x280
Jan 25 23:41:26 debian kernel: [13816.639505]  __irq_exit_rcu+0x68/0x90
Jan 25 23:41:26 debian kernel: [13816.639509]  common_interrupt+0x7a/0xa0
Jan 25 23:41:26 debian kernel: [13816.639515]  </IRQ>
Jan 25 23:41:26 debian kernel: [13816.639516]  <TASK>
Jan 25 23:41:26 debian kernel: [13816.639518]  asm_common_interrupt+0x22/0x40
Jan 25 23:41:26 debian kernel: [13816.639526] RIP: 0010:netlink_has_listeners+0x2e/0x60
Jan 25 23:41:26 debian kernel: [13816.639535] Code: 03 00 00 a8 01 74 44 0f b7 87 04 02 00 00 48 8d 14 40 48 8d 04 90 31 d2 48 c1 e0 04 48 03 05 39 d8 d3 00 48 8b 88 90 00 00 00 <48> 85 c9 74 15 83 ee 01 3b b0 9c 00 00 00 73 0a 31 d2 48 0f a3 71
Jan 25 23:41:26 debian kernel: [13816.639540] RSP: 0018:ffffb44d8069fd48 EFLAGS: 00000282
Jan 25 23:41:26 debian kernel: [13816.639543] RAX: ffff8ce4803ca4e0 RBX: ffff8ce4b44af300 RCX: ffff8ce4b442b540
Jan 25 23:41:26 debian kernel: [13816.639547] RDX: 0000000000000000 RSI: 0000000000000005 RDI: ffff8ce4b234a000
Jan 25 23:41:26 debian kernel: [13816.639551] RBP: ffffb44d8069fdf0 R08: 0000000000000000 R09: 0000000000004bb7
Jan 25 23:41:26 debian kernel: [13816.639554] R10: 0000000000000010 R11: ffffffff86427f80 R12: 00000000b74b0000
Jan 25 23:41:26 debian kernel: [13816.639558] R13: 000000000000000a R14: ffff8ce4b44af34c R15: ffff8ce4b44af300
Jan 25 23:41:26 debian kernel: [13816.639565]  ? skb_copy_bits+0x1da/0x210
Jan 25 23:41:26 debian kernel: [13816.639571]  xfrm_replay_advance+0xf8/0x360
Jan 25 23:41:26 debian kernel: [13816.639582]  xfrm_input+0x4ce/0x11f0
Jan 25 23:41:26 debian kernel: [13816.639592]  qat_alg_callback+0x15/0x30 [intel_qat]
Jan 25 23:41:26 debian kernel: [13816.639711]  adf_handle_response+0x3d/0xc0 [intel_qat]
Jan 25 23:41:26 debian kernel: [13816.639815]  adf_response_handler_wq+0x6c/0xc0 [intel_qat]
Jan 25 23:41:26 debian kernel: [13816.639928]  process_one_work+0x175/0x310
Jan 25 23:41:26 debian kernel: [13816.639936]  worker_thread+0x279/0x3a0
Jan 25 23:41:26 debian kernel: [13816.639944]  ? __pfx_worker_thread+0x10/0x10
Jan 25 23:41:26 debian kernel: [13816.639949]  kthread+0xc4/0xf0
Jan 25 23:41:26 debian kernel: [13816.639959]  ? __pfx_kthread+0x10/0x10
Jan 25 23:41:26 debian kernel: [13816.639968]  ret_from_fork+0x28/0x40
Jan 25 23:41:26 debian kernel: [13816.639974]  ? __pfx_kthread+0x10/0x10
Jan 25 23:41:26 debian kernel: [13816.639983]  ret_from_fork_asm+0x1b/0x30
Jan 25 23:41:26 debian kernel: [13816.639995]  </TASK>

 

 

 

Some details regarding the setup in place:

 

OS: Debian 12

 

Driver Version: 4.27.0-00006

 

Configure:

 

 

 

./configure --enable-kapi --enable-qat-lkcf --enable-icp-sriov=guest
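(For completeness, the configure line above is one step of the usual out-of-tree driver build; a rough sketch of the full flow, assuming the standard steps from the package README, is:)

./configure --enable-kapi --enable-qat-lkcf --enable-icp-sriov=guest
make -j"$(nproc)"    # build the kernel modules and user-space components
make install         # install the modules and supporting services (see the package README for details)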

 

 

 

 

lsmod | grep qat

 

 

 

root@debian:~$ lsmod | grep qat
qat_c62xvf             32768  2
intel_qat             401408  7 qat_c62xvf
uio                    28672  1 intel_qat

 

 

 

 

service qat_service status

 

 

 

root@debian:~$ service qat_service status
○ qat_service.service - LSB: modprobe the QAT modules, which loads dependant modules, before calling the user space utility to pass configuration parameters
     Loaded: loaded (/etc/init.d/qat_service; generated)
     Active: inactive (dead)
       Docs: man:systemd-sysv-generator(8)

 

 

 

 cat /proc/crypto | grep qat

 

 

 

root@debian:~$ cat /proc/crypto | grep qat
driver       : echainiv(qat_aes_cbc_hmac_sha256)
driver       : rfc3686(qat_aes_ctr)
driver       : pkcs1pad(qat-rsa,sha512)
driver       : qat-rsa
module       : intel_qat
driver       : qat_aes_gcm
module       : intel_qat
driver       : qat_aes_cbc_hmac_sha512
module       : intel_qat
driver       : qat_aes_cbc_hmac_sha256
module       : intel_qat
driver       : qat_aes_xts
module       : intel_qat
driver       : qat_aes_ctr
module       : intel_qat
driver       : qat_aes_cbc
module       : intel_qat
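For what it is worth, whether these QAT implementations are the ones the kernel actually selects for the tunnel depends on their registered priority relative to the software implementations; the priority field can be checked with, for example:

# Print the /proc/crypto entries (including priority) for the QAT AES-CBC + HMAC-SHA256 driver
grep -A 6 'driver.*qat_aes_cbc_hmac_sha256' /proc/crypto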

 

 

 

/etc/c6xxvf_dev0.conf

 

 

 

#  version: QAT.L.4.27.0-00006
################################################################
[GENERAL]
ServicesEnabled = cy;dc

ConfigVersion = 2

#Default values for number of concurrent requests*/
CyNumConcurrentSymRequests = 512
CyNumConcurrentAsymRequests = 64

#Statistics, valid values: 1,0
statsGeneral = 1
statsDh = 1
statsDrbg = 1
statsDsa = 1
statsEcc = 1
statsKeyGen = 1
statsDc = 1
statsLn = 1
statsPrime = 1
statsRsa = 1
statsSym = 1

##############################################
# Kernel Instances Section
##############################################
[KERNEL]
NumberCyInstances = 1
NumberDcInstances = 1

# Crypto - Kernel instance #0
Cy0Name = "IPSec0"
Cy0IsPolled = 0
Cy0CoreAffinity = 0

# Data Compression - Kernel instance #0
Dc0Name = "Dc1"
Dc0IsPolled = 0
# List of core affinities
Dc0CoreAffinity = 0

##############################################
# User Process Instance Section
##############################################
[SSL]
NumberCyInstances = 0
NumberDcInstances = 0
NumProcesses = 1
LimitDevAccess = 0

 

 

 

I look forward to any suggestions from the community on how to troubleshoot or further pinpoint the root cause of the crashes.

Ronny_G_Intel
Moderator

Hi SmithConnect,


The log entries you provided indicate a series of issues related to Intel QuickAssist Technology (QAT) and the network driver (vmxnet3).

The key messages include a network transmit queue timeout, a CPU stall detected by RCU (Read-Copy-Update), and a backtrace involving the QAT driver. 


1. Network Driver (vmxnet3) Issues

Transmit Queue Timeout: The message NETDEV WATCHDOG: CPU: 1: transmit queue 2 timed out indicates that the network driver (vmxnet3) is experiencing a transmit queue timeout. This can be caused by high network load, driver issues, or hardware problems.


2. RCU (Read-Copy-Update) Stalls

RCU Stalls: The message rcu_sched detected expedited stalls on CPUs/tasks indicates that the RCU subsystem detected a stall, which can be caused by long-running tasks or high CPU load.


3. Intel QAT Driver Issues

QAT Driver Backtrace: The backtrace involving the QAT driver (adf_response_handler_wq) suggests that there may be an issue with the QAT driver or its interaction with the system.

Issues 1 and 2 can both affect the functionality of QAT:


If the network driver is experiencing timeouts or other issues, it can lead to delays or failures in data transmission. Since Intel QAT often handles cryptographic operations for network traffic, any disruption in the network driver can directly impact the performance and reliability of QAT operations.

RCU stalls indicate that the system is experiencing high CPU load or long-running tasks, which can affect the overall system performance. Intel QAT relies on the CPU for processing cryptographic operations, and any CPU-related issues can degrade the performance of QAT. High CPU load can also lead to increased latency and reduced throughput for QAT operations.


I would recommend addressing the first two issues and then checking whether QAT still behaves the same.

If the issue persists after trying the recommendations above please share the following data:


1. icp_dump log files. To generate these, run the script located at $ICP_ROOT/quickassist/utilities/debug_tool/icp_dump.sh. This will create a tar file containing your full system setup, including configuration files (see the sketch after this list).


2. config.log file. It should be located in the $ICP_ROOT/ directory.


3. dmesg log file, same log that you just provided.
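For item 1, a rough sketch of generating the archive (assuming ICP_ROOT points at the driver source tree; check the script's own usage output for any required arguments):

cd "$ICP_ROOT/quickassist/utilities/debug_tool"
./icp_dump.sh    # creates a tar file containing the full system setup and configuration files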


Thanks,

Ronny G


SmithConnect
Beginner

Greetings Ronny,

 

While I appreciate it, the response provided does not give me any truly actionable steps.

 

"If the network driver is experiencing timeouts or other issues, it can lead to delays or failures in data transmission. Since Intel QAT often handles cryptographic operations for network traffic, any disruption in the network driver can directly impact the performance and reliability of QAT operations."

I can run the same amount of traffic through a tunnel with the QAT module uninstalled and I do not experience any CPU stalls or crashes. If the network driver were at fault, the problem should still persist in this alternate configuration.

 

"RCU stalls indicate that the system is experiencing high CPU load or long-running tasks, which can affect the overall system performance. Intel QAT relies on the CPU for processing cryptographic operations, and any CPU-related issues can degrade the performance of QAT. High CPU load can also lead to increased latency and reduced throughput for QAT operations"

CPU usage does increase, as would be expected while the CPU handles the interrupts required for QAT operations. However, the issue at hand is not increased latency or reduced throughput, but rather that the machine stops processing anything at all.

(attached screenshot: SmithConnect_1-1738204138282.png)

 

 

 

 

Ronny_G_Intel
Moderator

Hi SmithConnect,


Thank you for providing the icp_dump and config.log files. I will review them and get back to you as soon as possible.

I have one more question: Based on your previous response, can I assume that if the QAT module is uninstalled, the vmxnet3 timeout issues and RCU stalls do not occur?


Regards,

Ronny G


SmithConnect
Beginner

Greetings Ronny,

 

That assumption is correct. When the QAT module is not installed, there are no stability issues.

 

Regards,

SmithConnect
Beginner

Greetings Ronny,

 

Is there any additional information I can provide to continue aiding in your review of this issue?

 

Regards,

Ronny_G_Intel
Moderator

Hi SmithConnect,


I checked your configuration files and logs and couldn't really detect any particular issue.

Please confirm that you are mainly using VFs (virtual functions); see below:


There is 4 QAT acceleration device(s) in the system:

qat_dev0 - type: c6xxvf, inst_id: 0, node_id: 0, bsf: 0000:05:00.0, #accel: 1 #engines: 1 state: up

qat_dev1 - type: c6xxvf, inst_id: 1, node_id: 0, bsf: 0000:0c:00.0, #accel: 1 #engines: 1 state: up

qat_dev2 - type: c6xxvf, inst_id: 2, node_id: 0, bsf: 0000:14:00.0, #accel: 1 #engines: 1 state: up

qat_dev3 - type: c6xxvf, inst_id: 3, node_id: 0, bsf: 0000:1c:00.0, #accel: 1 #engines: 1 state: up
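(For reference, a device listing in this format is typically produced by the adf_ctl utility shipped with the driver package; that it was captured this way is an assumption on my part:)

adf_ctl status    # list all QAT devices with their type, BDF and state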


And confirm that intel_iommu is not enabled; your kernel command line is: BOOT_IMAGE=/boot/1.5-rolling-202501151018/vmlinuz boot=live rootdelay=5 noautologin net.ifnames=0 biosdevname=0 vyos-union=/boot/1.5-rolling-202501151018 console=tty0


I am running this issue by the QAT team and will get back to you as soon as possible.


Regards,

Ronny G



Ronny_G_Intel
Moderator

Hi SmithConnect,


When running Intel QAT in a virtual machine, you must have IOMMU (Input/Output Memory Management Unit) enabled by setting "iommu=on" in your system configuration, as IOMMU is crucial for properly managing memory access for the QAT device within a virtualized environment. 

Based on the configuration, it seems that you are operating within a VM. Intel IOMMU needs to be enabled on the host side.
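As a hypothetical example of what that looks like on a Linux/KVM host using GRUB (the parameters below are the usual ones, not taken from your setup; on an ESXi host the equivalent is enabling VT-d in the host BIOS/platform settings rather than a kernel parameter):

# /etc/default/grub on the host (hypothetical example; keep your existing options)
GRUB_CMDLINE_LINUX_DEFAULT="<existing options> intel_iommu=on iommu=pt"

update-grub && reboot    # regenerate the GRUB configuration and reboot the host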


On the other hand, is this a new issue? Was this setup functioning properly before and then stopped, or is this a new setup where the issue has just been identified?

I reviewed the provided dmesg file, and it contains only six QAT references. I didn't notice any errors in this log. Do you have any logs that include errors?


Thanks,

Ronny G

