Re: rings_reserved leaks in QAT20

paulinusking · ‎07-03-2025

QAT 20, version=1.2.30.00090

This is the current rings_reserved output:

cat /sys/class/uio/uio11/device/uio_ctrl/bundle_*/rings_reserved
0x0003: PID 117319, rings 0x0003.
0x0003: PID 153435, rings 0x0003.
0x0003: PID 153435, rings 0x0003.
0x0003: PID 166943, rings 0x0003.
0x0003: PID 166943, rings 0x0003.
0x0003: PID 166943, rings 0x0003.
0x0003: PID 166943, rings 0x0003.

But these processes have already exited; they can no longer be found.
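
The check described above (which reserved-ring owners are still alive) could be scripted roughly as follows. This is a minimal sketch, not part of the QAT tooling: the function names are hypothetical, the line format is assumed to match the output shown above, and a PID that exists may simply have been reused by an unrelated process, so a "live" result is only a hint.

```python
import glob
import os
import re

# Matches lines such as "0x0003: PID 117319, rings 0x0003."
# (format assumed from the output quoted in this thread)
LINE_RE = re.compile(r"PID\s+(\d+),\s+rings\s+(0x[0-9a-fA-F]+)")


def read_rings_reserved(pattern):
    """Concatenate the contents of every file matching the glob pattern."""
    chunks = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            chunks.append(f.read())
    return "".join(chunks)


def parse_reservations(text):
    """Return (pid, ring_mask) pairs found in rings_reserved output."""
    return [(int(pid), int(mask, 16)) for pid, mask in LINE_RE.findall(text)]


def pid_exists(pid):
    """True if a process with this PID currently exists (Linux /proc check)."""
    return os.path.isdir(f"/proc/{pid}")


def stale_reservations(text, exists=pid_exists):
    """Reservations whose recorded owner PID is not running any more."""
    return [(pid, mask) for pid, mask in parse_reservations(text) if not exists(pid)]
```

For example, stale_reservations(read_rings_reserved("/sys/class/uio/uio11/device/uio_ctrl/bundle_*/rings_reserved")) would list the entries whose owner has exited.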

This situation makes new icp_sal_userStart calls fail. The error messages are:

kernel:

kernel: QAT: Bundle 36, rings 0x0001 already reserved
kernel: QAT: Bundle 48, rings 0x0001 already reserved
kernel: QAT: Bundle 36, rings 0x0001 already reserved
kernel: QAT: Bundle 48, rings 0x0001 already reserved

user:

[error] SalCtrl_CompressionInit() - : Failed to create DC TX handle
[error] SalCtrl_ServiceInit() - : Failed to initialise all service instances
ADF_UIO_PROXY err: adf_user_subsystemInit: Failed to initialise Subservice SAL
[error] SalCtrl_ServiceEventStart() - : Private data is NULL

I think the rings are leaked, because I cannot find any other process using the QAT UIO devices now.

Ronny_G_Intel · ‎07-07-2025

Hi paulinusking,

Thanks for reaching out to Intel Communities.

I see that you are using QAT version 2.0 and running the latest available driver, version 1.2.30-00090.

You are getting the below error:

[error] SalCtrl_CompressionInit() - : Failed to create DC TX handle

[error] SalCtrl_ServiceInit() - : Failed to initialise all service instances

ADF_UIO_PROXY err: adf_user_subsystemInit: Failed to initialise Subservice SAL

[error] SalCtrl_ServiceEventStart() - : Private data is NULL

With this information I can think of 2 possible scenarios:

1. When running openssl

Is this happening when trying to run openssl (or an openssl-based application) with QAT_engine with the USDM driver with huge pages?

If this is the case, please ensure that huge pages are created.

#cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages (should be greater than zero)

If the number of huge pages is zero, as an example, they can be increased temporarily as follows:

#echo 1024 > /proc/sys/vm/nr_hugepages

2. When running cpa_sample code

Is this happening when running the cpa_sample code, where it runs but fails with the error that you mentioned, or shows unusual errors with VFs?

If none of the scenarios mentioned above is applicable, please provide details regarding how to replicate this issue and include the icp_dump.

To generate the icp_dump, run the script located at $ICP_ROOT/quickassist/utilities/debug_tool/icp_dump.sh.

This will create a tar file containing your full system setup, including configuration files.

Regards,

Ronny G

paulinusking · ‎07-08-2025

1. We are using hugepages, and the hugepage count is not zero. QAT worked normally for a few days, then stopped working and has not recovered.

2. After QAT became abnormal, I wrote a small test program to check it:

    auto status = qaeMemInit();
    if (status != CPA_STATUS_SUCCESS) {
        printf("qaeMemInit %d\n", status);
        return -1;
    }

    status = icp_sal_userStart("DATANODE_5008");
    if (status != CPA_STATUS_SUCCESS) {
        printf("start failed %d\n", status); // this is the call that fails
        qaeMemDestroy();
        return -2;
    }

3. The icp_dump file is 30MB and could not be uploaded in one piece, so I split it into 2 files. Please use

cat ICP_debug_14h_40m_49s_08d_07m_25y.*.gz. > ICP_debug_14h_40m_49s_08d_07m_25y.tar.gz

to reassemble it.

paulinusking · ‎07-08-2025

Here is the other attachment (I compressed the second attachment with tar czf because of the upload limit).

Please run tar xzf other.tgz first.

DiegoV_Intel · ‎07-22-2025

Hi,

I may be able to provide some help on this. I'd like first to understand your setup. How were you using Intel® QAT before the issues started? I mean, is your setup using the Intel® QAT driver along with some other application? Or it's just the sample application that you are trying to run and it's not working?

Do you have any steps I can follow to see if I'm able to reproduce the issue and I get the same results?

Regards,

Diego V.

paulinusking · ‎07-24-2025

We have written a compression server that uses QAT. The server is multi-process; each process uses a named section like "DATANODE_5xxx" to start its QAT instances (by calling icp_sal_userStart). According to /etc/4xxx*.conf, each process gets 4 * 8 = 32 instances.

Each server uses cpaDcNsCompressData/cpaDcNsDecompressData to compress or decompress.

While testing the servers, we frequently restart our services (kill -9, then start) while they are compressing or decompressing. After a certain number of restarts, the server can no longer start its QAT instances (icp_sal_userStart fails), and it never recovers until the OS is restarted.

DiegoV_Intel · ‎07-24-2025

Hi,

Thanks for that extra piece of information.

I was able to find this same behavior as a "fixed issue" in the QAT driver for hardware version 1.x. You are running on hardware version 2.0 so it is really not within scope, but the fact that it was an issue in a different hardware version makes me think of the possibility of a re-occurrence on this new hardware version.

I'll check with the development team to confirm this suspicion and share an update here.

In the meantime, these are the Release Notes where the fixed issue was documented (again, these are applicable to a different hardware version but just leaving them here for your reference): Intel® QuickAssist Technology Software for Linux* - Release Notes - Customer Enabling Release. See section 3.2.83.

Regards,

Diego V.

DiegoV_Intel · ‎07-31-2025

Hi,

I'm still investigating this issue. I now have a system where I can try to replicate the issue and see if I get the same result as you.

Do you have any script or reproduction steps I can follow? Are you able to replicate the issue with any of the sample codes available?

Regards,

Diego V.

DiegoV_Intel · ‎08-04-2025

Hi,

I've been trying to replicate the issue with no luck. I'm using a fresh environment with default configuration of the Intel QAT driver. To test compression and decompression services, I'm using the sample code initiated with the command ./build/cpa_sample_code runTests=32 dcLoops=1000 so that only compression/decompression operations are executed during an extended period of time.

I tried killing the process multiple times, but every time the QAT instances were up and ready again for another run. The only messages I see in the kernel are related to orphan rings which makes sense as I'm killing the process in the middle of a run, but I'm not seeing any of the reserved rings messages you posted above.

At this point, I'd need any script or consistent reproduction steps that I can follow to see if I can reproduce the issue. Are you able to replicate the issue with any of the sample codes available?

Regards,

Diego V.

paulinusking · ‎08-11-2025

Reproduction method:

1. Run the code: ./debug_qat2 $section_name
2. Then run the script: python3 read_smap.py $pid (pid is the PID of debug_qat2)
3. kill -9 $pid

The rings are then leaked. You can see this in either of two ways:

1. cat /sys/class/uio/uio7/device/uio_ctrl/bundle_*/rings_reserved
2. Run ./debug_qat2 $section_name again; it fails.

Attached: the debug_qat2 source code and Makefile, and read_smap.py.

paulinusking · ‎08-06-2025

The issue only reproduces in our app. I cannot reproduce it with the sample code.

I cannot provide my app code, and I have not yet found a way to reproduce it without the app code.

By the way, is there some method (like logging) we can use to debug when the issue occurs on our server?

DiegoV_Intel · ‎08-07-2025

Hi,

Typically, the related logs are the ones included in the ICP Debug package that you already ran. I don't see however specific data points that can give more insights on a debugging effort. Let me investigate what options are available to further debug this issue.

Regards,

Diego V.

paulinusking · ‎08-07-2025

I have added debug output to intel_qat.ko for when rings_reserved entries are added or removed.

When I reproduced the issue, the process adding the entries was the expected one, but the process removing them was not.

The removing process is prometheus-proc.

I think the bug only occurs when prometheus-process-exporter is in use.

PS:

1. The debug output is in the patch.

2. We use prometheus-process-exporter to monitor our servers.

3. The unexpected log is:

2025-08-07T19:22:19.416326+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: process 5788 prometheus-proc cur 107552 5788 prometheus-proc try put rings mask cur 3
2025-08-07T19:22:19.416554+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: message repeated 31 times: [ process 5788 prometheus-proc cur 107552 5788 prometheus-proc try put rings mask cur 3]
2025-08-07T21:45:34.444412+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: process 5788 prometheus-proc cur 117319 5788 prometheus-proc try put rings mask cur 3
2025-08-07T21:45:34.444437+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: message repeated 31 times: [ process 5788 prometheus-proc cur 117319 5788 prometheus-proc try put rings mask cur 3]
2025-08-07T22:51:19.332310+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: process 5788 prometheus-proc cur 6011 5788 prometheus-proc try put rings mask cur 3
2025-08-07T22:51:19.333233+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: message repeated 31 times: [ process 5788 prometheus-proc cur 6011 5788 prometheus-proc try put rings mask cur 3]
2025-08-07T23:51:34.106805+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: process 5788 prometheus-proc cur 5822 5788 prometheus-proc try put rings mask cur 3
2025-08-07T23:51:34.106838+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: message repeated 31 times: [ process 5788 prometheus-proc cur 5822 5788 prometheus-proc try put rings mask cur 3]
2025-08-08T00:55:18.914077+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: process 5788 prometheus-proc cur 17889 5788 prometheus-proc try put rings mask cur 3
2025-08-08T00:55:18.914106+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: message repeated 31 times: [ process 5788 prometheus-proc cur 17889 5788 prometheus-proc try put rings mask cur 3]
2025-08-08T03:19:04.092306+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: process 5788 prometheus-proc cur 23980 5788 prometheus-proc try put rings mask cur 3
2025-08-08T03:19:04.093753+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: message repeated 31 times: [ process 5788 prometheus-proc cur 23980 5788 prometheus-proc try put rings mask cur 3]
2025-08-08T06:12:19.262793+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: process 5788 prometheus-proc cur 117383 5788 prometheus-proc try put rings mask cur 3
2025-08-08T06:12:19.262840+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: message repeated 31 times: [ process 5788 prometheus-proc cur 117383 5788 prometheus-proc try put rings mask cur 3]
2025-08-08T08:18:19.044304+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: process 5788 prometheus-proc cur 173298 5788 prometheus-proc try put rings mask cur 3
2025-08-08T08:18:19.045741+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: message repeated 31 times: [ process 5788 prometheus-proc cur 173298 5788 prometheus-proc try put rings mask cur 3]
2025-08-08T08:51:19.040307+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: process 5788 prometheus-proc cur 117319 5788 prometheus-proc try put rings mask cur 3
2025-08-08T08:51:19.041587+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: message repeated 31 times: [ process 5788 prometheus-proc cur 117319 5788 prometheus-proc try put rings mask cur 3]
2025-08-08T08:53:19.060688+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: process 5788 prometheus-proc cur 117383 5788 prometheus-proc try put rings mask cur 3
2025-08-08T08:53:19.061904+08:00 bms-airtrunk-d-h20-v5-app-10-192-124-15 kernel: message repeated 31 times: [ process 5788 prometheus-proc cur 117383 5788 prometheus-proc try put rings mask cur 3]
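
A log like the one above can be summarized with a short script to confirm that every release attempt comes from the same command. This is only a sketch: the debug patch itself is not shown in the thread, so the field names below (pid, comm, cur, mask) are guesses based on the text of the messages, and the function name is hypothetical. The syslog "message repeated" lines are skipped.

```python
import re
from collections import Counter

# Field meanings are assumptions inferred from the message text:
# "process <pid> <comm> cur <cur> ... try put rings mask cur <mask>"
LOG_RE = re.compile(
    r"kernel: process (?P<pid>\d+) (?P<comm>\S+) cur (?P<cur>\d+) .*"
    r"try put rings mask cur (?P<mask>\d+)$"
)


def summarize(log_text):
    """Return (events per (pid, comm), set of numbers seen after the first 'cur')."""
    counts = Counter()
    cur_values = set()
    for line in log_text.splitlines():
        m = LOG_RE.search(line.rstrip())
        if m is None:
            continue  # e.g. "message repeated 31 times: [...]" lines
        counts[(int(m["pid"]), m["comm"])] += 1
        cur_values.add(int(m["cur"]))
    return counts, cur_values
```

If the first element of the result contains a single key such as (5788, "prometheus-proc"), all logged release attempts were made by that one process.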

paulinusking · ‎08-07-2025

The bug is this:

Normal case:

Our app process, which uses icp_sal_userStart, calls adf_ctl_ioctl_reserve_ring to add the rings_reserved entry and calls adf_uio_do_cleanup_orphan to remove it.

Unexpected case:

Our app process, which uses icp_sal_userStart, calls adf_ctl_ioctl_reserve_ring to add the rings_reserved entry, but the process exporter calls adf_uio_do_cleanup_orphan. The PID does not match, so the entry is leaked.
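
The mismatch described above can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not the intel_qat.ko code: the class, its methods, and the PIDs are hypothetical, and the only behavior modeled is "reservations are recorded under the reserving PID, and cleanup releases only what is recorded under the calling PID".

```python
class Bundle:
    """Toy model of one ring bundle; NOT the actual driver implementation."""

    def __init__(self):
        self.reserved = {}  # owner PID -> reserved ring mask

    def reserve(self, pid, mask):
        """Record a reservation under the calling PID, refusing overlaps."""
        in_use = 0
        for owned in self.reserved.values():
            in_use |= owned
        if in_use & mask:
            raise RuntimeError(f"rings {mask:#06x} already reserved")
        self.reserved[pid] = mask

    def cleanup_orphan(self, caller_pid):
        """Release only the rings recorded under the *calling* PID."""
        return self.reserved.pop(caller_pid, 0)


APP_PID = 117319      # illustrative value for the app process
EXPORTER_PID = 5788   # illustrative value for the monitoring process

bundle = Bundle()
bundle.reserve(APP_PID, 0x0003)

# Cleanup runs in the exporter's context, so nothing is found under its PID.
released = bundle.cleanup_orphan(EXPORTER_PID)

# The app has exited, but its entry is still recorded: the rings are leaked.
leaked = dict(bundle.reserved)
```

In this model a later bundle.reserve(new_pid, 0x0001) raises "rings 0x0001 already reserved", which mirrors the kernel message quoted at the start of the thread. Had cleanup_orphan been called with APP_PID, the entry would have been removed.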

DiegoV_Intel · ‎08-08-2025

Hi,

Thanks for sharing this additional information. Let me investigate and check with our development team.

Regards,

Diego V.

DiegoV_Intel · ‎08-08-2025

Hi,

I just realized you had also opened a support ticket through the Intel Premier support channel. Let's continue over that channel so this potential bug can receive the proper priority level.

Regards,

Diego V.