Does HugePages improve the performance of QAT?

GeekPwnStyle · ‎03-20-2025

Hello everyone,

While optimizing Haproxy's CPS (connections per second) with QAT, I ran into some issues when trying to use hugepage memory for a further performance improvement.

1. insmod $MOD_PATH/usdm_drv.ko max_huge_pages=500 max_huge_pages_per_process=10

2. AnonHugePages: 1101824 kB
ShmemHugePages: 0 kB
HugePages_Total: 1024
HugePages_Free: 1024
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 2097152 kB

3. mkdir /dev/hugepages/qat

After Haproxy starts with QAT HugePage support, "HugePages_Free: 1024" changes to "HugePages_Free: 1017", which indicates that QAT is using HugePages. However, when I went on to test CPS, the CPS performance did not improve.
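
The drop in free pages can be converted into a memory footprint. The following is only a minimal Python sketch, assuming the /proc/meminfo field names shown in step 2; the helper names parse_meminfo and hugepages_in_use are hypothetical and exist only for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not rest.strip():
            continue
        # Values look like "1024" or "2048 kB"; keep only the number.
        fields[name.strip()] = int(rest.split()[0])
    return fields


def hugepages_in_use(fields):
    """Return (pages in use, footprint in kB) from parsed meminfo fields."""
    used = fields["HugePages_Total"] - fields["HugePages_Free"]
    return used, used * fields["Hugepagesize"]


# Values taken from the post, after Haproxy started.
sample = """HugePages_Total: 1024
HugePages_Free: 1017
Hugepagesize: 2048 kB"""

pages, kb = hugepages_in_use(parse_meminfo(sample))
print(f"{pages} huge pages in use, {kb // 1024} MiB")
```

With the numbers above this reports 7 pages, or 14 MiB. On a live system the same functions could be fed the contents of /proc/meminfo instead of the sample string.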

Do the R&D engineers have any test documents or comparative data showing that enabling HugePages further increases the processing capacity of QAT?

[GENERAL]
ServicesEnabled = cy

ServicesProfile = DEFAULT

ConfigVersion = 2

#Default values for number of concurrent requests*/
CyNumConcurrentSymRequests = 512
CyNumConcurrentAsymRequests = 64

#Statistics, valid values: 1,0
statsGeneral = 1
statsDh = 1
statsDrbg = 1
statsDsa = 1
statsEcc = 1
statsKeyGen = 1
statsDc = 1
statsLn = 1
statsPrime = 1
statsRsa = 1
statsSym = 1

DcIntermediateBufferSizeInKB = 64

# This flag is to enable device auto reset on heartbeat error
AutoResetOnError = 0

##############################################
# Kernel Instances Section
##############################################
[KERNEL]
NumberCyInstances = 0
NumberDcInstances = 0

##############################################
# User Process Instance Section
##############################################
[SHIM]
NumberCyInstances = 4
NumberDcInstances = 0
NumProcesses = 4
LimitDevAccess = 1

# - User instance #0
Cy0Name = "UserCY0"
Cy0IsPolled = 2
# List of core affinities
Cy0CoreAffinity = 0

# - User instance #1
Cy1Name = "UserCY1"
Cy1IsPolled = 2
# List of core affinities
Cy1CoreAffinity = 0

# - User instance #2
Cy2Name = "UserCY2"
Cy2IsPolled = 2
# List of core affinities
Cy2CoreAffinity = 0

# - User instance #3
Cy3Name = "UserCY3"
Cy3IsPolled = 2
# List of core affinities
Cy3CoreAffinity = 0

qat_service status
Checking status of all devices.
There is 3 QAT acceleration device(s) in the system:
qat_dev0 - type: c6xx, inst_id: 0, node_id: 1, bsf: 0000:84:00.0, #accel: 5 #engines: 10 state: up
qat_dev1 - type: c6xx, inst_id: 1, node_id: 1, bsf: 0000:85:00.0, #accel: 5 #engines: 10 state: up
qat_dev2 - type: c6xx, inst_id: 2, node_id: 1, bsf: 0000:86:00.0, #accel: 5 #engines: 10 state: up

Ronny_G_Intel · ‎03-21-2025

Hi GeekPwnStyle,

I see that you have configured HugePages with max_huge_pages=500 and max_huge_pages_per_process=10.

Are these values appropriate for your workload and system memory?

The HugePages_Free count dropping from 1024 to 1017 indicates that HugePages are being used, but it doesn't mean that the allocation is optimal for your specific workload.

Consider increasing max_huge_pages_per_process if Haproxy can benefit from more HugePages (this needs to be further investigated).
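
As a rough way to put the two module parameters in perspective, here is a small budget calculation. It is only a sketch under assumptions: that max_huge_pages caps the total number of huge pages the usdm driver may hand out, that max_huge_pages_per_process caps each process, and that one page is 2048 kB as in the posted meminfo output. The function name is hypothetical.

```python
def usdm_hugepage_budget(max_huge_pages, max_per_process, num_processes, page_kb=2048):
    """Summarise a huge page budget under the assumptions stated above."""
    requested = max_per_process * num_processes
    return {
        "per_process_mib": max_per_process * page_kb // 1024,
        "all_processes_mib": requested * page_kb // 1024,
        "fits_in_total": requested <= max_huge_pages,
    }


# 500 and 10 come from the insmod line, 4 from NumProcesses in [SHIM].
print(usdm_hugepage_budget(500, 10, 4))
```

Under these assumptions each process is limited to 20 MiB of huge page memory, and all four processes together stay well within the 500-page total.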

You have enabled cryptographic services (cy) with 4 user instances. Are these instances being utilized by Haproxy?

How about core affinities (Cy0CoreAffinity, Cy1CoreAffinity,...)? Are these cores not heavily loaded with other tasks?
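
One quick check is whether several instances are pinned to the same core. The sketch below is an illustration only: the function name is hypothetical, and it assumes each CyNCoreAffinity line holds a single core number, as in the posted config.

```python
import re
from collections import defaultdict


def shared_core_affinities(conf_text):
    """Map each core to the Cy instances pinned to it, keeping only shared cores."""
    by_core = defaultdict(list)
    # Assumption: one integer per affinity line; core lists are not handled.
    pattern = re.compile(r"^\s*Cy(\d+)CoreAffinity\s*=\s*(\d+)\s*$")
    for line in conf_text.splitlines():
        match = pattern.match(line)
        if match:
            instance, core = int(match.group(1)), int(match.group(2))
            by_core[core].append(instance)
    return {core: inst for core, inst in by_core.items() if len(inst) > 1}


# Affinity lines as they appear in the posted [SHIM] section.
sample = """
Cy0CoreAffinity = 0
Cy1CoreAffinity = 0
Cy2CoreAffinity = 0
Cy3CoreAffinity = 0
"""

print(shared_core_affinities(sample))
```

For the posted config this shows all four instances (0 to 3) sharing core 0.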

I will be doing some more research and will get back to you.

I hope this helps.

Regards,

Ronny G

Ronny_G_Intel · ‎03-25-2025

Hi GeekPwnStyle,

Did you have a chance to look into my previous post?

Thanks,

Ronny G

Ronny_G_Intel · ‎03-27-2025

Hi GeekPwnStyle,

I am just following up on this issue. Please let me know if you have any updates.

Thanks,

Ronny G

Ronny_G_Intel · ‎03-31-2025

Hi GeekPwnStyle,

I hope everything is going well for you. Since I haven't heard from you in some time, I will be closing the internal ticket we opened for this matter. The community will still be active, but I will no longer be monitoring this issue. If you require further assistance, please feel free to start a new community post.

Regards,

Ronny G

GeekPwnStyle · ‎04-02-2025

Dear Ronny G,

I hope this message finds you well.

First and foremost, I would like to sincerely thank you for your persistent follow-up on this matter. I deeply apologize for the delayed response due to my recent work commitments, which required me to be away and unable to monitor the community updates promptly. I regret any inconvenience this may have caused.

Regarding the QAT accelerator card performance issue we discussed earlier, our subsequent testing produced results that appear consistent with the specifications in Intel’s QAT documentation. Our high-end device already has a high baseline CPS (Connections Per Second). After enabling QAT hardware acceleration, the TPS (Transactions Per Second) improvement fell short of expectations, but we observed a significant reduction in CPU load and latency, which remains a valuable optimization. We hypothesize that the current workload may be approaching or exceeding the QAT accelerator’s maximum supported TPS capacity, thereby limiting further CPS gains. This hypothesis is further supported by our test with huge pages, which showed no additional CPS improvement.

To better diagnose this scenario, I would like to ask: during Intel’s internal benchmarking of QAT hardware (HW) versus software (SW) acceleration, are there any real-time monitoring tools, similar to top, that provide visibility into QAT utilization metrics (e.g., workload utilization, queue depth, or throughput)? Such tools would help us pinpoint whether the bottleneck stems from QAT’s hardware limits or from configuration that could still be optimized.

Your continued guidance is greatly appreciated. If any tools or documentation could be shared, it would significantly aid our analysis.

Thank you once again for your patience and support.

Best regards,
GeekPwnStyle