
PCM reports low QAT card PCIe traffic

Alexander_Alexeev

Hello,

I am having trouble monitoring QAT card PCIe traffic with PCM. Two cards generate ~5800 MB/s of memory read traffic and ~3100 MB/s of memory write traffic over PCIe, but the numbers reported by ./pcm-pcie.x are not even close: only ~100 MB/s is reported for both reads and writes.

Could you clarify the possible reason?

QAT -  https://01.org/packet-processing/intel%C2%AE-quickassist-technology-drivers-and-patches

 

Environment and tools output

Fedora release 16 (Verne)

Linux intel45 3.1.0-7.fc16.x86_64 #1 SMP Tue Nov 1 21:10:48 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

CPU Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz

Two QAT cards (Intel Quick Assist Adapter 8950). Both are plugged into the socket 1 (node_id=1) root complex.

Intel® QuickAssist Technology Driver (L.2.2.0-30), QAT 1.6

 

 

OUTPUT

./pcm-pcie.x 5

Skt | PCIeRdCur | PCIeNSRd  | PCIeWiLF | PCIeItoM | PCIeNSWr | PCIeNSWrF
 0       130 K         0           0          0          0          0
 1       465 M         0           0        247 M        0          0
-----------------------------------------------------------------------------------
 *        465 M         0           0        495 M        0          0

 

./pcm-memory.x  OUTPUT

---------------------------------------||---------------------------------------
--             Socket 0              --||--             Socket 1              --
---------------------------------------||---------------------------------------
---------------------------------------||---------------------------------------
---------------------------------------||---------------------------------------
--   Memory Performance Monitoring   --||--   Memory Performance Monitoring   --
---------------------------------------||---------------------------------------
--  Mem Ch 0: Reads (MB/s):  736.92  --||--  Mem Ch 0: Reads (MB/s):  946.24  --
--            Writes(MB/s):  407.56  --||--            Writes(MB/s):  466.91  --
--  Mem Ch 1: Reads (MB/s):  723.93  --||--  Mem Ch 1: Reads (MB/s):  924.22  --
--            Writes(MB/s):  393.20  --||--            Writes(MB/s):  411.42  --
--  Mem Ch 2: Reads (MB/s):  722.76  --||--  Mem Ch 2: Reads (MB/s):  948.64  --
--            Writes(MB/s):  411.03  --||--            Writes(MB/s):  468.38  --
--  Mem Ch 3: Reads (MB/s):  725.39  --||--  Mem Ch 3: Reads (MB/s):  950.83  --
--            Writes(MB/s):  406.36  --||--            Writes(MB/s):  461.23  --
-- NODE0 Mem Read (MB/s):   2909.00  --||-- NODE1 Mem Read (MB/s):   3769.93  --
-- NODE0 Mem Write (MB/s):  1618.14  --||-- NODE1 Mem Write (MB/s):  1807.94  --
-- NODE0 P. Write (T/s) :    140529  --||-- NODE1 P. Write (T/s):     143146  --
-- NODE0 Memory (MB/s):     4527.14  --||-- NODE1 Memory (MB/s):     5577.87  --
---------------------------------------||---------------------------------------
--                   System Read Throughput(MB/s):   6678.93                  --
--                  System Write Throughput(MB/s):   3426.08                  --
--                 System Memory Throughput(MB/s):  10105.01                  --
---------------------------------------||---------------------------------------

 

QAT compression benchmark OUTPUT 

---------------------------------------
API                    Data_Plane
Session State          STATELESS
Algorithm              DEFLATE
Huffman Type           STATIC
Mode                   ASYNCHRONOUS
Direction              COMPRESS
Packet Size            8192
Compression Level      1
Corpus                 CALGARY_CORPUS
Number of threads      24
Total Responses        3801600
Total Retries          122831954
Clock Cycles Start     2521268357176456
Clock Cycles End       2521279034160452
Total Cycles           10676983996
CPU Frequency(kHz)     1995869
Throughput(Mbps)       46577
Compression Ratio      45.2%
---------------------------------------

 

Patrick_L_Intel
Employee

Hi Alex,

The pcm-pcie stats look fine to me. A few key points:

  1. You are sampling at a 5-second interval, so to get a per-second rate you need to divide the event counts by 5.
  2. The event counts are in units of cache lines (64 bytes each), so to estimate bandwidth in bytes you then need to multiply the count by 64, or use the -B flag.

Applying these two rules, we can assess the I/O bandwidth.

$ = cache line

PCIeItoM (inbound allocating write): 247 M$ / 5 seconds * 64 bytes/$ = 3161.6 MB/s

PCIeRdCur (outbound read): 465 M$ / 5 seconds * 64 bytes/$ = 5952 MB/s

which correlates with your statement quite closely.
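The two rules above can be sketched as a small conversion helper. This is only an illustration: the name cacheLinesToMBps is hypothetical and not part of PCM, and it assumes decimal megabytes (10^6 bytes) and a 64-byte cache line, as in the arithmetic above.

```cpp
#include <cassert>
#include <cstdint>

// Cache line size in bytes, as stated in the rules above.
constexpr double kCacheLineBytes = 64.0;

// Hypothetical helper (not a PCM function): converts a raw event count,
// expressed in cache lines and collected over interval_s seconds,
// into decimal megabytes per second.
double cacheLinesToMBps(std::uint64_t cache_lines, double interval_s)
{
    assert(interval_s > 0.0);  // a zero or negative interval is meaningless
    const double bytes = static_cast<double>(cache_lines) * kCacheLineBytes;
    return bytes / interval_s / 1e6;
}
```

With the counts from the output above, cacheLinesToMBps(247000000, 5.0) should give about 3161.6 and cacheLinesToMBps(465000000, 5.0) about 5952.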

Best Regards,

Patrick

Alexander_Alexeev

Thanks, my bad!

Patrick_L_Intel
Employee

You're welcome. No problem :)

Alexander_Alexeev

Hi, I tried running ./pcm-pcie.x -B, and the write total does not seem to match the sum over the sockets, unless I am making some silly mistake again.

Detected Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz "Intel(r) microarchitecture codename Sandy Bridge-EP/Jaketown"

Update every 5 seconds
delay_ms: 417
Skt | PCIeRdCur | PCIeNSRd  | PCIeWiLF | PCIeItoM | PCIeNSWr | PCIeNSWrF | PCIe Rd (B) | PCIe Wr (B)
 0       130 K         0           0          0          0          0          8358 K            0
 1       465 M         0           0        247 M        0          0            29 G           15 G
----------------------------------------------------------------------------------------------------------------
 *        465 M         0           0        495 M        0          0            29 G           31 G

 

Patrick_L_Intel
Employee

 

Thanks for reporting your issue. I double-checked the code, and we have a typo in one of the event aggregations that causes PCIeItoM to be counted twice in the total sum. We will roll out another version with the fix soon, but if you want to patch this one line of code manually, here is the diff:

diff --git a/pcm-pcie.cpp b/pcm-pcie.cpp
index cdd9847..d541b88 100644
--- a/pcm-pcie.cpp
+++ b/pcm-pcie.cpp
@@ -838,7 +838,7 @@ void getPCIeEvents(PCM *m, PCM::PCIeEventCode opcode, uint32 delay_ms, sample_t
                 sample.total.PCIeNSWr += (sizeof(PCIeEvents_t)/sizeof(uint64)) * getNumberOfEvents(before, after);
                 sample.miss.PCIeNSWr += (sizeof(PCIeEvents_t)/sizeof(uint64)) * getNumberOfEvents(before2, after2);
                 sample.hit.PCIeNSWr += (sample.total.PCIeNSWr > sample.miss.PCIeNSWr) ? sample.total.PCIeNSWr - sample.miss.PCIeNSWr : 0;
-                aggregate_sample.PCIeItoM += sample.total.PCIeItoM;
+                aggregate_sample.PCIeNSWr += sample.total.PCIeNSWr;
                 break;
             case PCM::PCIeNSWrF:
                 sample.total.PCIeNSWrF += (sizeof(PCIeEvents_t)/sizeof(uint64)) * getNumberOfEvents(before, after);
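To see why a single wrong field name doubles the total, here is a reduced sketch of the aggregation. It is not PCM's actual code: the Counts struct, the Event enum, and the aggregate function are hypothetical stand-ins, and the loop structure (visiting each event once per socket) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, reduced stand-in for PCM's per-event counters.
struct Counts {
    std::uint64_t PCIeItoM = 0;
    std::uint64_t PCIeNSWr = 0;
};

enum class Event { PCIeItoM, PCIeNSWr };

// Accumulates per-socket counts into a system-wide total, visiting each
// event once per socket. With buggy == true, the PCIeNSWr case adds
// PCIeItoM again (the copy-paste slip shown in the diff above).
Counts aggregate(const std::vector<Counts>& sockets, bool buggy)
{
    Counts total;
    for (const Counts& s : sockets) {
        for (Event e : {Event::PCIeItoM, Event::PCIeNSWr}) {
            switch (e) {
            case Event::PCIeItoM:
                total.PCIeItoM += s.PCIeItoM;
                break;
            case Event::PCIeNSWr:
                if (buggy)
                    total.PCIeItoM += s.PCIeItoM;  // wrong field: ItoM added twice
                else
                    total.PCIeNSWr += s.PCIeNSWr;  // fixed: aggregate the right field
                break;
            }
        }
    }
    return total;
}
```

Feeding in socket counts of 0 and 247000000 for PCIeItoM, the buggy path totals 494000000 while the fixed path totals 247000000. That is consistent with the reported total of roughly twice the socket 1 value; the small gap between 494 M and the displayed 495 M is presumably due to rounding of the printed per-socket figure.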

Thanks again for reporting the bug!

Sincerely,

Patrick
