Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Understanding output from pcm-numa.x and pcm-memory.x

Jeremie_Lagraviere

Hi everyone,

I am currently using pcm-numa.x to measure memory data traffic.

I would like to be sure that I am interpreting the results provided by pcm-numa.x (called with the external_program parameter) correctly.

Here is an example:

Core | IPC  | Instructions | Cycles  |  Local DRAM accesses | Remote DRAM Accesses 
   0   1.06        359 G      340 G      3096 M              1151 M              
   1   1.06        359 G      339 G      3589 M               653 M              
   2   1.06        359 G      339 G      3459 M               787 M              
   3   1.06        360 G      339 G      3602 M               652 M              
   4   1.06        359 G      338 G      3435 M               808 M              
   5   1.06        359 G      339 G      3629 M               614 M              
   6   1.05        359 G      341 G      3080 M              1164 M              
   7   1.06        360 G      339 G      3132 M              1117 M              
   8   1.05        359 G      341 G      3237 M              1007 M              
   9   1.05        360 G      342 G      3565 M               693 M              
  10   1.05        359 G      341 G      3585 M               660 M              
  11   1.05        360 G      341 G      3441 M               806 M              
  12   1.06        360 G      340 G      3512 M               734 M              
  13   1.06        360 G      341 G      3374 M               879 M              
  14   1.05        358 G      341 G      3208 M              1027 M              
  15   1.05        359 G      340 G      3557 M               681 M              
  16   1.06        359 G      339 G      3691 M               551 M              
  17   1.06        359 G      340 G      3572 M               671 M              
  18   1.06        359 G      339 G      3527 M               719 M              
  19   1.05        359 G      340 G      2529 M              1714 M              
  20   1.06        360 G      338 G      3654 M               592 M              
  21   1.06        359 G      339 G      3666 M               578 M              
  22   1.05        360 G      342 G      2313 M              1941 M              
  23   1.06        359 G      338 G      3420 M               824 M              
  24   1.05        359 G      342 G      3656 M               591 M              
  25   1.05        360 G      343 G      3388 M               869 M              
  26   1.05        360 G      342 G      3332 M               918 M              
  27   1.05        359 G      341 G      3708 M               536 M              
  28   1.06        360 G      341 G      3517 M               736 M              
  29   1.06        360 G      340 G      3220 M              1046 M              
  30   1.05        360 G      341 G      3265 M               985 M              
  31   1.06        360 G      341 G      2821 M              1431 M              
-------------------------------------------------------------------------------------------------------------------
   *   1.06         11 T       10 T       107 G                28 G     

 

Local DRAM accesses: is this the number of RAM accesses made during the whole program run? In other words, is this value not expressed in bytes, but simply a count of the number of accesses?

Same question for remote DRAM accesses.

I am also not sure that I understand precisely what IPC, Instructions and Cycles mean. Are they counters too, i.e. not expressed in bytes, but simply counting instructions and cycles?

_________________________________________________________________________________________________

Now with pcm-memory.x (called with the external_program parameter), here is an example:

---------------------------------------||---------------------------------------
--             Socket 0              --||--             Socket 1              --
---------------------------------------||---------------------------------------
---------------------------------------||---------------------------------------
---------------------------------------||---------------------------------------
--   Memory Performance Monitoring   --||--   Memory Performance Monitoring   --
---------------------------------------||---------------------------------------
--  Mem Ch 0: Reads (MB/s): 6870.81  --||--  Mem Ch 0: Reads (MB/s): 7406.36  --
--            Writes(MB/s): 1805.03  --||--            Writes(MB/s): 1951.25  --
--  Mem Ch 1: Reads (MB/s): 6873.91  --||--  Mem Ch 1: Reads (MB/s): 7411.11  --
--            Writes(MB/s): 1810.86  --||--            Writes(MB/s): 1957.73  --
--  Mem Ch 2: Reads (MB/s): 6866.77  --||--  Mem Ch 2: Reads (MB/s): 7403.39  --
--            Writes(MB/s): 1804.38  --||--            Writes(MB/s): 1951.42  --
--  Mem Ch 3: Reads (MB/s): 6867.47  --||--  Mem Ch 3: Reads (MB/s): 7403.66  --
--            Writes(MB/s): 1805.53  --||--            Writes(MB/s): 1950.95  --
-- NODE0 Mem Read (MB/s):  27478.96  --||-- NODE1 Mem Read (MB/s):  29624.51  --
-- NODE0 Mem Write (MB/s):  7225.79  --||-- NODE1 Mem Write (MB/s):  7811.36  --
-- NODE0 P. Write (T/s) :    214810  --||-- NODE1 P. Write (T/s):     238294  --
-- NODE0 Memory (MB/s):    34704.75  --||-- NODE1 Memory (MB/s):    37435.87  --
---------------------------------------||---------------------------------------
--                   System Read Throughput(MB/s):  57103.47                  --
--                  System Write Throughput(MB/s):  15037.15                  --
--                 System Memory Throughput(MB/s):  72140.62                  --
---------------------------------------||---------------------------------------

In this example, what is T/s? What does the T stand for?

Is there any way to use these values to retrieve the total amount of data that was read from and written to memory during the program's execution?

_______

Bigger picture, bigger questions:

Also, is there any way with Intel PCM to measure the data traffic between the L3 cache and RAM? If yes, which functions of the library should I rely on?

_______

Thanks in advance for your help.

Thomas_W_Intel
Employee

Jeremie,

  • When you call an external program with pcm-numa, the reported numbers are for the full run. For example, your program executed about 359 billion instructions on core 0. About 340 billion CPU cycles were used on core 0 for that. "IPC" is simply the ratio of these two values, i.e. "instructions per cycle". Memory access is measured in bytes. So, total access to local memory was about 107GB, whereas access to remote memory was about 28GB. You need to divide this by the running time in order to compute the bandwidth.
  • pcm-memory is different in that it reports the bandwidth in MB/s. The only exception is partial writes, where the number of transactions per second (T/s) is reported instead of the amount of data.
  • In pcm, the traffic between L3 and DRAM is reported as:

        cout << "    " << getBytesReadFromMC(sstate1, sstate2) / double(1024ULL * 1024ULL * 1024ULL)
             << "    " << getBytesWrittenToMC(sstate1, sstate2) / double(1024ULL * 1024ULL * 1024ULL);

     MC stands for "memory controller". All data that is transferred to and from memory has to pass through it, including data going to or from cores on the same socket, cores on other sockets, and PCIe devices. A short end-to-end sketch of how to use these counters is given below.
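
If you want to do this measurement directly from your own code, here is a minimal sketch of how the PCM library is typically driven around a region of interest. It is only an illustration: run_workload() is a placeholder for the code you want to measure, and the PCM calls used here (PCM::getInstance, program, getSystemCounterState, getBytesReadFromMC, getBytesWrittenToMC, cleanup) should be checked against the headers of the PCM version you have installed.

    #include "cpucounters.h"   // Intel PCM
    #include <chrono>
    #include <iostream>

    static void run_workload()
    {
        // placeholder: the code whose memory traffic you want to measure
    }

    int main()
    {
        PCM * m = PCM::getInstance();
        if (m->program() != PCM::Success) return 1;      // program the performance counters

        SystemCounterState before = getSystemCounterState();
        auto t0 = std::chrono::steady_clock::now();

        run_workload();

        auto t1 = std::chrono::steady_clock::now();
        SystemCounterState after = getSystemCounterState();

        const double GiB      = double(1024ULL * 1024ULL * 1024ULL);
        const double seconds  = std::chrono::duration<double>(t1 - t0).count();
        const double readGiB  = getBytesReadFromMC(before, after)  / GiB;   // bytes read from the memory controllers
        const double writeGiB = getBytesWrittenToMC(before, after) / GiB;   // bytes written to the memory controllers

        std::cout << "MC read : " << readGiB  << " GiB (" << readGiB  / seconds << " GiB/s)\n"
                  << "MC write: " << writeGiB << " GiB (" << writeGiB / seconds << " GiB/s)\n";

        m->cleanup();
        return 0;
    }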

Kind regards

Thomas

Jeremie_Lagraviere

Thanks a lot for these explanations :)

I have a few points I would like to clarify:

With pcm-numa.x, I have trouble understanding the word "access".

For example:

Core | IPC  | Instructions | Cycles  |  Local DRAM accesses | Remote DRAM Accesses 
   0   1.06        359 G      340 G      3096 M              1151 M              
 

What does 3096 M represent?

Does it mean that 3096 megabytes were read by core 0 over the whole run?

Or that 3096 megabytes were written by core 0 over the whole run?

Or that 3096 megabytes were read and written by core 0 over the whole run?

_____

EDIT:

I also have some questions about the output from pcm.x:

 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK |  READ | WRITE | TEMP

   0    0     1.05   1.01   1.04    1.18      19 M     26 M    0.27    0.82    0.18    0.01     N/A     N/A     48
   1    0     1.05   1.01   1.04    1.18      18 M     26 M    0.27    0.82    0.18    0.01     N/A     N/A     48
   2    0     1.05   1.02   1.04    1.18      19 M     26 M    0.27    0.82    0.18    0.01     N/A     N/A     49
   3    0     1.04   1.01   1.03    1.18      19 M     26 M    0.27    0.82    0.18    0.01     N/A     N/A     48
   4    0     1.05   1.01   1.04    1.18      19 M     25 M    0.26    0.82    0.18    0.01     N/A     N/A     47
   5    0     1.05   1.01   1.04    1.18      19 M     25 M    0.26    0.82    0.18    0.01     N/A     N/A     48
   6    0     1.05   1.01   1.04    1.18      19 M     26 M    0.27    0.82    0.18    0.01     N/A     N/A     49
   7    0     1.05   1.01   1.04    1.18      19 M     26 M    0.27    0.82    0.18    0.01     N/A     N/A     49
   8    1     1.05   1.02   1.03    1.18      19 M     27 M    0.26    0.82    0.19    0.01     N/A     N/A     55
   9    1     1.05   1.02   1.02    1.18      19 M     27 M    0.26    0.81    0.19    0.01     N/A     N/A     57
  10    1     1.05   1.02   1.03    1.18      19 M     26 M    0.25    0.82    0.19    0.01     N/A     N/A     61
  11    1     1.04   1.03   1.01    1.18      18 M     24 M    0.28    0.83    0.17    0.01     N/A     N/A     53
  12    1     1.05   1.03   1.02    1.18      18 M     25 M    0.26    0.82    0.18    0.01     N/A     N/A     47
  13    1     1.06   1.03   1.03    1.18      19 M     25 M    0.26    0.82    0.18    0.01     N/A     N/A     47
  14    1     1.05   1.03   1.02    1.18      18 M     25 M    0.27    0.83    0.18    0.01     N/A     N/A     52
  15    1     1.06   1.03   1.03    1.18      21 M     28 M    0.25    0.80    0.20    0.01     N/A     N/A     48
  16    0     1.05   1.01   1.04    1.18      19 M     26 M    0.27    0.82    0.18    0.01     N/A     N/A     48
  17    0     1.05   1.01   1.04    1.18      19 M     26 M    0.27    0.82    0.18    0.01     N/A     N/A     49
  18    0     1.04   1.01   1.04    1.18      19 M     27 M    0.27    0.81    0.18    0.01     N/A     N/A     49
  19    0     1.05   1.02   1.03    1.18      20 M     27 M    0.28    0.81    0.19    0.01     N/A     N/A     49
  20    0     1.05   1.01   1.04    1.18      18 M     25 M    0.27    0.82    0.18    0.01     N/A     N/A     47
  21    0     1.05   1.01   1.04    1.18      19 M     26 M    0.27    0.82    0.18    0.01     N/A     N/A     48
  22    0     1.05   1.01   1.04    1.18      19 M     26 M    0.26    0.82    0.18    0.01     N/A     N/A     49
  23    0     1.05   1.01   1.04    1.18      19 M     25 M    0.27    0.82    0.18    0.01     N/A     N/A     48
  24    1     1.05   1.02   1.02    1.18      18 M     25 M    0.29    0.82    0.17    0.01     N/A     N/A     55
  25    1     1.05   1.02   1.02    1.18      18 M     25 M    0.27    0.82    0.18    0.01     N/A     N/A     57
  26    1     1.05   1.03   1.02    1.18      18 M     25 M    0.26    0.82    0.18    0.01     N/A     N/A     60
  27    1     1.05   1.03   1.03    1.18      19 M     26 M    0.27    0.82    0.18    0.01     N/A     N/A     53
  28    1     1.05   1.02   1.03    1.18      18 M     25 M    0.27    0.82    0.18    0.01     N/A     N/A     47
  29    1     1.05   1.03   1.02    1.18      18 M     25 M    0.27    0.83    0.17    0.01     N/A     N/A     47
  30    1     1.05   1.03   1.02    1.18      18 M     25 M    0.28    0.83    0.17    0.01     N/A     N/A     52
  31    1     1.05   1.03   1.02    1.18      18 M     25 M    0.27    0.82    0.18    0.01     N/A     N/A     48
-----------------------------------------------------------------------------------------------------------------------------
 SKT    0     1.05   1.01   1.04    1.18     308 M    420 M    0.27    0.82    0.18    0.01    213.24    58.78     45
 SKT    1     1.05   1.03   1.02    1.18     304 M    414 M    0.27    0.82    0.18    0.01    206.95    56.91     47
-----------------------------------------------------------------------------------------------------------------------------
 TOTAL  *     1.05   1.02   1.03    1.18     612 M    835 M    0.27    0.82    0.18    0.01    420.19    115.69     N/A

 Instructions retired:  622 G ; Active cycles:  611 G ; Time (TSC):   18 Gticks ; C0 (active,non-halted) core residency: 87.17 %

 C1 core residency: 0.35 %; C3 core residency: 0.00 %; C6 core residency: 0.00 %; C7 core residency: 12.47 %;
 C2 package residency: 2.69 %; C3 package residency: 0.00 %; C6 package residency: 8.68 %; C7 package residency: 0.00 %;

 PHYSICAL CORE IPC                 : 2.04 => corresponds to 50.92 % utilization for cores in active state
 Instructions per nominal CPU cycle: 2.10 => corresponds to 52.48 % core utilization over time interval

Intel(r) QPI data traffic estimation in bytes (data traffic coming to CPU/socket through QPI links):

              | 
----------------------------------------------------------------------------------------------
 SKT    0     |  
 SKT    1     |  
----------------------------------------------------------------------------------------------
Total QPI incoming data traffic:    0       QPI data traffic/Memory controller traffic: 0.00

Intel(r) QPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through QPI links):

              | 
----------------------------------------------------------------------------------------------
 SKT    0     |  
 SKT    1     |  
----------------------------------------------------------------------------------------------
Total QPI outgoing data and non-data traffic:    0  

----------------------------------------------------------------------------------------------
 SKT    0 package consumed 652.74 Joules
 SKT    1 package consumed 638.81 Joules
----------------------------------------------------------------------------------------------
 TOTAL:                    1291.55 Joules

----------------------------------------------------------------------------------------------
 SKT    0 DIMMs consumed 205.14 Joules
 SKT    1 DIMMs consumed 253.05 Joules
----------------------------------------------------------------------------------------------
 TOTAL:                  458.19 Joules

 

I have trouble understanding the meaning of these values:

L3MISS and L2MISS: what is the unit here? Megabytes?

READ and WRITE: do the global values mean that 420 GB were read from memory and 115 GB were written to memory? Am I right?

McCalpinJohn
Honored Contributor III

I find it very helpful to use a code with a known memory access pattern (such as STREAM) to help understand reports such as these.

Using a version of STREAM compiled with STREAM_ARRAY_SIZE=20000000, the code reports that each array occupies 152.6 MiB.  The code was compiled with NTIMES=10.   Across the 4 kernels there are 6 reads per loop iteration and 4 writes per loop iteration, plus a few additional array reads (at least 2) and a few array writes (at least 4) in the initialization and validation code.   So the estimated traffic comes out to

20000000 elements * 8 Bytes/element * (2 reads + 4 writes + 10 iterations*(6 reads + 4 writes)) = 1.696e10 Bytes, which at 64 Bytes per cache line is about 265 million cache lines

The pcm-numa.x output showed 264 million local DRAM accesses when I bound both the code and the data to socket 0:

./pcm-numa.x -- numactl --membind=0 --physcpubind=3 ./stream_5-11.snb.uni.sse

Core | IPC  | Instructions | Cycles  |  Local DRAM accesses | Remote DRAM Accesses
   0   0.25        502 K     2042 K        11 K              1806                
   1   0.37       7130 K       19 M      9470                1623                
   2   0.20       2813 K       13 M        32 K              2915                
   3   0.42       2794 M     6599 M       264 M               482 K              
   4   0.24        304 K     1273 K      8163                 969                
   5   0.39        245 K      629 K      2880                 451                
   6   0.43        202 K      473 K      2076                 273                
   7   0.20        396 K     1987 K        17 K              1773                
   8   0.62       1130 K     1835 K       351                 531                
   9   1.10       6722 K     6118 K      1411                1616                
  10   0.45        323 K      714 K       118                  88                
  11   0.52        427 K      823 K        96                 113                
  12   0.50        390 K      779 K       105                 327                
  13   0.43        257 K      597 K        47                  31                
  14   0.46        350 K      755 K        83                  79                
  15   0.26       1827 K     7144 K      1149                1037                
-------------------------------------------------------------------------------------------------------------------
   *   0.42       2817 M     6658 M       264 M               495 K              

 

The pcm-numa.x output showed 265 million remote DRAM accesses when I bound the data to socket 0 and the code to socket 1:

./pcm-numa.x -- numactl --membind=0 --physcpubind=11 ./stream_5-11.snb.uni.sse

Core | IPC  | Instructions | Cycles  |  Local DRAM accesses | Remote DRAM Accesses
   0   0.51        743 K     1468 K       435                 589                
   1   0.38         13 M       34 M      1864                1830                
   2   0.26       3626 K       13 M       679                 438                
   3   0.33        255 K      785 K       137                  55                
   4   0.40        469 K     1178 K       210                  95                
   5   0.33        250 K      756 K        94                  15                
   6   0.39        372 K      959 K       140                 220                
   7   0.42        540 K     1279 K       420                 150                
   8   0.23        840 K     3657 K      7800                  14 K              
   9   0.37       2352 K     6274 K        11 K                18 K              
  10   0.39        732 K     1860 K      3610                4555                
  11   0.25       2802 M       11 G        16 K               265 M              
  12   0.49        311 K      631 K      1280                 572                
  13   0.36        268 K      741 K      1825                1971                
  14   0.33        338 K     1022 K      1785                2767                
  15   0.25       1938 K     7707 K      6942                4008                
-------------------------------------------------------------------------------------------------------------------
   *   0.25       2828 M       11 G        55 K               266 M              

In both cases the DRAM accesses were attributed to the core that the STREAM process was bound to, and in each case the result was within 0.5% of the expected value of 265 million cache lines.

Jeremie_Lagraviere

Wow, this is very interesting :) Thanks a lot!

Just to be sure:

Is it correct to say that Local DRAM accesses and Remote DRAM accesses are expressed as a number of cache lines? Or is there a more accurate way to interpret these values?

 

McCalpinJohn
Honored Contributor III

From the STREAM comparison, it is clear that these "accesses" include both read and write transactions, measured in 64 Byte cache lines. 
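
As a quick worked check against the first STREAM run above: 264 million accesses * 64 Bytes per line ≈ 1.69e10 Bytes, i.e. about 16.9 GB of DRAM traffic, which matches the 1.696e10 Bytes estimated from the STREAM source. Dividing that byte count by the wall-clock time of the run gives the average DRAM bandwidth.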

The counts presented above are clearly summed across all the DRAM channels on each chip. 

Other tools may provide different aggregations of the data.  For example, my tools make no attempt to assign traffic to cores, but they present reads and writes separately, and provide data for each of the DRAM channels on each chip separately.   I don't know if any of the PCM tools provide similar alternate presentations of the data.

Spivoler
Beginner

John McCalpin wrote:

From the STREAM comparison, it is clear that these "accesses" include both read and write transactions, measured in 64 Byte cache lines. 

The counts presented above are clearly summed across all the DRAM channels on each chip. 

Other tools may provide different aggregations of the data.  For example, my tools make no attempt to assign traffic to cores, but they present reads and writes separately, and provide data for each of the DRAM channels on each chip separately.   I don't know if any of the PCM tools provide similar alternate presentations of the data.

Hi Dr. McCalpin,

I am also using these two tools (pcm-numa and pcm-memory). When I test them with STREAM, I find that they report different total memory bandwidths. The way I calculate the total memory bandwidth is:

using pcm-numa: (total local DRAM accesses + total remote DRAM accesses) * 64 / time

using pcm-memory: the number reported as "System Memory Throughput (MB/s)"

These two numbers are not even close to each other. Do you know why? Thanks.

 

McCalpinJohn
Honored Contributor III

It is hard to comment without knowing what sorts of numbers you are seeing....

STREAM reports bandwidth computed with internal timers, pcm-memory reports bandwidth computed with whole-program timers, and pcm-numa reports local and remote memory accesses.  

Several things have to be done correctly for STREAM results to be accurate with respect to the memory traffic reported by pcm-numa or the memory bandwidth reported by pcm-memory:

  • STREAM must be compiled with a sufficiently large array size -- each array must be much larger than the sum of the L3 caches that get used in the run.  I recommend adding "-DSTREAM_ARRAY_SIZE=80000000" to the compile line to get 600 MiB per array.  This should be enough on most systems, without running into 2 GiB addressing limits or long run-times.
  • STREAM should be compiled with streaming stores enabled.  This will be the default with the Intel compilers in most cases, but can be helped with the compilation option "-opt-streaming-stores always".
    • Without streaming stores, the actual data motion to/from memory is 50% higher than assumed by STREAM for the Copy and Scale kernels and 33% higher than assumed by STREAM for the Add and Triad kernels.
  • STREAM must be run with thread binding on NUMA systems.  With the Intel compilers, this is controlled by the KMP_AFFINITY environment variable. 
    • Without thread binding, the OS will move threads away from their data, and may not assign threads to all of the physical cores.   When either of these things happen, STREAM will waste a lot of time in OpenMP barriers, waiting for slow threads to catch up.
    • The specific setting of KMP_AFFINITY depends on how many sockets are in the system, how many sockets you want to use for the test, and whether or not HyperThreading is enabled.
      • The easiest case is using all the cores, e.g., on a 32-core system:
        • export KMP_AFFINITY="verbose,compact"
        • export OMP_NUM_THREADS=32
      • If HyperThreading is enabled, then a system with 32 "logical processors" has only 16 physical cores, and the results will be slightly better with:
        • export KMP_AFFINITY="verbose,scatter"
        • export OMP_NUM_THREADS=16
    • Some tests of non-local access are also possible by combining KMP_AFFINITY and "numactl".
      • E.g., assume 2 sockets with 8 cores/socket and HyperThreading enabled
        • export KMP_AFFINITY="verbose,compact,1"
        • export OMP_NUM_THREADS=8
        • numactl --membind=1 ./stream
  • With the default value of NTIMES=10, STREAM spends a fair amount of time in scalar code (array initialization plus results validation), so the bandwidth reported by STREAM will differ from the bandwidth computed for the whole program. 
    • Adding "-DNTIMES=100" to the compile line will reduce the relative amount of scalar code and should bring the values reported by STREAM closer to the values reported by pcm-memory.
  • STREAM reports the best results for each test case, so if the machine is busy, the bandwidths reported by STREAM may be quite a bit faster than the whole-program average bandwidths computed by pcm-memory.
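
Putting the pieces above together, a typical sequence looks roughly like the following. Treat it as a sketch rather than something to copy verbatim: the exact flag spellings depend on the compiler version (newer Intel compilers use -qopenmp and -qopt-streaming-stores=always), and the thread count should match the number of physical cores you intend to use.

    icc -O3 -openmp -opt-streaming-stores always -DSTREAM_ARRAY_SIZE=80000000 -DNTIMES=100 stream.c -o stream
    export KMP_AFFINITY="verbose,compact"
    export OMP_NUM_THREADS=32
    ./pcm-memory.x -- ./stream     # per-channel and system bandwidth in MB/s
    ./pcm-numa.x -- ./stream       # local and remote DRAM access counts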
Rob_E_
Beginner

Another question regarding pcm-numa, 

As you know, pcm-numa reports per-core "DRAM" traffic for both local and remote DRAM. For a "remote DRAM access", would this also count cases where the needed data is in the remote L3 cache? In general, I assume that QPI and NUMA make the combined local and remote L3 caches appear as a single unified L3 cache. Is that correct? Likewise, they make the combined local and remote DRAM appear as a single unified DRAM. So we count local L3 and DRAM accesses as being different, yet there is no count for remote L3 cache accesses. Does this imply that a core cannot access a remote L3 directly without going to remote DRAM? Or does it mean that remote L3 cache accesses are simply not counted, or that they are bundled into the remote DRAM access count? (I hope this question is not too rambling.)

Thanks,

Rob

McCalpinJohn
Honored Contributor III

I don't know how pcm-numa is implemented, but from the output it looks like it only reports DRAM accesses.

How this relates to cases that may hit in a remote L3 is more complex, and depends on both unpublished implementation details and the specifics of the timing of the operations (including queuing delays).   It also depends on how the data is being gathered -- in the example below, I suggest how a counter in the DRAM controller would be expected to give different results than the OFFCORE_RESPONSE counter configured to count local and remote DRAM accesses.

Consider a load (or store) that misses in the local L1, L2, and L3, and whose address is mapped to a memory controller on the same chip as the requesting core:

  • Local DRAM access counts from the memory controller may include "speculative" loads of the cache line that are issued in parallel with the snoop request to the other (NUMA) nodes.  If the data is found "dirty" in another node, the local DRAM access will often complete (and be counted), but the data from local memory will be discarded and the data from the remote cache will be used instead.   Depending on the protocol and the timing of the transactions, it is possible that the local DRAM read will still be in a queue awaiting execution when the remote L3 snoop returns a "HitM" message.  In this case the local DRAM read might be cancelled (and not counted).    
  • Local DRAM access counts from the OFFCORE_RESPONSE counter should not include such speculative accesses, since this counter reports where the data was actually found - not whether the memory controller issued a speculative read on the line.

My guess is that pcm-numa reads the memory controller counters, but I have not looked at it in detail -- I find that implementing the tools myself is the only way that I have a hope of really understanding what is being counted.
