Measure cache miss on Broadwell DE using PCM

Hongjun_R_ · 01-12-2017

Hello,

I am using pcm.x to monitor system cache misses on Broadwell DE, but the results confuse me.

After the system boots, pcm shows an L3 HIT ratio of around 20%, which seems low to me given an IPC of 1.5.

I have read some of the PCM source code and found that it calculates the L3 HIT ratio as:

(MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_NONE + EVENT_ID:D2-UMASK:07) /
(LONGEST_LAT_CACHE.MISS + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_NONE + EVENT_ID:D2-UMASK:07)

I could not find any description of EVENT_ID:D2-UMASK:07 in the programming manual.

So is a system-wide L3 hit ratio of 20% normal, and how can I measure it correctly?

The L2 HIT ratios of the two logical cores on one physical core (say, core 7, logical cores 14 and 15) also look strange (24% and 66%), because both run exactly the same code.

And when logical core 15 processes some task while logical core 14 does nothing, the two L2 HIT values swap.

   0           0.01   0.68     520 K    625 K    0.17    0.23
   1           0.00   0.67      40 K     48 K    0.18    0.53
   2           0.00   0.35     123 K    160 K    0.23    0.28
   3           2.81   2.81      16 K     20 K    0.17    0.59
   4           1.51   1.51     107 K    143 K    0.25    0.22
   5           1.59   1.59      24 K     28 K    0.14    0.61
   6           1.51   1.51     101 K    137 K    0.26    0.22
   7           1.54   1.54      17 K     21 K    0.16    0.64
   8           1.46   1.46     102 K    138 K    0.26    0.20
   9           1.50   1.50      17 K     21 K    0.18    0.52
   10          1.54   1.54     102 K    138 K    0.26    0.20
   11          1.58   1.58      17 K     21 K    0.17    0.58
   12          1.56   1.56     102 K    138 K    0.26    0.23
   13          1.64   1.64      17 K     21 K    0.17    0.68
   14          1.55   1.55     100 K    135 K    0.26    0.24
   15          1.58   1.58      17 K     21 K    0.17    0.66
   ----------------------------------------------------------
   SKT         1.34   1.64    1430 K   1823 K    0.22    0.30
   ----------------------------------------------------------
   TOTAL       1.34   1.64    1430 K   1823 K    0.22    0.30

THANKS.

McCalpinJohn · 01-12-2017

Event 0xD2, Umask 0x7 is the combination of three events described in Table 19-5 of Volume 3 of the Intel Architectures Software Developer's Manual. Event 0xD2 is MEM_LOAD_UOPS_L3_HIT_RETIRED. The Umask 0x7 corresponds to the logical "OR" of

Umask 0x01 XSNP_MISS
Umask 0x02 XSNP_HIT
Umask 0x04 XSNP_HITM

The event MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_NONE is Event 0xD2, Umask 0x08, so the sum of these two events corresponds to all of the L3 load hit cases.
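
As a quick sanity check on the encoding, here is a minimal Python sketch of how the three Umask bits combine into 0x07 and why adding the XSNP_NONE count does not double count anything. The helper name raw_code is hypothetical. The only layout it assumes is the architectural one for the event-select register, with the event select in bits 0-7 and the Umask in bits 8-15. Whether PCM builds its codes this way internally is not something this sketch claims.

```python
# Event select and unit masks as listed above (Event 0xD2).
EVENT_MEM_LOAD_UOPS_L3_HIT_RETIRED = 0xD2
UMASK_XSNP_MISS = 0x01
UMASK_XSNP_HIT = 0x02
UMASK_XSNP_HITM = 0x04
UMASK_XSNP_NONE = 0x08


def raw_code(event, umask):
    """Hypothetical helper: umask in bits 8-15, event select in bits 0-7."""
    return (umask << 8) | event


# Logical OR of the three snoop-related sub-events gives Umask 0x07.
snooped = UMASK_XSNP_MISS | UMASK_XSNP_HIT | UMASK_XSNP_HITM

# XSNP_NONE uses a different bit, so the two counts cover disjoint cases
# and can simply be added to get all L3 load hits.
disjoint = (snooped & UMASK_XSNP_NONE) == 0

print(f"snooped umask      = {snooped:#04x}")
print(f"disjoint from NONE = {disjoint}")
print(f"raw code for D2/07 = {raw_code(EVENT_MEM_LOAD_UOPS_L3_HIT_RETIRED, snooped):#06x}")
```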

It is important to note that Intel's definition of "MEM_LOAD_UOPS" includes only demand loads, and not hardware prefetch loads. Under normal circumstances you can expect to see a large number of L3 hits even on data that has not been loaded previously, because the hardware prefetchers are able to fetch the data into the L3 cache before the load gets there. Sometimes the hardware prefetchers will bring the data all the way into the L2 cache, in which case you will see an L2 hit and no L3 access at all, but the amount of data brought into the L2 vs the L3 is dynamically variable and not easy to understand or predict. The timing of the prefetches relative to the eventual execution of the demand loads is also difficult to understand or predict -- sometimes the prefetch will get the data into the cache before the load, sometimes not, so there is no obvious "right answer" for what that hit rate should be for data that is prefetchable.
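
To make the ratio quoted in the question concrete, here is a small Python sketch of that formula. The function name and the counts passed to it are made up purely for illustration and are not measurements from this system. Keep in mind that the two hit terms count only retired demand load uops, so the ratio says nothing about lines the hardware prefetchers brought in.

```python
def l3_hit_ratio(hits_xsnp_none, hits_xsnp_other, llc_misses):
    """L3 hit ratio as quoted in the question (hypothetical helper).

    hits_xsnp_none  -- MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_NONE (D2/08)
    hits_xsnp_other -- Event 0xD2 with Umask 0x07
    llc_misses      -- LONGEST_LAT_CACHE.MISS
    """
    hits = hits_xsnp_none + hits_xsnp_other
    total = hits + llc_misses
    # Return 0.0 when nothing was counted, to avoid dividing by zero.
    return hits / total if total else 0.0


# Illustrative counts only, not taken from the pcm output above.
example = l3_hit_ratio(hits_xsnp_none=300, hits_xsnp_other=100, llc_misses=1600)
print(f"L3 hit ratio = {example:.2f}")
```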

The anomalies with counts across logical processors sharing the same physical core may be real or may be due to bugs in the performance counters. I did not see anything in the Broadwell processor errata that suggested that the HyperThreading-based bugs previously present in Sandy Bridge and Ivy Bridge might have been carried forward into Haswell/Broadwell, but the errata documents don't contain all the known performance counter bugs. (I don't know of any such list -- I try to keep track of this myself, but don't have the funding to be organized about it....)