PMC: L1 load UOPS cache miss rate is lower than L2 load UOPS cache miss rate

thormund
Beginner
892 Views

Hello,

 

I'm manually reading 6 PMCs in the Linux kernel (that is, I'm not using perf) on a 4th gen mobile i7. The events I'm looking at are:

- MEM_LOAD_UOPS_RETIRED.L1_HIT (evsel 0xD1, umask 0x01)
- MEM_LOAD_UOPS_RETIRED.L1_MISS (evsel 0xD1, umask 0x08)
- MEM_LOAD_UOPS_RETIRED.L2_HIT (evsel 0xD1, umask 0x02)
- MEM_LOAD_UOPS_RETIRED.L2_MISS (evsel 0xD1, umask 0x10)
- MEM_LOAD_UOPS_RETIRED.L3_HIT (evsel 0xD1, umask 0x04)
- MEM_LOAD_UOPS_RETIRED.L3_MISS (evsel 0xD1, umask 0x20)
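
As a point of reference, here is a minimal sketch of how event-select/umask pairs like the ones above are packed into an IA32_PERFEVTSELx value. The helper name `perfevtsel_encode` is hypothetical and is not the code used for these measurements. The bit positions (USR = 16, OS = 17, EN = 22) are the architectural ones; counting in both user and kernel mode is an assumption.

```c
#include <stdint.h>

/* Architectural IA32_PERFEVTSELx control bits. */
#define PERFEVTSEL_USR (1ULL << 16) /* count while running at CPL > 0 */
#define PERFEVTSEL_OS  (1ULL << 17) /* count while running at CPL 0 */
#define PERFEVTSEL_EN  (1ULL << 22) /* enable the counter */

/*
 * Hypothetical helper: pack an event select (bits 0-7) and a unit mask
 * (bits 8-15) into a PERFEVTSEL value. Assumption: count in both user
 * and kernel mode, with no edge detect, inversion, or counter mask.
 */
uint64_t perfevtsel_encode(uint8_t event, uint8_t umask)
{
    return (uint64_t)event
         | ((uint64_t)umask << 8)
         | PERFEVTSEL_USR
         | PERFEVTSEL_OS
         | PERFEVTSEL_EN;
}
```

Under these assumptions, L1_HIT (0xD1, 0x01) encodes to 0x4301D1 and L3_MISS (0xD1, 0x20) encodes to 0x4320D1.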

 

I read only 2 PMCs per kernel compilation, for example the two L1-related ones. Looking at the data I collected from my experiments, I've noticed something I don't fully understand: the cache miss rate for the L1 cache is far lower than that of the L2 cache. Depending on the application under test, I see an L1 miss rate of 0.5-7%, while the L2 miss rate is 40-60%.

I've also noticed that the L3 cache miss rate is roughly similar to that of the L2, which I was not expecting either.

I've looked at the PMC configuration several times and everything seems OK.
Do you know why I'm getting such values? I was expecting to see a much higher cache miss rate in the L1 compared to the L2, as well as a higher cache miss rate in the L2 compared to the L3, but that doesn't seem to be the case.

Any help would be appreciated, thank you in advance!

1 Solution
McCalpinJohn
Honored Contributor III
868 Views

This is only confusing if you define your ratios in a confusing way. 

If you define the "miss rate" as MEM_LOAD_UOPS_RETIRED.Lx_MISS / (MEM_LOAD_UOPS_RETIRED.Lx_HIT + MEM_LOAD_UOPS_RETIRED.Lx_MISS) for each of the three cache levels, then your results are not surprising, for several reasons:

  1. The number of MEM_LOAD_UOPS_RETIRED.L1_HIT will vary depending on the size of the data elements loaded.  
    • For the 4th-generation Core i7, you can have anywhere from 64 loads per cache line (for single-byte variables) to 2 loads per cache line (for 256-bit AVX2 vector register loads).
    • Each load (of any size) that hits in the L1 cache will increment MEM_LOAD_UOPS_RETIRED.L1_HIT.
    • The converse is not true: only one load per cache line will increment the MEM_LOAD_UOPS_RETIRED.L1_MISS counter. Any additional loads to the same cache line that are executed before the line is installed in the cache will increment MEM_LOAD_UOPS_RETIRED.HIT_LFB instead, indicating that these later loads were assigned to the same "Line Fill Buffer" that was allocated for the first miss to that cache line.
  2. It is sometimes important to distinguish between "compulsory" cache misses and "capacity" or "conflict" misses.  
    • "Compulsory" cache misses occur the first time an element is loaded from memory.  This will miss in all levels of the cache.
    • "Capacity" and/or "Conflict" misses can only occur on subsequent accesses.
    • For example, if you load some data that fits into the L1 Data Cache, then repeatedly access that data, the L1 Data Cache will have a high hit rate (because of many accesses), while the L2 and L3 caches will have very high miss rates (because the L1 Data Cache "intercepted" the loads to the addresses that those caches are holding, leaving the initial "compulsory" misses to dominate).
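
The two points above can be put into numbers with a small sketch. This is a toy model, not a measurement: the function names and the `load_counts` struct are hypothetical, and it assumes 64-byte lines, a working set that stays resident in L1 after the first pass, no hardware prefetching, and that first-pass follow-up loads to a line land in HIT_LFB (so they are counted in neither L1_HIT nor L1_MISS).

```c
#include <stdint.h>

/* Hypothetical container for the six counters discussed in the thread. */
struct load_counts {
    uint64_t l1_hit, l1_miss;
    uint64_t l2_hit, l2_miss;
    uint64_t l3_hit, l3_miss;
};

/* Miss ratio as defined above: MISS / (HIT + MISS); 0.0 if no events. */
double miss_ratio(uint64_t hits, uint64_t misses)
{
    uint64_t total = hits + misses;
    return total ? (double)misses / (double)total : 0.0;
}

/*
 * Toy model: sweep `lines` cache lines `passes` times, issuing
 * `loads_per_line` loads to each line per pass.
 * Assumptions: the data fits in L1, there is no prefetching, and only
 * the first load to each line in the first pass counts as a miss.
 */
struct load_counts model_l1_resident_sweep(uint64_t lines,
                                           uint64_t loads_per_line,
                                           uint64_t passes)
{
    struct load_counts c = {0};
    if (passes == 0)
        return c;

    /* First pass: one compulsory miss per line, at every cache level. */
    c.l1_miss = lines;
    c.l2_miss = lines;
    c.l3_miss = lines;

    /* Later passes: every load hits in L1 and never reaches L2 or L3. */
    c.l1_hit = lines * loads_per_line * (passes - 1);
    return c;
}
```

For 256 lines (16 KiB), 16 four-byte loads per line, and 100 passes, this model gives an L1 ratio of 256 / 405760 (about 0.06%), while the L2 and L3 ratios are both 100%, because the only loads those levels ever see are the compulsory misses.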

thormund
Beginner
857 Views

That's how I've defined the cache miss ratio. Thank you for the answer!
