Max_Rafiandy
Beginner

How to count L1 cache miss/hit on Intel Haswell 4790?

Jump to solution

With Intel PCM I can count L2 and L3 cache misses and hits; however, it cannot count L1 cache events. How can I count L1 events on the Haswell 4790?

I have read about this topic at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring... but I was confused by the discussion there.

1 Solution
McCalpinJohn
Black Belt

It looks like we only have a partial set of results, so it is hard to say what is going on....

If these performance counter events are working correctly, there are many combinations of counters that should match.   Examples include:

  • MEM_UOPS_RETIRED.ALL_LOADS = MEM_LOAD_UOPS_RETIRED.L1_HIT + MEM_LOAD_UOPS_RETIRED.HIT_LFB + MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_RETIRED.L3_MISS
  • MEM_LOAD_UOPS_RETIRED.L1_MISS = MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L2_MISS
  • MEM_LOAD_UOPS_RETIRED.L2_MISS = MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_RETIRED.L3_MISS

These could be combined, for example:

  • MEM_LOAD_UOPS_RETIRED.L1_MISS = MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_RETIRED.L3_MISS

If the combinations that should match actually do match, then you will have more confidence that these events are correct, and you may choose to use the equations to derive some of the values while measuring others.

There have been, unfortunately, cases for which the counts matched the equations above, but the results were still wrong (e.g., these same counters on Sandy Bridge processors).   The results I have reviewed so far on Haswell look good, but I have not run a full set of controlled tests to be confident in the accuracy of the counts.


11 Replies
McCalpinJohn
Black Belt

It looks like PCM has some level of support for custom core event configuration.  There is a fair amount of documentation in the PCM "cpucounters.h" file and in the "cpucounters.cpp" file in the main directory of the PCM distribution.  I have not tried to modify PCM in this way, so I don't know whether this works as you might expect....

Counting hits and misses at the L1 data cache requires careful attention to the details of the events and how instructions, cache transactions, and performance counter events interact.

  • For loads, the performance counter event MEM_LOAD_UOPS_RETIRED is an obvious choice.  
    • This only counts load instructions, not store instructions and not hardware prefetches.  
    • It might count software prefetch loads -- this is not clear from the documentation.
    • With respect to the L1 cache, you will want to look at three categories:
      • MEM_LOAD_UOPS_RETIRED.L1_HIT (Event 0xD1, Umask 0x01) -- counts every load that hits in the L1.
      • MEM_LOAD_UOPS_RETIRED.L1_MISS (Event 0xD1, Umask 0x08) -- counts loads that miss in the L1, but only if there is not already an outstanding miss to that same cache line being tracked in the LFB (Line Fill Buffer).
      • MEM_LOAD_UOPS_RETIRED.HIT_LFB (Event 0xD1, Umask 0x40) -- counts loads that miss in the L1, but for which there is already an outstanding request for the same cache line in the LFB.
    • The relationship between these three counts depends on the timing of the loads as well as the number of loads per cache line (which is a function of the number of bits being loaded by each load operation).
    • The sum of the three events above should match MEM_UOPS_RETIRED.ALL_LOADS (Event 0xD0, Umask 0x81).  Sometimes there are bugs in the performance counters, so it is a good idea to cross-check events whenever possible.
  • An alternate approach is to count:
    • All Loads: MEM_UOPS_RETIRED.ALL_LOADS (Event 0xD0, Umask 0x81)
    • All Stores: MEM_UOPS_RETIRED.ALL_STORES (Event 0xD0, Umask 0x82)
    • All cache lines transferred into the L1 cache: L1D.REPLACEMENT (Event 0x51, Umask 0x01)
    • The difficulty with this approach is that you may not know how many loads there are supposed to be per cache line and/or how many stores there are supposed to be per cache line.   The number changes depending on the variable sizes and (if applicable) details of how the compiler vectorized the code.  E.g., for 32-bit variables, there can be anywhere between 16 loads per cache line (if loads are 32-bits each) and 2 loads per cache line (if loads are 256-bit vector operations).
  • For stores, there is no event directly analogous to MEM_LOAD_UOPS_RETIRED, but there are plenty of other events that provide information.
    • The total number of store operations is given by MEM_UOPS_RETIRED.ALL_STORES (Event 0xD0, Umask 0x82)
    • You can subtract the L1 cache refills due to load misses from the total to get an estimate of the L1 cache refills due to store misses using L1D.REPLACEMENT (Event 0x51, Umask 0x01) minus MEM_LOAD_UOPS_RETIRED.L1_MISS (Event 0xD1, Umask 0x08)
    • You can also get the number of cache lines with stores that missed in the L1 cache from L2_RQSTS.ALL_RFO (Event 0x24, Umask 0xE2).  Note that multiple store instructions can map to the same cache line, but there is no explicit counter for these extra stores that is analogous to the MEM_LOAD_UOPS_RETIRED.HIT_LFB (Event 0xD1, Umask 0x40) event for loads.   This same event can also be counted using L2_TRANS.RFO (Event 0xF0, Umask 0x02).  The L2_RQSTS.* and L2_TRANS.* events should be similar, but one might include retried transactions.  If that is the case, then the smaller value should be a good measurement of the traffic, while the larger value provides indirect information about how busy the L2 cache is.

Looking through Section 19.4 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384, revision 057) should suggest a number of additional approaches for looking at L1 misses.  The best choice depends on whether you need to track reads and writes separately, whether you know the sizes of the read and write operations, whether the code has any streaming stores, whether you have the time and expertise to generate test cases to validate or understand the performance counter results, etc....

Roman_D_Intel
Employee

Hi,

The latest Intel PCM version has a utility to query the events supported by your processor (pmu-query.py). You can monitor the events you need using the pcm-core utility.

Thanks,

Roman

Max_Rafiandy
Beginner

Hi, Roman. Hi, John. Thanks for your best response.

I have been carrying out some trials, counting MEM_LOAD_UOPS_RETIRED.L1_HIT (Event 0xD1, Umask 0x01) and MEM_LOAD_UOPS_RETIRED.L1_MISS (Event 0xD1, Umask 0x08) with the build_event method from pcm-core.cpp. Here is my result:

##################################################################
SUMMARY           : 
Number Threads    : 8
Serial            : 0.09s
Parallel          : 0.05s
Speedup           : 1.68

Core | IPC | Instructions  | Cycles  | Event0  | Event1  | Event2  | Event3       L2 Hit    L2 Miss   L3 Hit    L3 Miss
   0   1.33        255 M      192 M        99 M     524 K       0         0         0        99 M     524 K      99 M    
   1   0.86        131 M      152 M        49 M     403 K       0         0         0        49 M     403 K      49 M    
   2   0.84        137 M      164 M        50 M     504 K       0         0         0        51 M     504 K      50 M    
   3   0.98        130 M      133 M        49 M     399 K       0         0         0        49 M     399 K      49 M    
   4   0.87        130 M      150 M        49 M     401 K       0         0         0        49 M     401 K      49 M    
   5   0.83        134 M      162 M        49 M     648 K       0         0         0        50 M     648 K      49 M    
   6   0.85        130 M      154 M        48 M     423 K       0         0         0        49 M     423 K      48 M    
   7   1.62        255 M      158 M       100 M     487 K       0         0         0       100 M     487 K     100 M    
-------------------------------------------------------------------------------------------------------------------
   *   1.03       1306 M     1269 M       496 M    3793 K       0         0         0       500 M    3793 K     496 M 
##################################################################

I'm confused about this result. Event0 was MEM_LOAD_UOPS_RETIRED.L1_HIT, and Event1 was MEM_LOAD_UOPS_RETIRED.L1_MISS. The L2 and L3 hits/misses were counted by calling getL2CacheHits(), getL2CacheMisses(), getL3CacheHits(), and getL3CacheMisses() from cpucounters.h. Why do the L1 load hits equal the L2 misses, and the L1 load misses equal the L3 hits?

Here is the build_event method I copied from pcm-core.cpp into my code:

void build_event(const char * argv, EventSelectRegister *reg, int idx)
{
    char *token, *subtoken, *saveptr1, *saveptr2;
    char name[EVENT_SIZE], *str1, *str2;
    int j, tmp;
    uint64 tmp2;
    reg->value = 0;
    reg->fields.usr = 1;
    reg->fields.os = 1;
    reg->fields.enable = 1;

    memset(name,0,EVENT_SIZE);
    strncpy(name,argv,EVENT_SIZE-1);

    for (j = 1, str1 = name; ; j++, str1 = NULL)
    {
        token = strtok_r(str1, "/", &saveptr1);
        if (token == NULL)
            break;
        if(strncmp(token,"cpu",3) == 0)
            continue;

        for (str2 = token; ; str2 = NULL)
        {
            tmp = -1;
            subtoken = strtok_r(str2, ",", &saveptr2);
            if (subtoken == NULL)
                break;
            if(sscanf(subtoken,"event=%i",&tmp) == 1)
                reg->fields.event_select = tmp;
            else if(sscanf(subtoken,"umask=%i",&tmp) == 1)
                reg->fields.umask = tmp;
        }
    }
    events[idx].value = reg->value;
}

And call it by:

build_event((char *)"umask=0x01,event=0xd1",&regs[0],0);  // L1 LOAD HIT
build_event((char *)"umask=0x08,event=0xd1",&regs[1],1);  // L1 LOAD MISS

 

Best Regards,

Max Rafiandy

Max_Rafiandy
Beginner

Hi, John. It works. Thank you very much.

cianfa72
Beginner

With respect to the L1 cache, you will want to look at three categories:

  • MEM_LOAD_UOPS_RETIRED.L1_HIT (Event 0xD1, Umask 0x01) -- counts every load that hits in the L1.
  • MEM_LOAD_UOPS_RETIRED.L1_MISS (Event 0xD1, Umask 0x08) -- counts loads that miss in the L1, but only if there is not already an outstanding miss to that same cache line being tracked in the LFB (Line Fill Buffer).
  • MEM_LOAD_UOPS_RETIRED.HIT_LFB (Event 0xD1, Umask 0x40) -- counts loads that miss in the L1, but for which there is already an outstanding request for the same cache line in the LFB.

As far as I understand from reading this forum, for a normal (allocating) cache miss (due to demand or HW prefetch loads touching one or more bytes of a 64 B cache line), a line fill buffer (LFB) is allocated to track the outstanding miss during the time needed to load the line from the memory hierarchy (L2 cache, LLC, or DRAM). We know that for a normal (allocating) cache miss, a line has to be allocated in the L1 Dcache to store the incoming cache line.

Now my doubt is: when exactly is the cache line allocated in the L1 DCache, possibly evicting an existing line to make room for the new one? I guess it will be allocated only when the load of the new cache line completes, in order to maximize the probability that other concurrent loads can hit existing cache lines.


Does that make sense? Thanks

McCalpinJohn
Black Belt

I don't think there is any way to answer that question without detailed knowledge of the implementation.   Most questions about "when" turn out to be much trickier than you might expect once you start looking at actual implementation details.

Generally the "victim" line is either marked invalid (if clean) or written back to the next level of the cache hierarchy as soon as possible.  This allows the eviction transaction to be completed before the new cache line arrives and needs to get written to the same set of SRAM bits that the victim used.   Transactions that require only writes to the tags (e.g., invalidate) and transactions that require access to both the tag and data arrays (e.g., dirty writebacks) can have fairly different implementation timing.

Invalidating a victim line or writing a dirty victim line out to the next level of the memory hierarchy does not need to inhibit concurrency.  It clearly blocks access to the victim line, but all of the other rows and sets of the cache remain (theoretically) accessible.   There will often be conflicts for the (very small number of) SRAM ports, but the latencies of transactions to other rows and sets can be overlapped with the eviction.

cianfa72
Beginner

Thanks John, regarding this other point:

An alternate approach is to count:

  • All Loads: MEM_UOPS_RETIRED.ALL_LOADS (Event 0xD0, Umask 0x81)
  • All Stores: MEM_UOPS_RETIRED.ALL_STORES (Event 0xD0, Umask 0x82)
  • All cache lines transferred into the L1 cache: L1D.REPLACEMENT (Event 0x51, Umask 0x01)
  • The difficulty with this approach is that you may not know how many loads there are supposed to be per cache line and/or how many stores there are supposed to be per cache line.   The number changes depending on the variable sizes and (if applicable) details of how the compiler vectorized the code.  E.g., for 32-bit variables, there can be anywhere between 16 loads per cache line (if loads are 32-bits each) and 2 loads per cache line (if loads are 256-bit vector operations).

Is there a way, or some "trick", to work out the number of load/store occurrences per cache line?

McCalpinJohn
Black Belt

I don't know of any way to determine the distribution of sizes of loads other than looking at the assembly code and counting.

This would be an appropriate target for compiler instrumentation or perhaps instrumentation via binary re-writing.

cianfa72
Beginner

I don't know of any way to determine the distribution of sizes of loads other than looking at the assembly code and counting.

Just to be sure I explained myself properly: I was referring to a count of the number of times loads or stores hit or miss a given cache line, not to the load/store size distribution (i.e., statistics on the number of bytes of each load/store)....

TimP
Black Belt

John McCalpin wrote:

I don't know of any way to determine the distribution of sizes of loads other than looking at the assembly code and counting.

This would be an appropriate target for compiler instrumentation or perhaps instrumentation via binary re-writing.

The opt-report4 compiler option displays some of the relevant information at compilation time.  Also relevant would be the relative number of executions of the various remainder branches.

Advisor flags the case where significant time is spent in remainder loops, but you still need to view assembly to see which load instructions are involved.
