Beginner

Measuring average outstanding L1 and L2 misses on Broadwell (Xeon E5-2630 v4)


I'm trying to use PAPI to quantify average outstanding L1 and L2 misses, that is, how many of the available miss status handling registers (Intel seems to call these "fill buffers", at least for L1) are in use on average, and to distinguish this for L1, L2, and any other levels I can measure.  I'm attempting to compare against the kind of analysis in this paper (https://dspace.mit.edu/bitstream/handle/1721.1/125080/1807.01624.pdf?sequence=2&isAllowed=y), in particular the "memory-level parallelism" metric described in Section 8.3.2 as "average outstanding L2 misses... (including) speculative and prefetch requests".  My code does software prefetching using PREFETCHNTA, and I'd like those prefetches to be counted whenever they occupy a miss status handling register/fill buffer.

For L1, I think I know how to calculate this, but the results I'm getting look a little strange, so I'm wondering if I've made a mistake.  L1D_PEND_MISS (0x48 with no mask) (https://software.intel.com/sites/products/documentation/doclib/stdxe/2013SP1/amplifierxe/pmp/events/...) appears to be the sum of pending L1 misses at each cycle, accumulated over however many cycles we're measuring.  Is that right?  If so, would dividing this counter by the number of elapsed cycles give me the average number of L1 fill buffers in use?  I'm guessing that I should use UNHALTED_CORE_CYCLES (rather than UNHALTED_REFERENCE_CYCLES) as the denominator.  However, when I compute this, I get much lower numbers than I'd expect (under 2, despite there being, I believe, 10 L1 MSHRs on Broadwell).  Perhaps I need to divide by the number of cycles during which there was a pending miss, rather than the total number of cycles, to get an accurate number?

For L2, it's much less clear to me how to measure this.  I can't find an equivalent count of pending misses.  There is L2_RQSTS:LD_MISS (0x24 with mask 0x02), which measures total misses and would let me compute average L2 misses per cycle, but it's not clear to me that this is the same quantity as "average outstanding L2 misses".

Any help with this would be much appreciated, thanks for your time!  Happy to answer any clarifying questions.


3 Replies
New Contributor III

You can use the following formula to measure the average number of occupied fill buffer entries at the L1D when there is at least one occupied entry:

L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES

where both events have event code 0x48 and umask 0x01, but the second one also has a cmask of 0x1, which means it increments by one on each cycle in which at least one fill buffer is occupied. Both events are documented in the Intel SDM Volume 3.
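Once you have the two counter values (from PAPI, perf, or anything else), the arithmetic is just a guarded ratio. A minimal sketch, with hypothetical counter readings for illustration:

```python
def avg_outstanding(pending_sum: int, pending_cycles: int) -> float:
    """Average L1 fill-buffer occupancy while at least one entry is in use.

    pending_sum:    L1D_PEND_MISS.PENDING -- sum over all cycles of the
                    number of in-flight L1 misses in that cycle.
    pending_cycles: L1D_PEND_MISS.PENDING_CYCLES -- number of cycles with
                    at least one in-flight L1 miss (cmask=1).
    """
    if pending_cycles == 0:
        return 0.0  # no misses observed at all
    return pending_sum / pending_cycles

# Hypothetical readings: 5,000,000 buffer-cycles spread over 1,250,000
# cycles that had at least one pending miss.
print(avg_outstanding(5_000_000, 1_250_000))  # -> 4.0
```

Dividing by total elapsed cycles instead of PENDING_CYCLES (as in the original question) gives a smaller number, diluted by cycles with no miss in flight, which would explain the "under 2" readings.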

Unfortunately, on Broadwell, both of these events can only be counted on counter number 2, so they cannot be counted at the same time. If you try to count both using Linux perf, they will be multiplexed.

(The paper you linked to says the experiments were done on Haswell, which has the same constraint. The authors were either unaware of the constraint or didn't think it important to explain how they measured these metrics. Many published papers have this problem; when I see such a paper, I almost always treat its results as unreliable.)

One workaround may be possible on processors that support hyperthreading. Each sibling thread has its own "counter 2." The fill buffer entries of the L1D are mostly shared between the two threads, so you can simultaneously measure one event on one thread and the other event on the other thread. In this case, the AnyThread attribute of the event must be set to 1 so that the LFB entries allocated by any thread are considered. If you're profiling a single-threaded program, you'll need a dummy thread to run on the sibling thread just to measure the other event.

Alternatively, this constraint was removed starting with Skylake, so you can switch to a more modern processor.

You can use the following formula to measure the average number of occupied MSHR entries at the L2 cache:

OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD

These events are also documented in the manual. They have no counter constraints on Broadwell, so they can be counted together with the L1 events.
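Since both events can be collected in a single run, one convenient workflow is `perf stat -x,` and a small parser over its CSV output. A sketch, assuming the usual perf-stat CSV layout (value first, event name third) and perf's symbolic names for these events; the sample values are made up:

```python
def parse_perf_csv(text: str) -> dict:
    """Extract {event_name: count} from `perf stat -x,` output.

    Assumes the documented field order: value, unit, event, ...
    Lines whose value field isn't a plain integer (e.g. '<not counted>')
    are skipped.
    """
    counts = {}
    for line in text.strip().splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[0].isdigit():
            counts[fields[2]] = int(fields[0])
    return counts

# Hypothetical output of a perf stat run measuring both offcore events:
sample = """\
12000000,,offcore_requests_outstanding.all_data_rd,999999,100.00,,
1500000,,offcore_requests_outstanding.cycles_with_data_rd,999999,100.00,,
"""
c = parse_perf_csv(sample)
mlp = (c["offcore_requests_outstanding.all_data_rd"]
       / c["offcore_requests_outstanding.cycles_with_data_rd"])
print(mlp)  # -> 8.0, average outstanding L2 data reads while any are in flight
```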

There are some issues with the OFFCORE_REQUESTS_OUTSTANDING events:

  • According to the documentation, only cacheable requests are considered. This is generally not a major problem, but it's something to keep in mind.
  • Only read requests are considered. My understanding is that the following types of requests are accounted for: demand data reads, demand code reads, demand RFOs, prefetch data reads, prefetch code reads, and prefetch RFOs. I think L1 writebacks that miss in the L2 also allocate entries in the L2's MSHR, but they are not accounted for by these events.
  • According to the spec update documents, these events may overcount or undercount on all Broadwell processors except Xeon E5, Xeon E7, Xeon D, and Pentium D. It's not clear to me how big the error can be. They may be completely unreliable. You have to test them using microbenchmarks.

Most Haswell and Skylake processors have these same issues.

(These are all additional reasons why the results of that ambiguous paper are unreliable.)

The L1D_PEND_MISS.PENDING event count may increment by at most 10 per cycle because the size of the LFB is 10 on Broadwell. The OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD event count may increment by at most 16 per cycle because that's the size of the MSHR on Broadwell. Knowing these upper limits is useful because they bound the maximum MLP at the L1D and L2.
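These structural limits make a cheap sanity check on measured values; a ratio above the limit points at a counting problem (multiplexing artifacts, or the errata above). A small sketch:

```python
# Structural limits on Broadwell, per the discussion above:
L1_LFB_ENTRIES = 10   # L1D fill buffers (bounds L1 MLP)
L2_MSHR_ENTRIES = 16  # L2 MSHR entries (bounds L2 MLP)

def plausible(mlp: float, limit: int) -> bool:
    """An average occupancy can't exceed the structure's size;
    anything outside [0, limit] indicates a measurement error."""
    return 0.0 <= mlp <= limit

print(plausible(4.0, L1_LFB_ENTRIES))    # -> True
print(plausible(17.2, L2_MSHR_ENTRIES))  # -> False: above the 16-entry bound
```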

Beginner

Thanks @HadiBrais!  The background information you've provided is incredibly helpful, and these formulae look like they'll address my questions; I'll test them out and let you know how it goes.  If I understand correctly, the fact that I'm on a Xeon E5 should mean I don't have to worry about OFFCORE_REQUESTS_OUTSTANDING over/undercounting, right?

It will be difficult for me to get access to Skylake or newer hardware for benchmarking, so I'm considering the following approximation, and wondering whether it would be an improvement over perf multiplexing:  Repeatedly run two instances of the same code with the same input, one recording L1D_PEND_MISS:PENDING, the other recording L1D_PEND_MISS:PENDING_CYCLES.  Then compute mean(L1D_PEND_MISS.PENDING) / mean(L1D_PEND_MISS.PENDING_CYCLES), each mean taken over the runs of the corresponding instance.  If I run enough repetitions, do you think this could be a meaningful approximation?
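The proposed two-instance scheme can be sketched in a few lines; the per-run values below are made up, and the relative-spread check is one simple way to decide whether "the cross-run variance is acceptable":

```python
from statistics import mean, stdev

# Hypothetical per-run counter readings, one list per instance:
pending_runs = [5_100_000, 4_900_000, 5_000_000]        # L1D_PEND_MISS.PENDING
pending_cycles_runs = [1_260_000, 1_240_000, 1_250_000]  # ...PENDING_CYCLES

# If the runs are stable, the relative spread of each event is small,
# and the ratio of means approximates the true same-run ratio.
rel_spread = stdev(pending_runs) / mean(pending_runs)

mlp = mean(pending_runs) / mean(pending_cycles_runs)
print(round(mlp, 2))        # -> 4.0
print(rel_spread < 0.05)    # -> True: runs agree to within 5%
```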

Accepted Solution
New Contributor III
According to the specification update document (https://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-v4-spec-update.html, which actually applies to all Xeon E5 series, not just the 2600 v4 series), there is no erratum for the OFFCORE_REQUESTS_OUTSTANDING events. Still, it's worth testing them; Intel may have forgotten to add a relevant erratum to the document.

Regarding your second question: if the cross-run variance of each event over many runs is acceptable, then the method you proposed should be OK.

