I'm trying to use performance counters to quantify the average number of outstanding L1 and L2 misses, that is, how many of the available miss status handling registers (MSHRs; Intel seems to call these "fill buffers", at least for L1) are in use on average, and to do this separately for L1, L2, and any other levels I can measure. I'd like to compare against the kind of analysis in the paper "Cimple: Instruction and Memory Level Parallelism", in particular the "memory-level parallelism" metric described in Section 8.3.2 as "average outstanding L2 misses... (including) speculative and prefetch requests". My code does software prefetching with PREFETCHNTA, and I'd like those prefetches to be counted whenever they occupy an MSHR/fill buffer.
For L1, I think I know how to calculate this, but the results I'm getting look a little strange, so I'm wondering if I have made a mistake. L1D_PEND_MISS (0x48 with no mask) appears to be the sum of pending L1 misses at each cycle, over however many cycles we're measuring. Is that right? If so, would dividing this counter by the number of elapsed cycles give me the average number of L1 fill buffers in use? I'm guessing that I should use UNHALTED_CORE_CYCLES (rather than UNHALTED_REFERENCE_CYCLES) as the denominator. However, when I compute this, I get much lower numbers than I'd expect: under 2, even though I believe Broadwell has 10 L1 MSHRs. Perhaps I need to divide by "number of cycles during which there was a pending miss" rather than "total number of cycles" to get an accurate number?
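To make the two candidate denominators concrete, here is a minimal sketch of the arithmetic I have in mind. The counter values are made up for illustration (they are not measurements), and `busy_cycles` is a hypothetical stand-in for a "cycles with at least one pending L1 miss" count, assuming such a count can be obtained.

```python
def average_occupancy(pending_sum, cycles):
    """Divide a per-cycle sum of pending misses by a cycle count."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return pending_sum / cycles

# Hypothetical readings, chosen only to illustrate the calculation.
l1d_pend_miss = 18_000_000  # sum over all cycles of pending L1 misses
core_cycles = 10_000_000    # UNHALTED_CORE_CYCLES over the same interval
busy_cycles = 4_000_000     # assumed: cycles with >= 1 pending L1 miss

# Average over every cycle, including cycles with nothing outstanding.
avg_over_all = average_occupancy(l1d_pend_miss, core_cycles)

# Average over only the cycles in which a miss was pending.
avg_when_busy = average_occupancy(l1d_pend_miss, busy_cycles)

print(f"average over all cycles:  {avg_over_all:.2f}")
print(f"average over busy cycles: {avg_when_busy:.2f}")
```

With these made-up numbers the first ratio is low simply because most cycles have no miss pending, which may be what I'm seeing in my real measurements.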
For L2, it's much less clear to me how to measure this. I can't find an equivalent count of pending misses. There is L2_RQSTS:LD_MISS (0x24 with mask 0x02), measuring total misses, which would allow me to compute average L2 misses per cycle, but it's not clear to me that that is the same quantity as "average outstanding L2 misses".
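As far as I understand it, misses per cycle is a rate, while "outstanding misses" is an occupancy, and Little's law relates the two through the average time each miss stays outstanding. The sketch below shows that relationship with made-up numbers. The average miss latency is an assumption here: I don't have a counter that gives it to me, which is exactly why this doesn't solve my problem.

```python
def outstanding_from_rate(misses, cycles, avg_latency_cycles):
    """Little's law: occupancy = arrival rate * average time in system."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return (misses / cycles) * avg_latency_cycles

# Hypothetical values, for illustration only.
l2_ld_miss = 500_000       # L2_RQSTS:LD_MISS over the interval
core_cycles = 10_000_000   # UNHALTED_CORE_CYCLES over the same interval
assumed_latency = 200.0    # assumed average cycles a miss stays outstanding

misses_per_cycle = l2_ld_miss / core_cycles
estimated_outstanding = outstanding_from_rate(
    l2_ld_miss, core_cycles, assumed_latency
)

print(f"L2 load misses per cycle:        {misses_per_cycle:.3f}")
print(f"estimated outstanding L2 misses: {estimated_outstanding:.2f}")
```

So the miss count alone would only give me the outstanding-miss figure if I also knew the average latency, and it would still leave out prefetch and speculative requests that this event may not count.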
Any help with this would be much appreciated, thanks for your time! Happy to answer any clarifying questions.