Measuring D1 and L2 cache miss stall cycles

kkrik555 · 01-16-2008

I am interested in breaking execution time down into cache-miss stalls and resource stalls on a Core 2 E6300. I use the following equations:
D1-miss cycles = L1D_REPL * 8 / CPU_CLK_UNHALTED
L2-miss cycles = L2_LINES_IN * 60 / CPU_CLK_UNHALTED
Resource stall cycles = RESOURCE_STALLS / CPU_CLK_UNHALTED
These assume 8 cycles per D1 miss (an overestimate, due to OoO execution) and 60 cycles per L2 miss (an underestimate; 200 cycles has been suggested), with all events sampled at the same frequency.
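As a minimal sketch of the arithmetic in these three equations, the Python below computes each category as a fraction of unhalted cycles. The function name, the dictionary keys, and the counter values are made up for illustration; only the event names and the 8- and 60-cycle penalties come from the post.

```python
def stall_breakdown(l1d_repl, l2_lines_in, resource_stalls, cpu_clk_unhalted,
                    d1_penalty=8, l2_penalty=60):
    """Estimate each category as a fraction of unhalted core cycles.

    The penalties are the fixed per-miss costs assumed in the post
    (8 cycles per D1 miss, 60 cycles per L2 miss), not measured latencies.
    """
    if cpu_clk_unhalted <= 0:
        raise ValueError("cpu_clk_unhalted must be positive")
    return {
        "d1_miss": l1d_repl * d1_penalty / cpu_clk_unhalted,
        "l2_miss": l2_lines_in * l2_penalty / cpu_clk_unhalted,
        "resource_stalls": resource_stalls / cpu_clk_unhalted,
    }


# Hypothetical counter totals, chosen only to illustrate the calculation.
example = stall_breakdown(
    l1d_repl=50_000_000,
    l2_lines_in=5_000_000,
    resource_stalls=200_000_000,
    cpu_clk_unhalted=1_000_000_000,
)
for name, fraction in example.items():
    print(f"{name}: {fraction:.1%}")
```

Because each miss is charged a fixed penalty regardless of overlap with other work, nothing in this arithmetic stops the two cache-miss fractions from summing to more than the resource-stall fraction.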

In some cases, the sum of the estimated D1-miss and L2-miss cycles is greater than the resource stall cycles (90% versus 20%), which cannot be right. Correct me if I am wrong, but resource stalls contain the actual stall cycles for the pipeline, so they should be a superset of the cache-miss stall cycles. Out-of-order execution and non-blocking caches allow cache lines to be fetched in parallel, so the stall cycles actually caused by cache misses should be fewer than my estimates, although this mostly affects L1 misses. So, in any case, resource stalls should be higher than cache-miss stalls. Note that I run single-threaded applications on data-intensive workloads (the TPC-H database benchmark).

Can you help me find the problem in my methodology? Thanks.

Thomas_W_Intel · 03-14-2008

Hello,

RESOURCE_STALLS counts only resource-related stalls, e.g. recovery from a mispredicted branch, or the reorder buffer or the reservation station being full.

I think the event you are looking for is UOPS_DISPATCHED.NONE.
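One possible way to use such a count, sketched in Python under stated assumptions: treat the number of cycles in which no uop was dispatched as the measured stall total, normalize it by CPU_CLK_UNHALTED as in the original equations, and check whether the penalty-based cache-miss estimate exceeds it. The function name, parameter names, and numbers are hypothetical, and reading the event as a per-cycle count is an assumption rather than something stated in this thread.

```python
def compare_with_no_dispatch(estimated_miss_fraction, no_dispatch_cycles,
                             cpu_clk_unhalted):
    """Compare an estimated cache-miss stall fraction with a measured one.

    no_dispatch_cycles is assumed to be the total reported for an event
    such as UOPS_DISPATCHED.NONE, read as cycles with no uop dispatched.
    """
    if cpu_clk_unhalted <= 0:
        raise ValueError("cpu_clk_unhalted must be positive")
    measured = no_dispatch_cycles / cpu_clk_unhalted
    return {
        "measured_stall_fraction": measured,
        # True when the fixed-penalty estimate is larger than the stalls
        # actually observed, i.e. the penalties overcount.
        "estimate_exceeds_measured": estimated_miss_fraction > measured,
    }


# Hypothetical values: a 90% estimate against 450M no-dispatch cycles
# out of 1G unhalted cycles.
check = compare_with_no_dispatch(0.90, 450_000_000, 1_000_000_000)
print(check)
```

If the estimate comes out above the measured fraction, the fixed per-miss penalties are charging for latency that out-of-order execution overlapped with other work.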

Best regards

Thomas