Community support for Analyzers (Intel VTune™ Profiler, Intel Advisor, Intel Inspector)

Performance Counters: Difference between L1D.REPL and L1D_CACHE_LD.I_STATE on Nehalem Core i7

I am running a benchmark on a data processing system that I have developed.
The setup is:
- one application, running only one process on one core on an Intel box (core i7).
- the process reads a lot of data and processes it
- it does not modify the data
I am interested in looking at the amount of data cache lines that are not in L1D and need to be fetched.
Using OProfile, I have looked at the following two performance counters: L1D.REPL and L1D_CACHE_LD.I_STATE, each with a sample interval (count) of 10000.
According to the documentation that I have looked at, these are described as follows:
  • Counted L1D events (Counts the number of lines brought from/to the L1 data cache.) with a unit mask of 0x01 (repl Counts the number of lines brought into the L1 data cache) count 10000
  • Counted L1D_CACHE_LD events (Counts L1 data cache read requests.) with a unit mask of 0x01 (i_state Counts L1 data cache read requests where the cache line to be loaded is in the I (invalid) state, i.e. the read request missed the cache) count 10000
As far as I understand, reading an i_state data cache line will lead to a cache line being brought into the L1D cache, thus generating a repl event.
Thus, I would expect that the two counters show similar values.
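The expectation above can be checked against a toy cache model: with a cold cache and no external invalidations, every load whose line is in the I (absent/invalid) state triggers exactly one line fill, so the two counts come out equal. This is a minimal sketch; the cache geometry and the direct-mapped policy are illustrative assumptions, not Nehalem's actual L1D organization.

```python
# Toy model of the expectation: one I-state load miss == one line fill (repl).
# Geometry below is hypothetical (direct-mapped, 512 sets * 64 B = 32 KiB).

LINE = 64          # cache line size in bytes
NUM_SETS = 512     # hypothetical direct-mapped L1D

def simulate(addresses):
    cache = {}            # set index -> tag currently resident
    i_state_loads = 0     # loads whose line is not present (I state)
    repl_events = 0       # lines brought into the cache
    for addr in addresses:
        line = addr // LINE
        idx, tag = line % NUM_SETS, line // NUM_SETS
        if cache.get(idx) != tag:
            i_state_loads += 1   # line absent/invalid -> load miss
            repl_events += 1     # ...and it is filled: one repl per miss
            cache[idx] = tag
    return i_state_loads, repl_events

# Sequentially touch 1 MiB of data, one access per cache line.
misses, fills = simulate(range(0, 1 << 20, LINE))
```

In this idealized model the two counters agree exactly, which is why the large measured discrepancy is surprising.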
Interestingly enough, they differ quite a lot, as can be seen from the output of opreport below (the two rows correspond to the two main parts of my application):
             L1D.REPL               L1D_CACHE_LD.I_STATE
         samples        %           samples        %
          719367    54.1714         1919577    45.3786
          603784    45.4675         2295587    54.2674
Roughly, there is an overall difference of 3 million between the sums of the two counters: L1D.REPL totals about 1.3M samples, while L1D_CACHE_LD.I_STATE totals about 4.2M.
Could someone please explain why this is happening, or (even better) how I am misinterpreting the purpose of the two performance counters?
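As a sanity check on the arithmetic, the sample counts from the opreport output above can be totaled directly:

```python
# Per-event sample totals taken from the opreport output above.
repl_samples    = [719367, 603784]       # L1D.REPL rows
i_state_samples = [1919577, 2295587]     # L1D_CACHE_LD.I_STATE rows

repl_total    = sum(repl_samples)        # total L1D.REPL samples
i_state_total = sum(i_state_samples)     # total L1D_CACHE_LD.I_STATE samples
gap = i_state_total - repl_total         # the discrepancy in question
```

Note that these are sample counts; with a sampling interval of 10000, each sample stands for roughly 10000 raw events.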
Hi Tudor,

I am not sure why you care about L1D misses; the penalty of a single L1D miss is only an extra 4-8 cycles. Most developers focus on L2 misses and LLC misses.

A count of L1D misses can be obtained by summing all of the MEM_LOAD_RETIRED events except MEM_LOAD_RETIRED.L1D_HIT.
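That summation can be sketched as follows. The sub-event names follow the Nehalem MEM_LOAD_RETIRED umask breakdown as described in Intel's documentation; the counts themselves are made-up placeholders, not measured values:

```python
# Hypothetical per-sub-event counts (placeholders, not measurements).
mem_load_retired = {
    "L1D_HIT":                9_000_000,  # load hit in L1D -> not a miss
    "HIT_LFB":                  120_000,  # hit an in-flight line fill buffer
    "L2_HIT":                   300_000,
    "LLC_UNSHARED_HIT":          40_000,
    "OTHER_CORE_L2_HIT_HITM":     5_000,
    "LLC_MISS":                  15_000,
}

# L1D misses = every retired load except those that hit in L1D itself.
l1d_misses = sum(count for name, count in mem_load_retired.items()
                 if name != "L1D_HIT")
```

Because MEM_LOAD_RETIRED events count at retirement, this sum avoids the speculative overcounting that affects the L1D_CACHE_LD events.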


Please read this article, written by Dr. David Levinthal.

Hope it helps.

Regards, Peter

Hi Peter,
Thank you very much for your reply.
(1) Due to the way I am processing the data, I am aiming for very good locality in the L1 cache.
For this reason I need to be able to measure L1D cache misses rather than L2 misses.
Even if each L1 cache miss costs only a few cycles, these can add up to a large total cost.
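A back-of-envelope calculation shows why a few cycles per miss can still matter. The sample total comes from the opreport output earlier in the thread; the per-miss penalty and the clock frequency (a 2.66 GHz Core i7) are assumed figures for illustration only:

```python
# Assumed figures for illustration; only `samples` comes from the thread.
samples = 4_215_164          # total L1D_CACHE_LD.I_STATE samples (opreport)
events_per_sample = 10_000   # the sampling interval used with OProfile
penalty_cycles = 6           # mid-range of the quoted 4-8 cycle penalty
clock_hz = 2.66e9            # hypothetical 2.66 GHz Core i7

total_misses = samples * events_per_sample
wasted_cycles = total_misses * penalty_cycles
seconds = wasted_cycles / clock_hz   # rough CPU time lost to L1D misses
```

Even at only 6 cycles per miss, tens of billions of misses translate into minutes of stall time, so the questioner's concern is reasonable.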
(2) The article you pointed out is indeed great! Actually it states clearly that
NOTE: many of these events are known to overcount (l1d_cache_ld, l1d_cache_lock) so
they can only be used for qualitative analysis.
This would explain why I see such a huge difference between the L1D.REPL and L1D_CACHE_LD.I_STATE counters.
At least I hope that this is the cause of the difference between the values reported by the two counters.
If I am interpreting their meaning in a wrong way (i.e., they *should* report the same value), please correct me.

Hi Tudor,

Thanks for explaining your requirements!

I agree that many of these events overlap, but the user should select them carefully.

In my view:
L1D.REPL relates to L1D cache-line replacement, e.g. lines flushed on a page fault, after which the TLB translation reloads the data into L1D.
L1D_CACHE_LD.I_STATE counts all L1D load misses, which is what you want.

An L1D cache miss does not by itself imply a page fault.

Regards, Peter
