tsalomie
Beginner
121 Views

Performance Counters: Difference between L1D.REPL and L1D_CACHE_LD.I_STATE on Nehalem Core i7

Hi,
I am running a benchmark on a data processing system that I have developed.
The setup is:
- one application, running only one process on one core on an Intel box (core i7).
- the process reads a lot of data and processes it
- it does not modify the data
I am interested in the number of data cache lines that are not in L1D and need to be fetched.
Using OProfile, I have looked at the following two performance counters: L1D.REPL and L1D_CACHE_LD.I_STATE, each with a sample count of 10000.
According to the documentation that I have looked at, these are described as follows:
  • Counted L1D events (Counts the number of lines brought from/to the L1 data cache.) with a unit mask of 0x01 (repl Counts the number of lines brought into the L1 data cache) count 10000
  • Counted L1D_CACHE_LD events (Counts L1 data cache read requests.) with a unit mask of 0x01 (i_state Counts L1 data cache read requests where the cache line to be loaded is in the I (invalid) state, i) count 10000
As far as I understand, reading an i_state data cache line will lead to a cache line being brought into the L1D cache, thus generating a repl event.
Thus, I would expect that the two counters show similar values.
Interestingly enough they differ quite a lot, as can be seen from the output of opreport below (libmy1.so and libmy2.so are the two main parts of my application) :
L1D.repl             | L1D_CACHE_LD.i_state |
samples  | %         | samples  | %         |
-----------------------------------------------------------
719367     54.1714     1919577    45.3786     libmy1.so
603784     45.4675     2295587    54.2674     libmy2.so
Roughly, the totals of the two counters differ by about 3 million samples.
L1D.repl is about 1.3M while L1D_CACHE_LD.i_state is about 4.2M.
Could someone please explain why this is happening or, even better, how I am misinterpreting the purpose of the two performance counters?
Thanks,
Tudor
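The arithmetic behind the question can be checked with a short sketch. It sums the sample counts from the opreport output above and, assuming that one sample corresponds to roughly 10000 events (the "count 10000" setting), estimates raw event counts. The names `SAMPLES`, `totals` and `estimated_events` are made up for illustration and are not part of OProfile.

```python
# Sample counts copied from the opreport output in the post.
SAMPLES = {
    "libmy1.so": {"L1D.REPL": 719367, "L1D_CACHE_LD.I_STATE": 1919577},
    "libmy2.so": {"L1D.REPL": 603784, "L1D_CACHE_LD.I_STATE": 2295587},
}

# Assumption: "count 10000" means one sample is recorded per 10000 events.
SAMPLE_PERIOD = 10000


def totals(samples):
    """Sum the samples of each event over all listed binaries."""
    out = {}
    for per_image in samples.values():
        for event, n in per_image.items():
            out[event] = out.get(event, 0) + n
    return out


def estimated_events(sample_count, period=SAMPLE_PERIOD):
    """Rough raw event count implied by a number of samples."""
    return sample_count * period


t = totals(SAMPLES)
ratio = t["L1D_CACHE_LD.I_STATE"] / t["L1D.REPL"]

for event, n in t.items():
    print(f"{event}: {n} samples, ~{estimated_events(n):.3e} events")
print(f"I_STATE / REPL ratio: {ratio:.2f}")
```

Only the two listed libraries are included, so the totals are slightly below the full-profile figures; the ratio between the two counters is what matters here.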
3 Replies
Peter_W_Intel
Employee

Hi Tudor,

I am not sure why you care about L1D misses; the penalty of a single L1D miss is only about 4-8 extra cycles. Most developers care about L2 misses and LLC misses.

A count of L1D misses can be achieved with the use of all the MEM_LOAD_RETIRED
events, except MEM_LOAD_RETIRED.L1D_HIT:

L1D_MISSES = MEM_LOAD_RETIRED.HIT_LFB
           + MEM_LOAD_RETIRED.L2_HIT
           + MEM_LOAD_RETIRED.LLC_UNSHARED_HIT
           + MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM
           + MEM_LOAD_RETIRED.LLC_MISS
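
As a minimal sketch, the sum above could be computed from collected counter values like this. The helper `l1d_misses` and the numbers in `example_counts` are hypothetical and only illustrate the formula; they are not measurements and not part of any profiler API.

```python
# MEM_LOAD_RETIRED sub-events that contribute to the L1D miss count
# in the formula above (everything except MEM_LOAD_RETIRED.L1D_HIT).
L1D_MISS_EVENTS = (
    "MEM_LOAD_RETIRED.HIT_LFB",
    "MEM_LOAD_RETIRED.L2_HIT",
    "MEM_LOAD_RETIRED.LLC_UNSHARED_HIT",
    "MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM",
    "MEM_LOAD_RETIRED.LLC_MISS",
)


def l1d_misses(counts):
    """Sum the miss-side MEM_LOAD_RETIRED events from a name -> count mapping."""
    missing = [e for e in L1D_MISS_EVENTS if e not in counts]
    if missing:
        raise KeyError(f"missing counters: {', '.join(missing)}")
    return sum(counts[e] for e in L1D_MISS_EVENTS)


# Hypothetical counter values, for illustration only.
example_counts = {
    "MEM_LOAD_RETIRED.L1D_HIT": 9000,  # ignored: hits are not misses
    "MEM_LOAD_RETIRED.HIT_LFB": 500,
    "MEM_LOAD_RETIRED.L2_HIT": 300,
    "MEM_LOAD_RETIRED.LLC_UNSHARED_HIT": 120,
    "MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM": 30,
    "MEM_LOAD_RETIRED.LLC_MISS": 50,
}

print(l1d_misses(example_counts))
```

Raising on a missing counter, rather than treating it as zero, avoids silently underestimating the miss count when one event was not collected.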

Please read this article, written by Dr. David Levinthal.

Hope it helps.

Regards, Peter

tsalomie
Beginner

Hi Peter,
Thank you very much for your reply.
(1) Because of the way I process the data, I am aiming for very good locality in the L1 cache.
For this reason I need to be able to measure L1D cache misses rather than L2 misses.
Even if an L1 cache miss costs only a few cycles, these can add up to a large cost.
(2) The article you pointed out is indeed great! Actually it states clearly that
NOTE: many of these events are known to overcount (l1d_cache_ld, l1d_cache_lock) so
they can only be used for qualitative analysis.
This would explain why I see such a huge difference between the L1D.REPL and L1D_CACHE_LD.I_STATE counters.
At least I hope that this is the cause of the difference between the values reported by the two counters.
If I am interpreting their meaning in a wrong way (i.e., they *should* report the same value), please correct me.
Thanks,
Tudor.
Peter_W_Intel
Employee

Hi Tudor,

Thanks for explaining your requirements!

I agree that many events overlap, but the user should choose among them appropriately.

In my view:
L1D.REPL relates to L1D cache line flushing driven by page faults, where the TLB translates the address and the data is reloaded into L1D.
L1D_CACHE_LD.I_STATE counts all L1D misses, which is what you want.

An L1D cache miss does not imply a page fault.

Regards, Peter
