Performance Counters: Difference between L1D.REPL and L1D_CACHE_LD.I_STATE on Nehalem Core i7

tsalomie · ‎07-08-2010

Hi,

I am running a benchmark on a data processing system that I have developed.

The setup is:

- one application, running only one process on one core on an Intel box (core i7).

- the process reads a lot of data and processes it

- it does not modify the data

I am interested in the number of data cache lines that are not in L1D and need to be fetched.

Using OProfile, I have looked at the following two performance counters: L1D.REPL and L1D_CACHE_LD.I_STATE, each with a count of 10000.

According to the documentation that I have looked at, these are described as follows:

Counted L1D events (Counts the number of lines brought from/to the L1 data cache.) with a unit mask of 0x01 (repl Counts the number of lines brought into the L1 data cache) count 10000
Counted L1D_CACHE_LD events (Counts L1 data cache read requests.) with a unit mask of 0x01 (i_state Counts L1 data cache read requests where the cache line to be loaded is in the I (invalid) state, i) count 10000

As far as I understand, reading an i_state data cache line will lead to a cache line being brought into the L1D cache, thus generating a repl event.

Thus, I would expect that the two counters show similar values.

Interestingly enough, they differ quite a lot, as can be seen from the opreport output below (libmy1.so and libmy2.so are the two main parts of my application):

L1D.repl           L1D_CACHE_LD.istate
samples|       %|  samples|       %|
---------------------------------------------------------
 719367  54.1714   1919577  45.3786  libmy1.so
 603784  45.4675   2295587  54.2674  libmy2.so

Roughly, there is an overall difference of 3 million samples between the totals of the two counters.

L1D.repl is 1.3M while L1D_CACHE_LD.i_state is 4.2M.

Could someone please explain why this is happening or, even better, how I am misinterpreting the purpose of the two performance counters?
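As a quick check of the arithmetic, the totals can be recomputed from the two rows of the opreport output. This is only a sketch: the sample counts are copied from the table, while the assumption that each sample stands for roughly 10000 events (the count passed to OProfile) is mine, and the names `SAMPLES`, `SAMPLE_PERIOD` and `totals` are made up for illustration.

```python
# Sample counts copied from the opreport rows for the two libraries.
SAMPLES = {
    "libmy1.so": {"L1D.REPL": 719367, "L1D_CACHE_LD.I_STATE": 1919577},
    "libmy2.so": {"L1D.REPL": 603784, "L1D_CACHE_LD.I_STATE": 2295587},
}

# Assumption: one sample is recorded per 10000 events, so
# events ~= samples * SAMPLE_PERIOD. This is an estimate, not an exact count.
SAMPLE_PERIOD = 10000


def totals(samples):
    """Sum the sample counts of every event across all modules."""
    out = {}
    for per_event in samples.values():
        for event, count in per_event.items():
            out[event] = out.get(event, 0) + count
    return out


t = totals(SAMPLES)
repl = t["L1D.REPL"]
istate = t["L1D_CACHE_LD.I_STATE"]

print("L1D.REPL samples:            ", repl)
print("L1D_CACHE_LD.I_STATE samples:", istate)
print("difference in samples:       ", istate - repl)
print("ratio I_STATE / REPL:        ", round(istate / repl, 2))
print("estimated REPL events:       ", repl * SAMPLE_PERIOD)
```

Only the two listed libraries are summed here; the percentages in the table suggest that a small remainder of samples belongs to other modules.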

Thanks,

Tudor

Peter_W_Intel · ‎07-08-2010

Hi Tudor,

I am not sure why you care about L1D misses; the penalty of one L1D miss is only an extra 4-8 cycles. Most developers usually care about L2 misses and LLC misses.

A count of L1D misses can be achieved with the use of all the MEM_LOAD_RETIRED events, except MEM_LOAD_RETIRED.L1D_HIT:

L1D_MISSES = MEM_LOAD_RETIRED.HIT_LFB
           + MEM_LOAD_RETIRED.L2_HIT
           + MEM_LOAD_RETIRED.LLC_UNSHARED_HIT
           + MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM
           + MEM_LOAD_RETIRED.LLC_MISS
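The sum above could be applied to collected counter values roughly as follows. This is a minimal sketch, not part of any profiler API: the helper `l1d_misses` and the numbers in `example_counts` are invented for illustration, and only the event names come from the formula.

```python
# The five MEM_LOAD_RETIRED events that make up the L1D miss count
# (every MEM_LOAD_RETIRED event in the formula, i.e. all except L1D_HIT).
L1D_MISS_EVENTS = (
    "MEM_LOAD_RETIRED.HIT_LFB",
    "MEM_LOAD_RETIRED.L2_HIT",
    "MEM_LOAD_RETIRED.LLC_UNSHARED_HIT",
    "MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM",
    "MEM_LOAD_RETIRED.LLC_MISS",
)


def l1d_misses(counts):
    """Return the sum of the miss-side events; fail if one was not measured."""
    missing = [e for e in L1D_MISS_EVENTS if e not in counts]
    if missing:
        raise KeyError("missing counters: " + ", ".join(missing))
    return sum(counts[e] for e in L1D_MISS_EVENTS)


# Made-up counter values, only to show the calculation.
example_counts = {
    "MEM_LOAD_RETIRED.L1D_HIT": 9500,  # deliberately not part of the sum
    "MEM_LOAD_RETIRED.HIT_LFB": 120,
    "MEM_LOAD_RETIRED.L2_HIT": 300,
    "MEM_LOAD_RETIRED.LLC_UNSHARED_HIT": 50,
    "MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM": 5,
    "MEM_LOAD_RETIRED.LLC_MISS": 25,
}

misses = l1d_misses(example_counts)
print("L1D misses:", misses)
```

Raising on a missing counter, rather than treating it as zero, avoids silently undercounting when not all five events were collected in the same run.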

Please read this article, written by Dr David Levinthal.

Hope it helps.

Regards, Peter

tsalomie · ‎07-09-2010

Hi Peter,

Thank you very much for your reply.

(1) Because of the way I process the data, I am aiming for very good locality in the L1 cache.

For this reason I need to be able to measure L1D cache misses rather than L2 misses.

Even if an L1 cache miss costs only a few cycles, these can add up to a large cost.

(2) The article you pointed out is indeed great! Actually it states clearly that

NOTE: many of these events are known to overcount (l1d_cache_ld, l1d_cache_lock) so
they can only be used for qualitative analysis.

This would explain why I see such a huge difference between the L1D.REPL and L1D_CACHE_LD.I_STATE counters.

At least I hope that this is the cause of the difference between the values reported by the two counters.

If I am interpreting their meaning in a wrong way (i.e., they *should* report the same value), please correct me.
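To put a rough number on point (1), the L1D.REPL samples from my first post can be combined with the 4-8 cycle penalty Peter mentioned. This is a back-of-the-envelope sketch under loud assumptions: each sample is taken to represent about 10000 events, every line fill is assumed to pay the full penalty, and no overlap between misses is modelled. The names below are made up for illustration.

```python
# Assumption: one OProfile sample per 10000 events.
SAMPLE_PERIOD = 10_000

# L1D.REPL samples for libmy1.so and libmy2.so, from the opreport output.
REPL_SAMPLES = 719_367 + 603_784

# Penalty range per L1D miss, as quoted earlier in this thread.
PENALTY_CYCLES = (4, 8)


def miss_cost_cycles(samples, period, penalty):
    """Estimate extra cycles as samples * events-per-sample * cycles-per-miss."""
    return samples * period * penalty


low, high = (
    miss_cost_cycles(REPL_SAMPLES, SAMPLE_PERIOD, p) for p in PENALTY_CYCLES
)

print(f"estimated line fills:   {REPL_SAMPLES * SAMPLE_PERIOD:.3e}")
print(f"estimated extra cycles: {low:.3e} to {high:.3e}")
```

Since out-of-order execution can hide part of this latency, the result is better read as an upper-bound style estimate than as a measurement.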

Thanks,

Tudor.

Peter_W_Intel · ‎07-09-2010

Hi Tudor,

Thanks for explaining your requirements!

I agree that many events overlap, but the user should select them appropriately.

In my view:
L1D.REPL is for L1D cache line flushing driven by page faults, after which the TLB translates the address and the data is reloaded into L1D.
L1D_CACHE_LD.I_STATE counts all L1D misses, which is what you want.

When an L1D cache miss happens, it does not mean that a page fault occurred.

Regards, Peter