According to the Intel manual, L1D_REPL counts the number of lines brought into the L1 data cache on Core 2 CPUs. I tested this event on a Q8200 machine, but the result was not what I expected. I did the following:
1: set IA32_PMC0 to count the L1D_REPL event on all cores
2: disable the caches of all other cores
3: flush the cache hierarchy using wbinvd
4: preload the counter variables (high and low) into the cache
5: rdmsr(IA32_PMC0, low, high)
6: access a 300*64-byte buffer aligned on a 64-byte boundary (so the cache fills with 64-byte cache lines)
7: rdmsr(IA32_PMC0, low, high)
8: calculate the number of L1D fills from the two readings
The expected count is 300, but I got values varying from 300 to 305. My first guess was hardware prefetching, but even after subtracting the counted L1D_PREFETCH.REQUESTS events, the value is still not accurate. Does anyone know the reason?
For this test to give the results you expect, you have to make sure that nothing else is happening on the CPU: no context switches, no interrupts (such as OS clock interrupts or SMIs), no calls into the OS, and so on.
So there are many reasons why the count is probably correct and is reflecting real L1D_REPL events. Some events do, however, count 'recirculated' events: if the operation takes a long time to complete, the miss may get counted again. I don't know whether that is the case with this event, but it is one reason I usually prefer to count 'retired' events (such as MEM_LOAD_RETIRED.L1D_MISS), which are only counted at retirement.
It is delightful when you get exactly the count you expect, but that is a pretty unusual case too --- it is extremely difficult to prevent a system from doing anything else during the testing interval.
I usually go for lots of repetition and am happy if the counts are within a few percent of expectations (especially if they are high --- it is hard for interference to cause counts to go down, but easy for interference to make counts go up).
For memory operations, I usually see an extra few percent due to TLB walk traffic. It can be very hard to predict where the page table entries will be found for a given memory reference pattern, so tolerance of a bit of "fuzz" is necessary. Most of this goes away with 2 MiB pages, but not all of it.
It is worse with whole-program measurements, since it is not at all obvious how many cache and memory references will be required for the OS to instantiate a page, for example.
I don't think this is a problem with counter accuracy. More likely, some code other than yours, possibly more privileged, was scheduled to run and loaded some cache lines, so the number is not exactly 300.
By the way, what OS are you using?
You can run a spin-wait loop on that core to prevent anything else from running at that time. On Windows, you can raise the IRQL to a very high level, which guarantees that only your code will run unless an IPI is issued. I suppose you can do the same on Linux.