I'm currently working on optimizing some algorithms.
During my work I found some odd behavior.
First some facts:
1) I'm working in kernel mode using a Windows real-time extension
2) I disabled interrupts and there are no context switches
3) I write back and invalidate the cache each time I run the algorithm
4) I am using an Intel Core architecture
5) The Algorithm mainly reads, modifies and writes back memory in a loop
6) The memory area I use is not being paged
Now look at the image below. What I don't understand is the behavior at the beginning: why are there peaks in execution time that settle after a few executions? Any ideas?
Thanks in advance!
Just out of curiosity, how are you invalidating the cache?
0.28 ms is a fairly short amount of time. What timer are you using to do the measurements? Maybe it's the measurement and not the routine that causes this variation. I would use rdtsc in this case.
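For what it's worth, here is a minimal sketch of how an rdtsc-based measurement could look. It assumes GCC or Clang on x86 (MSVC would use <intrin.h> instead), and the names read_tsc, time_runs and sample_workload are made up for illustration; sample_workload is only a stand-in for the real algorithm. The lfence instructions are there so that earlier instructions have finished before the counter is read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, _mm_lfence (GCC/Clang; assumption) */

/* Hypothetical helper: read the time-stamp counter, fenced on both
   sides so the read is not reordered with the code being measured. */
static inline uint64_t read_tsc(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* Hypothetical stand-in for the algorithm under test: reads, modifies
   and writes back 1024 32-bit words. */
static void sample_workload(void *arg)
{
    uint32_t *buf = (uint32_t *)arg;
    for (size_t i = 0; i < 1024; ++i)
        buf[i] += 1;
}

/* Hypothetical helper: run fn `runs` times and store the elapsed TSC
   ticks of each individual run in cycles[0..runs-1]. */
static void time_runs(void (*fn)(void *), void *arg,
                      uint64_t *cycles, int runs)
{
    assert(runs > 0);
    for (int i = 0; i < runs; ++i) {
        uint64_t start = read_tsc();
        fn(arg);
        cycles[i] = read_tsc() - start;
    }
}
```

Calling time_runs(sample_workload, buf, cycles, n) and looking at the individual entries of cycles, rather than an average, would show whether the first few runs really are slower. The values are TSC ticks, so they would still have to be divided by the TSC frequency to get a time.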
The timer would have been too easy (but I had to ask because I've seen this too many times).
I assume that you have also verified what the hardware prefetchers are doing, e.g. by measuring the number of prefetched cache lines or by disabling them to cross-check?
I was referring to the hardware logic that automatically prefetches data into the caches before it is requested. These hardware prefetchers can often be disabled in the BIOS. (The cache itself cannot be disabled.) Windows won't notice that the prefetchers are disabled, other than that the system runs slower.
The performance counters provide events to monitor how many cache lines are fetched by misses and by the prefetchers. On the latest Intel architecture, such events are L1D_PREFETCH.* or L2_DATA_RQSTS.PREFETCH.* and some more.