Intel® ISA Extensions

Cache Optimization

xift
Beginner

Hi guys,

I'm currently working on the optimization of some algorithms.

During this work I found some odd behavior.

First some facts:

1) I'm working in kernel mode using a Windows real-time extension.

2) I disabled interrupts and there are no context switches.

3) I write back and invalidate the caches each time I run the algorithm.

4) I am using an Intel Core architecture CPU.

5) The algorithm mainly reads, modifies, and writes back memory in a loop (see the sketch after this list).

6) The memory area I use is not paged.
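
In outline, one measured run has roughly this structure (a simplified sketch, not the actual code; the __wbinvd and __rdtsc intrinsic names are just shorthand here, the real routine uses inline assembly):

#include <intrin.h>
#include <stddef.h>

/* Sketch of one measured run: flush the caches, take a timestamp,
 * run the read-modify-write loop over the working set, take a second
 * timestamp. Kernel mode, interrupts disabled. */
static unsigned __int64 measure_one_run(volatile unsigned int *buf, size_t n)
{
    unsigned __int64 t0, t1;
    size_t i;

    __wbinvd();                  /* write back and invalidate all caches */
    t0 = __rdtsc();
    for (i = 0; i < n; i++)
        buf[i] = buf[i] + 1;     /* read, modify, write back */
    t1 = __rdtsc();
    return t1 - t0;              /* elapsed cycles for this run */
}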

Now look at the image below. What I don't understand is the behavior at the beginning: why are there these peaks in execution time that settle down after a few executions? Any ideas?

Thanks in advance!

Regards,

Michael

Thomas_W_Intel
Employee

Michael,

Just out of curiosity, how are you invalidating the cache?

0.28 ms is a fairly short amount of time. What timer are you using for the measurements? Maybe it's the measurement and not the routine that causes this variation. I would use rdtsc in this case.
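
For example, a serialized timestamp read could look roughly like this (just a sketch using the __cpuid and __rdtsc intrinsics; inline assembly works just as well):

#include <intrin.h>

/* Sketch: execute cpuid as a serializing instruction before reading
 * the time-stamp counter, so earlier instructions cannot drift past
 * the measurement point. */
static unsigned __int64 read_tsc_serialized(void)
{
    int regs[4];
    __cpuid(regs, 0);   /* serialize */
    return __rdtsc();   /* read the TSC */
}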

Kind regards

Thomas

xift
Beginner
Hey Thomas,

I'm invalidating the caches using the wbinvd instruction.
For the measurements I am using rdtsc (and performance counters to measure cache misses etc.).
I'm using C with inline assembly. The measurement routines should not make any difference...

Another interesting thing: at these peaks in execution time, the L1 cache misses decrease while the L2 cache misses increase.
Thomas_W_Intel
Employee

The timer would have been too easy (but I had to ask because I've seen this too many times).

I assume that you have also verified what the hardware prefetchers are doing, e.g. measured the number of prefetched cache lines or disabled them to cross-check?

xift
Beginner
I measured the L1/L2 Requests...
The graphs show exactly the same peaks.
I also added some loops and serializing instructions before running the algorithm to make sure that the context switch to kernel mode is done and everything is settled. No changes...

Disable the caches? I didn't try that. But I don't know if the Windows platform would like it...
What do you suggest? I can't figure out anything that would lead to a conclusion about what influences the behavior.


Is there some instance apart from the caches that prefetches data?
Thomas_W_Intel
Employee

Michael,

I was referring to the hardware logic that automatically prefetches data into the caches before it is requested. These hardware prefetchers can often be disabled in the BIOS. (The cache itself cannot be disabled.) Windows won't notice that the prefetchers are disabled, other than that the system runs slower.

The performance counters provide events to monitor how many cache lines are fetched by misses and by the prefetchers. On the latest Intel architecture, such events are L1D_PREFETCH.* or L2_DATA_RQSTS.PREFETCH.* and some more.
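
If your BIOS does not expose those switches, the prefetchers can also be toggled from kernel mode through the IA32_MISC_ENABLE MSR. The following is only a sketch, and the bit assignments (9 for the L2 hardware prefetcher, 19 for the adjacent cache line prefetcher) are what I would expect on Core 2 class parts, so please verify them against the SDM for your exact model:

#include <intrin.h>

#define IA32_MISC_ENABLE 0x1A0

/* Sketch: disable two hardware prefetchers via IA32_MISC_ENABLE.
 * ASSUMED bit layout (verify in the SDM for your model):
 *   bit 9  - L2 hardware prefetcher disable
 *   bit 19 - adjacent cache line prefetch disable
 * The MSR is per core, so run this on every logical CPU. */
static void disable_hw_prefetchers(void)
{
    unsigned __int64 v = __readmsr(IA32_MISC_ENABLE);
    v |= (1ULL << 9) | (1ULL << 19);
    __writemsr(IA32_MISC_ENABLE, v);
}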

Kind regards
Thomas

xift
Beginner
Hmm, these events would be interesting. Too bad my PC is too old!
These events are available on Core i7 and Xeon 5500 only.
I've never seen BIOS settings for prefetching (never looked for them either). I'll try that tomorrow.
AFAIK caching can be prevented (which is the same as disabled, for my purposes) using the control registers.
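
Roughly like this, I think (only a sketch in 32-bit MSVC __asm syntax; CR0.CD is bit 30, and the SDM says to clear NW and flush afterwards):

/* Sketch: prevent caching by setting CR0.CD, clearing CR0.NW, and
 * flushing the caches. Kernel mode only; this makes the whole system
 * very slow. */
static void disable_caching(void)
{
    __asm {
        mov  eax, cr0
        or   eax, 40000000h       ; CD = 1 (bit 30): caching disabled
        and  eax, 0DFFFFFFFh      ; NW = 0 (bit 29)
        mov  cr0, eax
        wbinvd                    ; flush so no stale lines remain
    }
}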
Best regards,
Michael
xift
Beginner
Okay,
I turned off hardware prefetching now, and added a serializing instruction after wbinvd just to make sure everything is settled.
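
Concretely, something like this (a simplified sketch in 32-bit MSVC __asm syntax, with cpuid as the serializing instruction; not the exact code):

/* Sketch: write back and invalidate the caches, then execute cpuid
 * as a serializing instruction so everything has settled before the
 * measured region starts. */
static void flush_and_serialize(void)
{
    __asm {
        wbinvd               ; write back and invalidate all caches
        xor  eax, eax
        cpuid                ; serializing instruction
    }
}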


Without hardware prefetching, the L1 cache misses stay stable after the first run (which has some more misses).
Nevertheless, the L2 misses take about 900 runs to settle...

Strange thing!
Thomas_W_Intel
Employee
Michael,

On the Intel Core architecture, you can use the event L2_LD.SELF.PREFETCH.* to monitor the traffic generated by the prefetchers and compare it to L2_LD.SELF.DEMAND.*.
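
A rough sketch of programming and reading one counter from kernel mode follows. The event select value is deliberately left as a parameter: the exact event code and umask for L2_LD.SELF.PREFETCH.* and L2_LD.SELF.DEMAND.* have to be taken from the SDM (Appendix A), and run_algorithm merely stands in for your routine:

#include <intrin.h>

#define IA32_PERFEVTSEL0 0x186   /* event select register for PMC0 */
#define IA32_PMC0        0x0C1   /* performance counter 0 */

extern void run_algorithm(void); /* placeholder for the routine under test */

/* Sketch: program PMC0 with one event, run the algorithm, read the
 * count with rdpmc. The evtsel argument must carry the event code and
 * umask (e.g. for L2_LD.SELF.PREFETCH.*) taken from the SDM. */
static unsigned __int64 count_event(unsigned __int64 evtsel)
{
    __writemsr(IA32_PMC0, 0);                 /* clear the counter    */
    __writemsr(IA32_PERFEVTSEL0, evtsel
               | (1ULL << 22)                 /* EN: enable counter   */
               | (1ULL << 17)                 /* OS: count ring 0     */
               | (1ULL << 16));               /* USR: count ring 3    */
    run_algorithm();
    return __readpmc(0);                      /* rdpmc, counter 0     */
}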

Kind regards
Thomas