Hi guys,
I'm currently working on optimizing some algorithms.
During my work I found some odd behavior.
First some facts:
1) I'm working in kernel mode using a Windows real-time extension
2) I disabled interrupts and there are no context switches
3) I write back and invalidate the cache each time I run the algorithm
4) I am using an Intel Core architecture CPU
5) The algorithm mainly reads, modifies and writes back memory in a loop
6) The memory area I use is not being paged
Now look at the image below. What I don't understand is the behavior at the beginning. Why are there peaks in execution time that settle after a few executions? Any ideas?
Thanks in advance!
Regards,
Michael
Michael,
Just out of curiosity, how are you invalidating the cache?
0.28 ms is a fairly short amount of time. What timer are you using to do the measurements? Maybe it's the measurement and not the routine that causes this variation. I would use rdtsc in this case.
Kind regards
Thomas
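To make the timer question concrete, here is a minimal sketch of a fenced rdtsc read plus an estimate of the timer's own cost. It assumes GCC or Clang on x86-64 and uses the `__rdtsc` and `_mm_lfence` intrinsics instead of inline assembly. The names `ticks_now` and `timer_overhead` are hypothetical, and the `clock_gettime` branch is only there so the sketch still builds on other hosts (it returns nanoseconds, not cycles).

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>  /* __rdtsc and _mm_lfence on GCC/Clang */

/* Read the time-stamp counter. The LFENCE before RDTSC waits until earlier
 * instructions have completed locally; the LFENCE after it keeps later
 * instructions from starting before the counter has been read. */
static uint64_t ticks_now(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}
#else
#include <time.h>

/* Fallback for non-x86 hosts: monotonic clock in nanoseconds. */
static uint64_t ticks_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* Smallest delta between two back-to-back reads over `samples` tries.
 * This approximates the cost of the measurement itself. */
static uint64_t timer_overhead(int samples)
{
    assert(samples > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < samples; ++i) {
        uint64_t t0 = ticks_now();
        uint64_t t1 = ticks_now();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

If the value returned by `timer_overhead` is small compared with the duration of one run, the timer is unlikely to explain the peaks.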
I'm invalidating the caches using the wbinvd instruction.
For the measurements I am using rdtsc (and performance counters to measure cache misses etc.).
I'm using C with inline assembly. The measurement routines should not make any difference...
Another interesting thing is:
At these peaks in execution time, the L1 cache misses decrease while the L2 cache misses increase.
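The measurement procedure described above (flush, start the timer, run, stop the timer) can be sketched as a small harness. This is an illustration under assumptions, not the code used in this thread: `measure_runs` and all the stand-in functions are hypothetical names. Because wbinvd is privileged, the flush is passed in as a callback and the demo uses a counting stub and a fake clock so the sketch runs in user mode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void     (*flush_fn)(void);      /* e.g. a wbinvd wrapper in ring 0 */
typedef uint64_t (*ticks_fn)(void);      /* e.g. a fenced rdtsc read        */
typedef void     (*work_fn)(void *ctx);  /* the routine being measured      */

/* Run `work` `runs` times. Before each run call `flush`, then store the
 * elapsed ticks of that single run in out[i]. */
static void measure_runs(flush_fn flush, ticks_fn ticks, work_fn work,
                         void *ctx, uint64_t *out, size_t runs)
{
    assert(flush && ticks && work && out);
    for (size_t i = 0; i < runs; ++i) {
        flush();
        uint64_t t0 = ticks();
        work(ctx);
        out[i] = ticks() - t0;
    }
}

/* Hypothetical stand-ins so the sketch runs without kernel privileges. */
static unsigned flush_calls;
static void count_flush(void)
{
    /* In ring 0 this could be: __asm__ volatile("wbinvd" ::: "memory"); */
    ++flush_calls;
}

static uint64_t fake_clock;
static uint64_t fake_ticks(void) { return fake_clock; }

struct buf { uint32_t *data; size_t len; };

/* Read-modify-write loop over a buffer. It advances the fake clock by one
 * tick per element so the harness has something to record. */
static void rmw_loop(void *ctx)
{
    struct buf *b = ctx;
    for (size_t i = 0; i < b->len; ++i)
        b->data[i] += 1u;
    fake_clock += b->len;
}
```

Keeping one value per run, instead of an average, is what makes the start-up peaks visible at all.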
The timer would have been too easy (but I had to ask because I've seen this too many times).
I assume that you have also verified what the hardware prefetchers are doing, e.g. measured the number of prefetched cache lines or disabled them to cross-check?
The graphs show exactly the same peaks.
I also added some loops and serializing instructions before running the algorithm to make sure that the context switch to kernel mode is done and everything is settled. No changes...
Disable the caches? I didn't try that. But I don't know if the Windows platform would like it...
What do you suggest? I can't figure out anything that would lead to a conclusion about what influences the behavior.
Is there some component apart from the caches that prefetches data?
Michael,
I was referring to the hardware logic that automatically prefetches data into the caches before it is requested. These hardware prefetchers can often be disabled in the BIOS. (The cache itself cannot be disabled.) Windows won't notice that the prefetchers are disabled other than that the system runs slower.
The performance counters provide events to monitor how many cache lines are fetched by misses and by the prefetchers. On the latest Intel architecture, such events are L1D_PREFETCH.* or L2_DATA_RQSTS.PREFETCH.* and some more.
Kind regards
Thomas
I have now turned off hardware prefetching.
I then added a serializing instruction after wbinvd just to make sure everything has settled.
Without software prefetching, L1 cache misses stay stable after the first run (which has some more misses).
Nevertheless, L2 misses take about 900 runs to settle...
Strange thing!
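To put a number on "takes about N runs to settle", one option is to compute the settling point from the recorded per-run values instead of reading it off the graph. The sketch below is one possible definition, not the method used in this thread: `settle_index`, the tail length and the tolerance are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the first index i such that every sample v[i..n-1] is at most
 * tol_percent above the steady-state reference, where the reference is the
 * minimum of the last `tail` samples.
 * Returns 0 if the series is settled from the first run, and n if even the
 * last sample is above the limit. */
static size_t settle_index(const uint64_t *v, size_t n, size_t tail,
                           unsigned tol_percent)
{
    assert(v && n > 0 && tail > 0 && tail <= n);

    /* Steady-state reference: smallest value in the tail window. */
    uint64_t ref = v[n - 1];
    for (size_t i = n - tail; i < n; ++i)
        if (v[i] < ref)
            ref = v[i];

    uint64_t limit = ref + ref * tol_percent / 100;

    /* Walk backwards until a sample exceeds the limit. */
    size_t i = n;
    while (i > 0 && v[i - 1] <= limit)
        --i;
    return i;
}
```

The same function works for execution times or for miss counts, so the settling points of the L1 and L2 series can be compared directly.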
On the Intel Core architecture, you can use the L2_LD.SELF.PREFETCH.* events to monitor the traffic generated by the prefetchers and compare it to L2_LD.SELF.DEMAND.*.
Kind regards
Thomas
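Once both counts are available, the comparison suggested above comes down to simple arithmetic on counter snapshots. The sketch below assumes the two event counts have already been read into plain integers before and after a run. How the counters are programmed and read is not shown, and the `l2_snapshot` struct and function name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Two cumulative event counts sampled at one point in time. */
struct l2_snapshot {
    uint64_t prefetch;  /* count from an L2_LD.SELF.PREFETCH.* event */
    uint64_t demand;    /* count from an L2_LD.SELF.DEMAND.* event   */
};

/* Percentage (rounded down) of L2 loads between two snapshots that were
 * issued by the prefetchers. Returns 0 if no loads were counted. */
static unsigned prefetch_share_percent(struct l2_snapshot before,
                                       struct l2_snapshot after)
{
    assert(after.prefetch >= before.prefetch);
    assert(after.demand >= before.demand);

    uint64_t prefetch = after.prefetch - before.prefetch;
    uint64_t demand   = after.demand - before.demand;
    uint64_t total    = prefetch + demand;

    if (total == 0)
        return 0;
    return (unsigned)(prefetch * 100 / total);
}
```

With the hardware prefetchers disabled in the BIOS, this share should stay close to zero for every run, which gives a quick cross-check that the setting took effect.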
