I am running VTune 6.1 with the remote Linux collector on a Pentium 3, and I am trying to optimize a single hot method. VTune says it has lots of branch mispredicts, cache misses, etc. But when I drill down to source view, the samples seem very far off. For example, for branch mispredictions, I have a loop with about 200 lines in it, and a branch mispredict event shows up on most of those lines. Similarly, if I sample for prefetch events, shouldn't they only show up where I call prefetch? Instead the events are spread all over. Do I need to collect more samples? To benchmark this method, I currently have a test driver that just executes it in a loop 3000 times, which takes about 10 seconds. Let me know what I should be doing. Thanks, Brian
If you're looking only at source view, try asm view. The events should show up within a few instructions after the one responsible. If you are using a compiler that generates software prefetch, you need to do this anyway, to see where the compiler inserted prefetches and whether they duplicate your own. If you have hot spots in code without symbols (-g), you may get misleading results. Repeating the method this way is almost certain to skew the results. If you have made 3000 data sets, you probably get far more cache misses than in normal usage. If you repeat with a single data set, you should get far fewer cache misses and branch mispredicts than normal.
1) For me, source and asm view are the same, since the function is written in inline assembly. So there are also no compiler-inserted prefetches.
2) I am compiling with -O3 and -g.
3) One of the reasons I made the benchmark this way is that it is representative of how the method will be used. This is part of a graphics program, and the method will be called each frame. It is true that I should change the data each frame, as will happen in the real world. I will fix the test, but this still doesn't address my main problem with skew.
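A minimal sketch of a test driver that feeds different data on each call, as discussed above. Everything here is an assumption for illustration: `process_frame` is a hypothetical stand-in for the real inline-assembly method, `fill_frame` and `run_benchmark` are made-up helper names, and the pseudo-random fill is only a placeholder for realistic pixel data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum { FRAME_W = 720, FRAME_H = 480, BYTES_PER_PIXEL = 4 };
#define FRAME_BYTES ((size_t)FRAME_W * FRAME_H * BYTES_PER_PIXEL)

/* Hypothetical stand-in for the real inline-assembly routine. */
uint32_t process_frame(const uint8_t *pixels, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += pixels[i];
    return sum;
}

/* Fill a frame with pseudo-random bytes from a small LCG, so that
   frames built from different seeds hold different data. */
void fill_frame(uint8_t *pixels, size_t n, uint32_t seed)
{
    uint32_t state = seed;
    for (size_t i = 0; i < n; i++) {
        state = state * 1664525u + 1013904223u;
        pixels[i] = (uint8_t)(state >> 24);
    }
}

/* Call process_frame `iterations` times, cycling through `num_frames`
   distinct frames. Stores a checksum so the work cannot be optimized
   away. Returns 0 on success, -1 if allocation fails. */
int run_benchmark(int iterations, int num_frames, uint32_t *checksum)
{
    assert(iterations > 0 && num_frames > 0 && checksum != NULL);

    uint8_t *frames = malloc((size_t)num_frames * FRAME_BYTES);
    if (frames == NULL)
        return -1;

    /* Generate the input up front so the main loop only runs the method. */
    for (int f = 0; f < num_frames; f++)
        fill_frame(frames + (size_t)f * FRAME_BYTES, FRAME_BYTES,
                   (uint32_t)f + 1u);

    uint32_t sum = 0;
    for (int i = 0; i < iterations; i++)
        sum += process_frame(frames + (size_t)(i % num_frames) * FRAME_BYTES,
                             FRAME_BYTES);

    free(frames);
    *checksum = sum;
    return 0;
}
```

Calling `run_benchmark(3000, 4, &sum)` would mirror the 3000-iteration loop described earlier; four frames is an arbitrary choice. Since the input is generated before the main loop, the samples inside the loop belong to the method under test rather than to the data setup.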
I forgot to mention that the data set is a screenful of pixels, 720*480*4 bytes, so it will not all fit in cache. The method goes through and processes every pixel each frame, so the data can't stay in the cache.
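To put numbers on that, here is a small sketch of the working-set arithmetic. It assumes 4 bytes per pixel, a 32-byte cache line, and an L2 cache of 256 KB or 512 KB. Those are typical Pentium 3 figures but depend on the exact model, so treat them as assumptions. The function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of one frame of pixel data. */
size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Number of cache lines covered by `bytes` of contiguous data, rounded up. */
size_t cache_lines(size_t bytes, size_t line_size)
{
    assert(line_size > 0);
    return (bytes + line_size - 1) / line_size;
}

/* Nonzero if the data could fit entirely in a cache of `cache_bytes`. */
int fits_in_cache(size_t bytes, size_t cache_bytes)
{
    return bytes <= cache_bytes;
}
```

With these assumptions, `frame_bytes(720, 480, 4)` is 1,382,400 bytes, which exceeds even a 512 KB L2, and `cache_lines(1382400, 32)` is 43,200. A single sequential pass over the frame therefore has to bring in on the order of 43,200 lines every frame, no matter how the loop is written.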
There really is no way around that, especially for branch mispredicts. The skew you are seeing comes from the time between the counter overflowing and the interrupt service routine (ISR) firing. With branch mispredicts you will generally see samples all over the place, because the instruction that caused the overflow is a branch, and by the time the sample is taken execution has already moved on to wherever the branch went. Other events, such as cache misses, will usually show up one line of assembly after the instruction that actually caused them on the Pentium 3. On the Pentium 4, some events are precise events: hardware records the IP of the instruction that caused the counter overflow. With these events you will see the sample on the exact line of source that caused the overflow.
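For the non-branch case, the "one line of assembly later" rule of thumb can be applied by hand when reading asm view. The sketch below shows the idea, assuming you have a sorted table of instruction start addresses taken from a disassembly. The table, the function names, and the one-instruction skid are all assumptions for illustration, and none of this is a VTune feature. It does not help with branch mispredicts, where the sample can land far from the branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the instruction containing `ip`, given instruction start
   addresses sorted in ascending order. Returns -1 if `ip` lies before
   the first instruction. */
long instruction_at(const uint32_t *starts, size_t count, uint32_t ip)
{
    assert(starts != NULL);

    size_t lo = 0, hi = count;            /* search window [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (starts[mid] <= ip)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* lo is now the number of start addresses <= ip. */
    return lo == 0 ? -1 : (long)(lo - 1);
}

/* Heuristic: blame the instruction just before the sampled one.
   Returns -1 if there is no earlier instruction in the table. */
long likely_culprit(const uint32_t *starts, size_t count, uint32_t ip)
{
    long at = instruction_at(starts, count, ip);
    return at <= 0 ? -1 : at - 1;
}
```

For example, with start addresses `{0x1000, 0x1003, 0x1008, 0x100a}`, a sample at `0x1009` falls inside the instruction at index 2, so the heuristic points at index 1, the instruction starting at `0x1003`.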