There are some counters in the Cache category, but I cannot tell Copy Read Hits % and Data Map Hits % apart. What I actually want is to count the cache hits that occur when the processor fetches data from cache while referencing a memory address. How can I get a counter result for a single procedure inside a program? That is, how do I count only the part of the code I care about? Can I write code for the counters inside my own program? Is this an Intel secret? Does Intel have a register that can be read from assembly code, or an API on Windows? Thanks!
The documented VTune APIs let you write calls that turn counting on and off wherever you choose in your program, so that the events displayed in the VTune GUI are only those from the regions where your program requests counting. Standard VTune training, and my own experience, doesn't go this deep. Even when events are turned on for an entire profiling run through the GUI, you can see which event counts are associated with the source and generated asm code you are interested in by "drilling down" from the Hotspots view. For a P4 processor, if you go to Configure Event Based Sampling and choose All Events, you will see several L2 cache read events, each with a brief explanation. These don't necessarily map to a single hardware event counter, and a great deal of expert trial and error has gone into programming VTune to analyze the data in a useful way.
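To make the on/off idea concrete, here is a minimal sketch in C. It assumes the VTune analyzer SDK of this era provides a header named VtuneApi.h declaring VTPause() and VTResume(); please check the names against the documentation for your installed version. The macro USE_VTUNE_API, the `collecting` flag, and the functions sum_region() and profile_region() are hypothetical names made up for illustration. Without USE_VTUNE_API defined, no-op stand-ins are used so the sketch still compiles.

```c
/* Assumption: the VTune SDK ships VtuneApi.h with VTPause() and VTResume().
 * Build with -DUSE_VTUNE_API and link the VTune API library to use them.
 * Otherwise the stand-ins below just toggle a flag so the sketch compiles. */
#ifdef USE_VTUNE_API
#include "VtuneApi.h"
#define COLLECT_PAUSE()  VTPause()
#define COLLECT_RESUME() VTResume()
#else
static int collecting = 1;   /* hypothetical flag used only by the stand-ins */
#define COLLECT_PAUSE()  (collecting = 0)
#define COLLECT_RESUME() (collecting = 1)
#endif

/* Hypothetical region of interest: a strided walk over an array. */
static long sum_region(const int *data, int n, int stride)
{
    long sum = 0;
    for (int i = 0; i < n; i += stride)
        sum += data[i];
    return sum;
}

/* Resume collection just before the region and pause it right after,
 * so that samples are attributed only to the code in between. */
static long profile_region(const int *data, int n, int stride)
{
    long result;
    COLLECT_RESUME();
    result = sum_region(data, n, stride);
    COLLECT_PAUSE();
    return result;
}
```

A typical pattern would be to pause collection once at program start (or start the activity paused, if your version offers that option) and then wrap each region of interest as in profile_region().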
What kind of processor are you using? If you are using a Pentium 4, you should use the L1 and L2 cache load misses retired events. A more meaningful metric is the number of L2 cache load misses retired per clock cycle. If you select all the events from the "Performance Tuning Events - Primary" group, the Intel Tuning Assistant in VTune 6.1 can analyze the data and tell you the performance impact of the cache misses in your code. For the Pentium III, you want to use the event group "Events for Tuning Assistant Advice".
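As a rough illustration of the per-clock metric, the sketch below turns sample counts into estimated event counts and divides misses by clocks. It relies on the usual event-based sampling approximation that each sample stands for about "sample after value" occurrences of the event. The function names and the numbers in the usage note are hypothetical, not taken from a real run.

```c
/* Estimated number of events behind a sample count: each sample represents
 * roughly `sample_after` occurrences of the event. */
static double estimated_events(unsigned long samples, unsigned long sample_after)
{
    return (double)samples * (double)sample_after;
}

/* L2 cache load misses retired per clock cycle.
 * Returns 0.0 when no clock samples were recorded, to avoid dividing by zero. */
static double misses_per_clock(unsigned long miss_samples,
                               unsigned long miss_sample_after,
                               unsigned long clk_samples,
                               unsigned long clk_sample_after)
{
    double clocks = estimated_events(clk_samples, clk_sample_after);
    if (clocks == 0.0)
        return 0.0;
    return estimated_events(miss_samples, miss_sample_after) / clocks;
}
```

For example, 50 miss samples at a sample-after value of 10,000 against 2,000 clock samples at a sample-after value of 1,000,000 gives 500,000 / 2,000,000,000 = 0.00025 misses per clock. Comparing this ratio across functions is more telling than comparing raw miss counts, because it accounts for how long each function runs.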
Thanks for your helpful responses! I'm doing research on the P4. The counters in the counter monitor, like Data Map Hits %, made no sense to me, so I turned to L2 cache load misses retired in event-based sampling, which does make sense to me. One question: what is going on with the event-based sampling runs? When I select a lot of events to sample, the program is executed many times and doesn't seem to stop. I heard that calibration should take 2 or 3 runs, so why is the program executed again and again?
Oh, I see. That is because I selected too many events to sample. When I select CPI, L1 cache load miss rate, and L2 cache load miss rate, the program has to be run 6 times. Is there any way to reduce the number of runs? Sometimes I want to sample more than 3 events, and the program takes a long time to run even once, so the performance analyzer ends up taking most of my time.
You may also notice that you can see how many times VTune will run a sampling activity if you go to Configure->ModifyActivity->Events. Near the bottom of the screen is a box labeled "Run Information" that shows the number of calibration and normal runs. Sometimes you can play with different event combinations to reduce the run count.