Is there an Intel API for getting hardware counters from code? I'm talking about something like PAPI where you can start counters at the beginning of a function then stop the counters at the end and read them.
BTW, @John E., I think what Vitaly was getting at is that, if you help us understand your need, it may turn out that VTune Amplifier can already address it. Or it may be something we will consider for a future release. So we would appreciate your comments, and we wish you well with whatever tool you decide to use. :)
Thank you Vitaly and MrAnderson for your responses. I asked this question because I was told by two people (both with much more experience than I have) that this functionality exists in VTune, but I was not able to discover how to use it. We are trying to understand memory usage in a microbenchmark, and it seems to me that querying counters would be simpler, less intrusive and more accurate than the sampling approach. Maybe I'm wrong about that. Are there disadvantages to the PCM-type approach besides the fact that it requires modifying code?
I have compiled and linked my executable with the PCM object files, but it seems I need permissions to execute it. I am running on a shared Linux benchmarking machine. Do I need to talk to the administrator, or is there another way to do this?
When you say "memory usage", do you want to see how much memory your workload allocates, or do you want to analyze memory bandwidth? For the latter, you can create a custom analysis type based on Advanced Hotspots and select the "Analyze memory bandwidth" option; you should then be able to see read/write memory bandwidth over time on the timeline.
If you create a custom analysis type based on the General Exploration type, you can modify any and all of the sample after values. However, there is an easier way, which is to modify the "sampling interval" for the GE type. But, note, increasing the sampling rate is going to introduce more overhead and can therefore cause your results to be less accurate. There is a fine line and you need to walk it carefully when trying to get "more accurate" results.
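To make the trade-off above concrete, here is a rough back-of-the-envelope sketch. It assumes that one sample is taken every "sample after" occurrences of an event, and that each sample costs a fixed amount of time. The function names and the per-sample cost are hypothetical, illustrative values, not figures measured from VTune Amplifier.

```cpp
#include <cassert>
#include <cstdint>

// Expected number of samples when one sample is taken every
// `sample_after` occurrences of the event.
std::uint64_t expected_samples(std::uint64_t event_count,
                               std::uint64_t sample_after) {
    assert(sample_after > 0);
    return event_count / sample_after;
}

// Estimated fraction of the run spent taking samples.
// cost_per_sample_us is an ASSUMED, illustrative cost per sample,
// not a number taken from any tool's documentation.
double overhead_fraction(std::uint64_t samples,
                         double cost_per_sample_us,
                         double runtime_us) {
    assert(runtime_us > 0.0);
    return static_cast<double>(samples) * cost_per_sample_us / runtime_us;
}
```

Under these assumptions, cutting the sample after value by a factor of ten gives ten times as many samples and roughly ten times the sampling overhead, which is the "fine line" mentioned above.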
The real difference between PCM and VTune Amplifier's EBS is that PCM does not give you samples of *where* the events are occurring. You just get counter values. That can be good or bad, depending on what you want to do with the data. If what you want to measure is the cache misses for a loop, using PCM is probably a good idea. It will have lower overhead (although VTune Amplifier's EBS overhead is low) and you can focus on a specific piece of code. VTune Amplifier's EBS will help you narrow your focus to potential problem areas by showing you where in your application you are experiencing the most cache misses (for example).
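For the loop-level case described above, the start/stop/read pattern might look something like the sketch below. Note that this is not PCM and not a VTune Amplifier API. It assumes Linux and uses the kernel's perf_event_open interface as a stand-in, which can often count user-space events for your own process without root, depending on how the administrator has set perf_event_paranoid. ScopedCounter and touch_buffer are hypothetical names chosen for illustration.

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: one perf_event counter for the calling thread.
class ScopedCounter {
public:
    ScopedCounter(std::uint32_t type, std::uint64_t config) {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.type = type;
        attr.size = sizeof(attr);
        attr.config = config;
        attr.disabled = 1;        // stay idle until start() is called
        attr.exclude_kernel = 1;  // user space only; usually needs fewer privileges
        attr.exclude_hv = 1;
        // pid = 0, cpu = -1: count this thread on whichever CPU it runs on.
        fd_ = static_cast<int>(
            syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL));
    }
    ~ScopedCounter() {
        if (fd_ >= 0) close(fd_);
    }
    ScopedCounter(const ScopedCounter&) = delete;
    ScopedCounter& operator=(const ScopedCounter&) = delete;

    // False if the kernel refused the counter (permissions, VM, no PMU...).
    bool ok() const { return fd_ >= 0; }

    // Zero the counter and begin counting.
    void start() {
        if (!ok()) return;
        ioctl(fd_, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd_, PERF_EVENT_IOC_ENABLE, 0);
    }

    // Stop counting and return the count, or -1 if the counter is unavailable.
    long long stop() {
        if (!ok()) return -1;
        ioctl(fd_, PERF_EVENT_IOC_DISABLE, 0);
        std::uint64_t value = 0;
        if (read(fd_, &value, sizeof(value)) !=
            static_cast<ssize_t>(sizeof(value))) {
            return -1;
        }
        return static_cast<long long>(value);
    }

private:
    int fd_ = -1;
};

// Example workload: strided walk over a buffer. The sum is returned so the
// compiler cannot discard the loop.
std::uint64_t touch_buffer(const std::vector<std::uint64_t>& buf,
                           std::size_t stride) {
    assert(stride > 0);
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < buf.size(); i += stride) sum += buf[i];
    return sum;
}
```

To use it, construct `ScopedCounter misses(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES)`, call `misses.start()` before the loop and `misses.stop()` after it. If `ok()` returns false, the counter could not be opened and you will probably still need to talk to the administrator. As with PCM, you get a single number for the region and no information about where inside it the misses occurred.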