I want to count the number of cache misses incurred while accessing a certain variable.
My questions are:
1) Between PAPI and Intel PCM register-based measurement, which has the lowest overhead?
2) Are the PCM registers the same across different Intel processors?
I would appreciate it if somebody could provide some code snippets or point me to an appropriate link.
PCM probably doesn't have less overhead than PAPI, because it uses a different register access method that requires a context switch from user space to kernel space for each register access. PAPI uses the Linux kernel interface perf_event for many measurements. perf_event handles everything in the kernel and provides a file-like interface that applications (PAPI) use to read the results.
The minimal overhead probably comes from using perf_event directly, or from doing it manually: program and start the counter through some interface (PCM, perf_event, PAPI or LIKWID) and then use the rdpmc instruction (only for core performance monitoring registers) around the variable access. The rdpmc instruction should have the least overhead, as it doesn't require context switches. But you may need some kind of serialization to be sure the register read is not executed before your variable access.
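As a rough sketch of the "perf_event directly" route on Linux: the snippet below opens a hardware cache-miss counter for the calling thread and reads it around a single access to the variable. The helper names (open_cache_miss_counter, count_misses_for_read) are made up for illustration, and PERF_COUNT_HW_CACHE_MISSES is only the generic event, which the kernel usually maps to last-level cache misses; a processor-specific raw event might suit your case better. Opening the counter can fail (for example with a restrictive perf_event_paranoid setting), in which case the helper returns -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: open a disabled cache-miss counter for this thread.
 * Returns a file descriptor, or -1 if the kernel refuses. */
static int open_cache_miss_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES; /* generic event, an assumption */
    attr.disabled = 1;       /* start stopped, enable explicitly below */
    attr.exclude_kernel = 1; /* count user-space events only */
    attr.exclude_hv = 1;
    /* pid = 0, cpu = -1: measure the calling thread on whatever CPU it runs */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Hypothetical helper: count cache misses around one read of *var.
 * Returns the counter value, or -1 if the counter is unavailable. */
static long long count_misses_for_read(volatile int *var)
{
    assert(var != NULL);

    int fd = open_cache_miss_counter();
    if (fd < 0)
        return -1;

    uint64_t count = 0;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    int sink = *var; /* the access being measured */
    (void)sink;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    ssize_t n = read(fd, &count, sizeof(count));
    close(fd);

    return n == (ssize_t)sizeof(count) ? (long long)count : -1;
}
```

Calling count_misses_for_read(&x) returns either a count or -1. Note that the enable/disable ioctl calls sit inside the measured window, so the result for a single access will contain noise from the measurement itself; repeating the measurement and comparing against an empty window is probably needed to get anything meaningful.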
Moreover, your measurements will always be dominated by the register accesses; there is no method that reduces the read times to tens of cycles. Reads through the kernel and reads with rdpmc are both somewhere in the range of hundreds of cycles. There are many discussions in this forum about overhead and which method is the fastest.
For your second question: The registers change between architectures, but some are general, so it depends on which register you want to use. Core-local register addresses span multiple architectures (maybe with slight changes in the layout), but Uncore stuff doesn't. Uncore units may have more registers, and the registers have different addresses and/or different layouts.
The only way to get performance counter reads that could even remotely be considered "low overhead" is to use user-mode RDPMC instructions. This only works for the core performance counters on the core that you are currently running on. RDPMC is a single instruction, but it is microcoded, so it is not instantaneous. I typically see intervals of ~24-36 cycles between consecutive counter reads, depending on the processor model. It is entirely possible that the overhead also depends on what event(s) are being counted, but I have not looked for evidence of such variation.
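To make the user-mode RDPMC route concrete, here is a minimal sketch that lets Linux perf_event program the counter and then reads it from user space through the mmap'ed perf_event_mmap_page, following the read protocol described in the perf_event_open man page. The function names are hypothetical, the generic PERF_COUNT_HW_CACHE_MISSES event is an assumption, and wrapping RDPMC in LFENCE is only one possible serialization choice that I am assuming is adequate here. RDPMC is executed only if the kernel reports cap_user_rdpmc; otherwise the helper returns -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#if defined(__x86_64__)
/* Read core counter `counter`; the LFENCEs keep the read from being
 * reordered with neighbouring instructions (an assumed, common choice). */
static inline uint64_t rdpmc_fenced(uint32_t counter)
{
    uint32_t lo, hi;
    __asm__ volatile("lfence\n\trdpmc\n\tlfence"
                     : "=a"(lo), "=d"(hi) : "c"(counter) : "memory");
    return ((uint64_t)hi << 32) | lo;
}
#endif

/* Hypothetical helper: read the event count without a system call.
 * Returns 0 on success, -1 if user-space RDPMC is not available. */
static int read_counter_user(const volatile struct perf_event_mmap_page *pc,
                             int64_t *out)
{
#if defined(__x86_64__)
    uint32_t seq, idx;
    int64_t count;
    do {
        seq = pc->lock;              /* seqlock: retry if kernel updated page */
        __sync_synchronize();
        idx = pc->index;             /* 0 means the event is not on a counter */
        unsigned width = pc->pmc_width;
        if (!pc->cap_user_rdpmc || idx == 0 || width == 0 || width > 64)
            return -1;
        count = pc->offset;
        uint64_t pmc = rdpmc_fenced(idx - 1);
        unsigned shift = 64 - width;
        /* sign-extend the raw value (relies on arithmetic right shift) */
        count += (int64_t)(pmc << shift) >> shift;
        __sync_synchronize();
    } while (pc->lock != seq);
    *out = count;
    return 0;
#else
    (void)pc; (void)out;
    return -1;
#endif
}

/* Hypothetical helper: counter delta across one read of *var, or -1. */
static long long rdpmc_misses_for_read(volatile int *var)
{
    assert(var != NULL);

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES; /* generic event, an assumption */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    int fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0)
        return -1;

    long page = sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    int64_t before = 0, after = 0;
    long long delta = -1;
    if (read_counter_user(p, &before) == 0) {
        int sink = *var; /* the access being measured */
        (void)sink;
        if (read_counter_user(p, &after) == 0)
            delta = after - before;
    }

    munmap(p, (size_t)page);
    close(fd);
    return delta;
}
```

The measured window here contains only the two fenced RDPMC reads and the access itself, with no system calls in between, which is what keeps the overhead low. Calling read_counter_user twice back to back with nothing in between gives an estimate of the read overhead on a particular machine.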
If you look for "mccalpin" and "rdpmc" in these forums you will probably find lots of relevant posts....
You might want to consider libpfc, a lightweight library that offers pretty much just user-mode programming of and access to the counters.
The author reports 240 cycles for one pmc read, which is much higher than the 24-36 that John mentions above, but I think that's for reading all 8 counters, so you can presumably just read 1 if you care about cache misses.
A lot of work has been done to make the reading fairly deterministic and "exact" results are apparently possible for many counter types.