I run 4 threads simultaneously on 4 different cores of a Sandy Bridge machine and want to count metrics such as resource stalls and L2 misses on a per-core basis. I use PAPI counters such as RESOURCE_STALLS:ANY and PAPI_L2_TCA in each thread. Since PAPI counts on a per-thread basis, it should give me the counts for every core separately, because each thread is assigned to a separate core. Is my approach right? Or will there be any issues because all these threads execute simultaneously?
So... let's see if I understand.
1) You have a Sandy Bridge with 4 cores and 2 logical CPUs (hw threads) per core, for a total of 8 hw threads.
2) The software you want to profile runs 4 software threads, and you are setting the affinity so that there is 1 software thread per core.
3) I don't really know PAPI, but I'll take your word for how it works. When you say 'PAPI counts on a thread basis', I assume PAPI counts on a hw thread basis: PAPI measures and reports the events for each hw thread (so if you are measuring RESOURCE_STALLS, PAPI gives you the values on all 8 hw threads).
If PAPI works the way I think it does (counting on each hw thread), then you would need to sum the counts for the 2 hw threads on a core to see what is happening on that core. The counter for the hw thread onto which you pinned your sw thread should tell you the counts which your sw thread caused (assuming nothing else ran on that hw thread, or that PAPI reports per sw thread counts). There is also the scope of the event to consider. Some events have a scope of 'per hw thread'; others have a wider scope, such as 'per socket'. The 'per socket' scope events, for example, will return the same value no matter which core you count them from (as long as the cores are on the same socket), so they should not be summed across hw threads.
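To make the summing and scope points concrete, here is a minimal sketch in Python. It does not call PAPI. It assumes you have already collected one count per hw thread by some means. The topology map, the scope table, the socket-scoped event name, and all counts are made up for illustration; check the real sibling numbering and event scopes on your machine.

```python
from collections import defaultdict

# Hypothetical topology: logical CPU (hw thread) -> physical core.
# On Linux the real mapping can be read from
# /sys/devices/system/cpu/cpuN/topology/core_id.
HW_THREAD_TO_CORE = {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3}

# Assumed scopes, for illustration only. Look up the real scope of each
# event in the processor's event documentation.
EVENT_SCOPE = {
    "RESOURCE_STALLS:ANY": "hw_thread",
    "HYPOTHETICAL_SOCKET_EVENT": "socket",  # not a real event name
}


def per_core_counts(per_thread_counts, event,
                    topology=HW_THREAD_TO_CORE, scopes=EVENT_SCOPE):
    """Combine per-hw-thread counts into one value per physical core."""
    scope = scopes[event]
    per_core = defaultdict(int)
    for cpu, count in per_thread_counts.items():
        core = topology[cpu]
        if scope == "hw_thread":
            # Each hw thread counts only its own events, so the core
            # total is the sum over the sibling hw threads.
            per_core[core] += count
        elif scope == "socket":
            # Every hw thread on the socket reports the same socket-wide
            # value, so summing would overcount. Keep a single copy.
            per_core[core] = count
        else:
            raise ValueError(f"unknown scope: {scope}")
    return dict(per_core)


# Made-up counts: sw threads pinned to hw threads 0-3, with a little
# unrelated activity on hw thread 7 (the sibling of hw thread 3).
stalls = {0: 100, 1: 200, 2: 300, 3: 400, 4: 0, 5: 0, 6: 0, 7: 5}
core_stalls = per_core_counts(stalls, "RESOURCE_STALLS:ANY")

socket_wide = {cpu: 1000 for cpu in range(8)}
core_socket = per_core_counts(socket_wide, "HYPOTHETICAL_SOCKET_EVENT")
```

In this made-up data, core 3's total includes the 5 events from its second hw thread, which the pinned sw thread did not cause. That is the difference between 'what happened on the core' and 'what my sw thread did'.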
I hope this helps,