
Intel PCM vs. perf_event/PAPI correlation


I've been using Intel Performance Counter Monitor (PCM) to validate results for ZSim, a pintool-based microarchitectural simulator. I've been having issues with PCM's accuracy relative to other performance counter tools.

For certain multithreaded benchmarks, PCM and pintool are returning wildly different instruction counts. This happens even with one thread (but with locks, compare-and-swaps, etc. remaining). At first I thought this could be attributed to syscalls, but after testing out Linux perf_event performance counters, I discovered that it matches with pintool. I also tested with PAPI, a library that wraps the Linux perf_event interface. Any ideas as to what's going on?


For PCM, I copied the example code. I created SystemCounterStates and then called getInstructionsRetired(BeforeState[i], AfterState[i]) for each core i. Threads are pinned to cores.

I tested on a custom Breadth-First-Search graph problem:

PCM: 103,852,770 instructions

Pin 2.14: 73,739,015 instructions

Pin 3.0: 73,739,015 instructions

PAPI: 73,202,192 instructions

Directly calling perf_event: 75,610,848 instructions

Directly calling perf_event with exclude_kernel enabled: 73,199,366 instructions


As we can see, Pin correlates with the Linux perf_event results with exclude_kernel enabled (i.e. only measuring user-space code). PCM, by contrast, reports roughly 40% more instructions than Pin. Any ideas what's going on?
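For reference, the size of each gap can be worked out from the counts listed above. This is only a small arithmetic sketch; the helper name relative_excess is made up for illustration and is not part of PCM, Pin, or PAPI.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: fractional excess of `measured` over `reference`.
// A result of 0.25 means `measured` is 25% larger than `reference`.
double relative_excess(std::uint64_t measured, std::uint64_t reference) {
    assert(reference != 0);  // a zero reference count makes the ratio meaningless
    return (static_cast<double>(measured) - static_cast<double>(reference))
           / static_cast<double>(reference);
}

// Instruction counts copied from the measurements in this post.
const std::uint64_t kPcm          = 103852770;  // PCM
const std::uint64_t kPin          = 73739015;   // Pin 2.14 and Pin 3.0
const std::uint64_t kPerfAll      = 75610848;   // perf_event, user + kernel
const std::uint64_t kPerfUserOnly = 73199366;   // perf_event, exclude_kernel
```

With these numbers, relative_excess(kPcm, kPin) is about 0.41, relative_excess(kPerfAll, kPerfUserOnly) is about 0.033, and relative_excess(kPin, kPerfUserOnly) is about 0.007. So kernel-mode instructions of my own threads account for only about 3% of the count, which is far too little to explain the PCM gap.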

This is how I'm initializing and calling PCM (from my header file). I call getBeforeStates() before I call my kernel, and getAfterStates() after the kernel has completed. I measure using perf_event and PAPI in the same way.

PCM* m;
SystemCounterState SysBeforeState, SysAfterState;
//const uint32 ncores = m->getNumCores();
std::vector<CoreCounterState> BeforeState, AfterState;
std::vector<SocketCounterState> DummySocketStates;

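// Snapshot all counters just before the measured kernel runs.
void getBeforeStates() {
    m->getAllCounterStates(SysBeforeState, DummySocketStates, BeforeState);
}

// Snapshot all counters right after the measured kernel finishes.
void getAfterStates() {
    m->getAllCounterStates(SysAfterState, DummySocketStates, AfterState);
}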

void initPCM(PCMEvent* WSMEvents) {
    m = PCM::getInstance();

    PCM::ExtendedCustomCoreEventDescription conf;
    conf.fixedCfg = NULL; // default
    conf.nGPCounters = 4;
    EventSelectRegister regs[4];
    conf.gpCounterCfg = regs;

    // Template register: count in both user (usr) and kernel (os) mode.
    EventSelectRegister def_event_select_reg;
    def_event_select_reg.value = 0;
    def_event_select_reg.fields.usr = 1;
    def_event_select_reg.fields.os = 1;
    def_event_select_reg.fields.enable = 1;
    for (int i = 0; i < 4; ++i)
        regs[i] = def_event_select_reg;

    // Program each general-purpose counter with its own event and umask.
    for (int i = 0; i < 4; i++) {
        regs[i].fields.event_select = WSMEvents[i].event;
        regs[i].fields.umask = WSMEvents[i].umask;
    }

    PCM::ErrorCode status = m->program(PCM::EXT_CUSTOM_CORE_EVENTS, &conf);
}

void printCoreStats(PCMEvent* WSMEvents) {
    uint32_t numCores = m->getNumCores();
    uint64_t sum = 0;

    // Find critical path: the core with the most elapsed cycles
    uint64_t max = 0;
    uint32_t maxIdx = -1;
    for (uint32_t i = 0; i < numCores; i++) {
        uint64_t cycles = getCycles(BeforeState[i], AfterState[i]);
        if (cycles > max) {
            max = cycles;
            maxIdx = i;
        }
    }
    cout << "Cycles: " << max << "\n";
}

Hi Dan Z,

PCM counts events per hardware thread (logical core), per socket (CPU), and for the whole system. It therefore counts everything that executes on those cores during the measurement interval, not only the events triggered by your program's threads.
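The same distinction exists in the Linux perf_event interface mentioned in the question, where the pid and cpu arguments of perf_event_open select the scope. The sketch below is only an illustration under assumptions: CounterScope, this_thread_scope, whole_cpu_scope, instructions_attr, and open_counter are hypothetical names, it requires Linux with the kernel headers installed, and it is not how PCM itself is implemented.

```cpp
#include <cassert>
#include <cstring>
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper type: which tasks and CPUs a counter observes.
struct CounterScope {
    pid_t pid;  // 0 = calling thread only, -1 = every task
    int cpu;    // -1 = whichever CPU the task runs on, N = logical CPU N only
};

// Per-thread scope: follows the calling thread wherever it is scheduled.
CounterScope this_thread_scope() { return {0, -1}; }

// Per-CPU scope: everything that runs on one logical CPU, whoever owns it.
// This is the scope closest to a per-core reading such as PCM's.
CounterScope whole_cpu_scope(int cpu) {
    assert(cpu >= 0);
    return {-1, cpu};
}

// Attribute block for the "instructions retired" hardware event.
perf_event_attr instructions_attr(bool exclude_kernel) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;                             // enable later via ioctl
    attr.exclude_kernel = exclude_kernel ? 1 : 0;  // 1 = user-space only
    attr.exclude_hv = 1;
    return attr;
}

// Thin wrapper over the raw syscall (glibc provides no dedicated wrapper).
// Returns a file descriptor, or -1 with errno set.
int open_counter(perf_event_attr attr, CounterScope scope) {
    return static_cast<int>(
        syscall(SYS_perf_event_open, &attr, scope.pid, scope.cpu, -1, 0UL));
}
```

A counter opened with this_thread_scope() sees only the calling thread, which is the kind of number Pin and PAPI report. A counter opened with whole_cpu_scope(n) also includes any other process, kernel thread, or interrupt handler that ran on that core in the meantime, and it normally needs elevated privileges (for example a low perf_event_paranoid setting). After opening, the counter is started and stopped with the PERF_EVENT_IOC_ENABLE and PERF_EVENT_IOC_DISABLE ioctls and read as a 64-bit value with read().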


