Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Intel PCM vs. perf_event/PAPI correlation


I've been using Intel Performance Counter Monitor (PCM) to validate results for ZSim, a pintool-based microarchitectural simulator. The Intel PCM website directed me to this forum for help.

For certain multithreaded benchmarks, PCM and the pintool return wildly different instruction counts. This happens even with one thread (with the locks, compare-and-swaps, etc. still in place). At first I thought the difference could be attributed to syscalls, but after testing with Linux perf_event performance counters, I found that perf_event matches the pintool. I also tested with PAPI, a library that wraps the Linux perf_event interface. Any ideas as to what's going on?


For PCM, I copied the example code. I captured counter states before and after the run and then called getInstructionsRetired(BeforeState[i], AfterState[i]) for each core i. Threads are pinned to cores.

I tested on a custom Breadth-First-Search graph problem:

PCM: 103,852,770 instructions

Pin 2.14: 73,739,015 instructions

Pin 3.0: 73,739,015 instructions

PAPI: 73,202,192 instructions

Directly calling perf_event: 75,610,848 instructions

Directly calling perf_event with exclude_kernel enabled: 73,199,366 instructions
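
To put the counts above on a common scale, here is a minimal sketch that expresses each one as a percentage difference from the Pin count. The function name percentDiff and the constant names are made up for illustration; only the numbers come from the measurements listed above.

```cpp
#include <cassert>
#include <cstdint>

// Instruction counts copied from the measurements listed above.
constexpr std::uint64_t kPcm            = 103852770;
constexpr std::uint64_t kPin            = 73739015;
constexpr std::uint64_t kPapi           = 73202192;
constexpr std::uint64_t kPerfWithKernel = 75610848;
constexpr std::uint64_t kPerfUserOnly   = 73199366;

// Relative difference of a measured count against a reference count, in percent.
// Positive means the measurement is higher than the reference.
double percentDiff(std::uint64_t measured, std::uint64_t reference) {
    assert(reference != 0);
    const double m = static_cast<double>(measured);
    const double r = static_cast<double>(reference);
    return 100.0 * (m - r) / r;
}
```

With these inputs, percentDiff(kPcm, kPin) comes out at roughly 40.8, while the PAPI and user-only perf_event counts both land within 1% of Pin and the perf_event count that includes the kernel is about 2.5% above it.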


As we can see, Pin agrees with the Linux perf_event results when exclude_kernel is enabled (i.e., when only user-space code is measured). The Intel Performance Counter Monitor count, by contrast, is about 40% higher than Pin's. Any ideas what's going on?
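
For reference, a user-space-only instruction count can be taken through perf_event roughly as in the sketch below. This is not the exact code used for the numbers above; the helper names make_instruction_attr and count_instructions are hypothetical, and the sketch assumes Linux with perf_event access permitted. It opens the counter for the calling thread only (pid 0, cpu -1), so it does not see other activity on the same core, whereas PCM reads core-wide counters.

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// Hypothetical helper: attribute block for counting retired instructions.
// user_only == true sets exclude_kernel, so only user-space code is counted.
perf_event_attr make_instruction_attr(bool user_only) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;                        // start stopped, enable explicitly
    attr.exclude_kernel = user_only ? 1 : 0;
    return attr;
}

// Hypothetical helper: runs kernel() and returns the instructions retired by
// the calling thread, or -1 if the counter could not be opened or read.
template <typename F>
long long count_instructions(F&& kernel, bool user_only) {
    perf_event_attr attr = make_instruction_attr(user_only);
    // pid = 0 (calling thread), cpu = -1 (any CPU), no group, no flags.
    int fd = static_cast<int>(syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0UL));
    if (fd < 0) return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    kernel();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    std::uint64_t count = 0;
    ssize_t n = read(fd, &count, sizeof(count));
    close(fd);
    return n == static_cast<ssize_t>(sizeof(count)) ? static_cast<long long>(count) : -1;
}
```

Calling count_instructions(myKernel, true) and count_instructions(myKernel, false) around the same workload (myKernel being a placeholder) gives the user-only and user-plus-kernel counts corresponding to the two perf_event rows above.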

This is how I'm initializing and calling PCM (from my header file). I call getBeforeStates() before I call my kernel, and getAfterStates() after the kernel has completed. I measure using perf_event and PAPI in the same way.

#include <cstdint>
#include <iostream>
#include <vector>
#include "cpucounters.h"  // Intel PCM

// PCMEvent (defined elsewhere) carries the .event and .umask fields used below.
PCM* m;
SystemCounterState SysBeforeState, SysAfterState;
//const uint32 ncores = m->getNumCores();
std::vector<CoreCounterState> BeforeState, AfterState;
std::vector<SocketCounterState> DummySocketStates;

// Snapshot all counters right before the kernel runs.
void getBeforeStates() {
    m->getAllCounterStates(SysBeforeState, DummySocketStates, BeforeState);
}

// Snapshot all counters right after the kernel completes.
void getAfterStates() {
    m->getAllCounterStates(SysAfterState, DummySocketStates, AfterState);
}

void initPCM(PCMEvent* WSMEvents) {
    m = PCM::getInstance();

    PCM::ExtendedCustomCoreEventDescription conf;
    conf.fixedCfg = NULL; // default
    conf.nGPCounters = 4;
    EventSelectRegister regs[4];
    conf.gpCounterCfg = regs;
    EventSelectRegister def_event_select_reg;
    def_event_select_reg.value = 0;
    def_event_select_reg.fields.usr = 1;    // count user-mode events
    def_event_select_reg.fields.os = 1;     // count kernel-mode events as well
    def_event_select_reg.fields.enable = 1;
    for (int i = 0; i < 4; ++i)
        regs[i] = def_event_select_reg;

    // Program the four general-purpose counters with the requested events.
    for (int i = 0; i < 4; i++) {
        regs[i].fields.event_select = WSMEvents[i].event;
        regs[i].fields.umask = WSMEvents[i].umask;
    }

    PCM::ErrorCode status = m->program(PCM::EXT_CUSTOM_CORE_EVENTS, &conf);
    if (status != PCM::Success)
        std::cerr << "PCM programming failed, error code " << status << "\n";
}

void printCoreStats(PCMEvent* WSMEvents) {
    uint32_t numCores = m->getNumCores();
    uint64_t sum = 0;

    // Find critical path: the core that spent the most cycles
    uint64_t max = 0;
    uint32_t maxIdx = -1;
    for (uint32_t i = 0; i < numCores; i++) {
        uint64_t cycles = getCycles(BeforeState[i], AfterState[i]);
        if (cycles > max) {
            max = cycles;
            maxIdx = i;
        }
    }
    std::cout << "Cycles: " << max << "\n";
}


2 Replies

Bump. Does anyone know if this is the right place to ask?

Questions about PCM are usually handled on the software performance forum. Contrary to the header, this forum appears to have no moderator with the privilege to redirect threads.