Measuring FLOPS on an Intel i7 Skylake CPU using PCM

Huda_I_ · ‎03-08-2017

Hello,

How can I use PCM to measure the FLOPS of part of my program? I'm trying to specify custom events as below:

PCM::CustomCoreEventDescription events[NB_EVENTS];
double values[NB_EVENTS+1];
events[1].event_number = 0x10;
events[1].umask_value  = 0x01;
events[0].event_number = 0x10;
events[0].umask_value  = 0x80;
events[2].event_number = 0x10;
events[2].umask_value  = 0x10;
events[3].event_number = 0x11;
events[3].umask_value  = 0x02;

PCM * m = PCM::getInstance();
m->disableJKTWorkaround();
m->resetPMU();
if (m->program(PCM::CUSTOM_CORE_EVENTS,&events) != PCM::Success)  return;

SystemCounterState before_sstate = getSystemCounterState();
compute(h, n, k, A, B, C);
SystemCounterState after_sstate = getSystemCounterState();

for ( int i=0; i < NB_EVENTS; i++ ) {
        uint64 value = getNumberOfCustomEvents(i, before_sstate, after_sstate);
        values[i+1] = (double) value;
        printf("Event %0d: 0x%04x0x%04x: %lld\n", i+1, events[i].event_number, events[i].umask_value, value);
}

This results in:

Event 1: 0x00100x0080: 0

Event 2: 0x00100x0001: 0

Event 3: 0x00100x0010: 0

Event 4: 0x00110x0002: 0

Is there something wrong with the way I'm setting the events, or is it the masks? I can't find clear documentation of the event numbers and masks for floating-point operations on Skylake (I'm using an Intel i7-6700HQ).

Roman_D_Intel · ‎03-09-2017

Hi,

You can use the pmu-query.py Python script to search and query the events available on your processor. You can then use the event/umask in your code, or pass it to the PCM pcm-core.x utility as a command-line parameter, to monitor the events of interest.

Thanks,

Roman

McCalpinJohn · ‎03-09-2017

It looks like you are trying to use the old performance counter events 0x10 and 0x11 that were disabled starting with Haswell. As you can see, you are still allowed to program these event numbers, but they always return zeros.

New FP performance counters were added starting with Broadwell.

For Skylake these are documented at https://download.01.org/perfmon/SKL/Skylake_core_V24.json, with additional information on how to scale the results at https://download.01.org/perfmon/SKL/Skylake_FP_ARITH_INST_V24.json
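
The scaling mentioned above can be sketched as follows. This is only an illustration, assuming the usual reading of the FP_ARITH_INST_RETIRED events: each counted instruction is weighted by the number of double-precision elements it operates on (1 for scalar, 2 for 128-bit packed, 4 for 256-bit packed). The struct and function names are hypothetical and are not part of PCM.

```cpp
#include <cstdint>

// Raw counts of three FP_ARITH_INST_RETIRED sub-events (double precision only).
// Hypothetical helper type, not a PCM structure.
struct FpArithCounts {
    std::uint64_t scalar_double;     // event 0xC7, umask 0x01
    std::uint64_t packed128_double;  // event 0xC7, umask 0x04
    std::uint64_t packed256_double;  // event 0xC7, umask 0x10
};

// Weight each count by the number of doubles per instruction:
// 1 (scalar), 2 (128-bit vector), 4 (256-bit vector).
std::uint64_t double_flops(const FpArithCounts& c) {
    return c.scalar_double
         + 2 * c.packed128_double
         + 4 * c.packed256_double;
}
```

Single-precision events would need their own weights (1, 4 and 8), and the event descriptions state that fused multiply-add instructions already increment the counter twice, so no extra factor should be needed for them.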

Huda_I_ · ‎03-09-2017

Thank you for your replies!

I've updated the event and mask numbers. Unfortunately, I still can't interpret the results. For example:

PCM::CustomCoreEventDescription events[NB_EVENTS];
double values[NB_EVENTS+1];
// FP_ARITH_INST_RETIRED.SCALAR_DOUBLE
events[0].event_number = 0xC7;
events[0].umask_value  = 0x01;
// FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE
events[1].event_number = 0xC7;
events[1].umask_value  = 0x04;
// FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE
events[2].event_number = 0xC7;
events[2].umask_value  = 0x10;

PCM * m = PCM::getInstance();
m->disableJKTWorkaround();
m->resetPMU();
PCM::ErrorCode status = m->program(PCM::CUSTOM_CORE_EVENTS, events);

double aa=1,bb=2;
SystemCounterState before_sstate = getSystemCounterState();
aa = aa + bb;
SystemCounterState after_sstate = getSystemCounterState();

for ( int i=0; i < NB_EVENTS; i++ ) {
        uint64 value = getNumberOfCustomEvents(i, before_sstate, after_sstate);
        printf("Event %0d: 0x%04x0x%04x: %lld\n", i+1, events[i].event_number, events[i].umask_value, value);
}

The output:

Event 1: 0x00c70x0001: 4609

Event 2: 0x00c70x0004: 2236

Event 3: 0x00c70x0010: 42

Clearly this doesn't reflect the number of FLOPs in (aa = aa + bb;).

McCalpinJohn · ‎03-09-2017

I am not sure how many FLOPS you were expecting, but it is typically a good idea to have a loop with a controllable number of FLOPS so that you can look at expected vs reported values....

In a recent test, I used a version of the STREAM benchmark that I expected to generate slightly over 50.5 billion counts, and "perf stat" reported 50.875 billion counts using the 0xC7 counters programmed with the "raw" events interface. I have not tested on a wide variety of systems or with all supported instruction sets, but the hardware counting looks accurate so far.
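
A loop with a controllable number of FLOPS might look like the sketch below. It is not the STREAM code referred to above, just a minimal kernel with a known operation count; the function name is hypothetical. It assumes the compiler is not allowed to reassociate floating-point arithmetic (for example, no -ffast-math) and that the result array is used afterwards, so that the additions are not optimized away.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Accumulate kernel: c[i] += a[i], repeated `repetitions` times.
// Each inner iteration performs exactly one double-precision addition,
// so the expected FLOP count is c.size() * repetitions, whether the
// compiler emits scalar or packed instructions.
// Assumes a.size() >= c.size().
std::uint64_t accumulate_kernel(std::vector<double>& c,
                                const std::vector<double>& a,
                                std::size_t repetitions) {
    const std::size_t n = c.size();
    for (std::size_t r = 0; r < repetitions; ++r) {
        for (std::size_t i = 0; i < n; ++i) {
            c[i] += a[i];
        }
    }
    // Number of additions the counters are expected to report.
    return static_cast<std::uint64_t>(n) * repetitions;
}
```

Placing a call to this kernel between the two counter snapshots and comparing its return value with the scaled counter readings gives the expected-versus-reported check; the array size and repetition count can be raised until the expected count dwarfs any background activity.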

Huda_I_ · ‎03-09-2017

I'm just measuring (a = a + b), where a and b are plain doubles, as shown below:

double aa=1,bb=2;
SystemCounterState before_sstate = getSystemCounterState();
aa = aa + bb;
SystemCounterState after_sstate = getSystemCounterState();

Hence, I'm expecting much smaller counts than what I'm getting (shown in my previous post).

Roman_D_Intel · ‎03-10-2017

Hi,

getSystemCounterState collects statistics from all cores on the system, not just from your code. PCM is a processor-centric API (there is no mapping to user threads). You can limit collection to a single logical core with getCoreCounterState, but then you need to pin your thread to that logical core (e.g. with a pthread_setaffinity_np call, or by running your program under the taskset utility on Linux). Even this does not guarantee that the OS will not interrupt your program and schedule something else on that logical core during the measurement. Another aspect is that the PCM API itself might perform some FP computation internally, which adds to the measured statistics.
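
The pinning step can be sketched as below, assuming Linux with glibc. It uses sched_setaffinity on the calling thread rather than pthread_setaffinity_np, and pins to whichever logical core the thread is already running on; the function name is hypothetical.

```cpp
#include <sched.h>  // sched_getcpu, sched_setaffinity, CPU_* macros (Linux/glibc)

// Pin the calling thread to the logical core it is currently running on.
// Returns the core index on success, or -1 on failure.
int pin_to_current_core() {
    const int cpu = sched_getcpu();
    if (cpu < 0 || cpu >= CPU_SETSIZE) {
        return -1;
    }
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    // A pid of 0 applies the mask to the calling thread.
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    return cpu;
}
```

The returned index is the logical core to pass to getCoreCounterState before and after the measured region. That part is not shown here because it requires PCM and access to the performance counters. As noted above, pinning does not stop the OS from running other work on the same core.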

Following John's advice, you can significantly increase the amount of computation in the measurement region to minimize the relative impact of these side effects.
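
As a rough way to decide how much computation is enough, the sketch below estimates the smallest workload for which a given background count stays under a chosen fraction of the workload's own count. It assumes the background is a fixed number of events, which is a simplification: activity from other cores and from the OS grows with the length of the measurement. The function name is hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Smallest workload (in counted events) for which a fixed background of
// `background_events` is at most `max_fraction` of the workload's own count.
// Assumes 0 < max_fraction <= 1.
std::uint64_t min_workload_events(std::uint64_t background_events,
                                  double max_fraction) {
    const double needed = static_cast<double>(background_events) / max_fraction;
    return static_cast<std::uint64_t>(std::ceil(needed));
}
```

For example, the three counts reported earlier in the thread for a single addition sum to 6887 events. Treating that as the background and asking for at most 1% distortion suggests a workload of roughly 700,000 counted events or more.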

Thanks,

Roman