Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Measuring FLOPS on intel i7 Skylake CPU using PCM

Huda_I_
Beginner
2,059 views

Hello,

How can I use PCM to measure the FLOPS of part of my program? I'm trying to specify custom events as below:

PCM::CustomCoreEventDescription events[NB_EVENTS];
double values[NB_EVENTS+1];
events[1].event_number = 0x10;
events[1].umask_value  = 0x01;
events[0].event_number = 0x10;
events[0].umask_value  = 0x80;
events[2].event_number = 0x10;
events[2].umask_value  = 0x10;
events[3].event_number = 0x11;
events[3].umask_value  = 0x02;

PCM * m = PCM::getInstance();
m->disableJKTWorkaround();
m->resetPMU();
if (m->program(PCM::CUSTOM_CORE_EVENTS,&events) != PCM::Success)  return;

SystemCounterState before_sstate = getSystemCounterState();
compute(h, n, k, A, B, C);
SystemCounterState after_sstate = getSystemCounterState();

for ( int i=0; i < NB_EVENTS; i++ ) {
        uint64 value = getNumberOfCustomEvents(i, before_sstate, after_sstate);
        values[i+1] = (double) value;
        printf("Event %0d: 0x%04x0x%04x: %lld\n", i+1, events[i].event_number, events[i].umask_value, value);
}

 

This results in: 

Event 1: 0x00100x0080: 0

Event 2: 0x00100x0001: 0

Event 3: 0x00100x0010: 0

Event 4: 0x00110x0002: 0

 

Is there something wrong with the way I'm setting the events? Or is it the masks? I can't find clear documentation of the event numbers and umasks for floating-point operations on Skylake (I'm using an Intel i7-6700HQ).

6 Replies
Roman_D_Intel
Employee

Hi,

You can use the pmu-query.py Python script to search/query the events available on your processor. You can then use the event/umask values in your code, or pass them to PCM's pcm-core.x utility as command-line parameters to monitor the events of interest.

Thanks,

Roman

McCalpinJohn
Honored Contributor III

It looks like you are trying to use the old performance counter events 0x10 and 0x11 that were disabled starting with Haswell.  As you can see, you are still allowed to program these event numbers, but they always return zeros.

New FP performance counters were added starting with Broadwell. 

For Skylake these are documented at https://download.01.org/perfmon/SKL/Skylake_core_V24.json, with additional information on how to scale the results at https://download.01.org/perfmon/SKL/Skylake_FP_ARITH_INST_V24.json
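As a rough illustration of the scaling mentioned above, here is a minimal sketch that turns raw double-precision FP_ARITH_INST_RETIRED counts into a FLOP estimate. The struct and function names are hypothetical (they are not part of PCM). The weights assume one element per scalar instruction, two per 128-bit packed instruction, and four per 256-bit packed instruction. As far as I understand the event descriptions, FMA instructions already increment the counter twice, so no extra factor is applied for them here.

```cpp
#include <cstdint>

// Raw counts read from event 0xC7 (FP_ARITH_INST_RETIRED), double precision.
struct FpArithCounts {
    std::uint64_t scalar_double;     // umask 0x01
    std::uint64_t packed128_double;  // umask 0x04
    std::uint64_t packed256_double;  // umask 0x10
};

// Hypothetical helper (not a PCM function): weight each count by the number
// of double-precision elements processed per instruction.
std::uint64_t estimate_double_flops(const FpArithCounts& c) {
    return c.scalar_double * 1 +
           c.packed128_double * 2 +
           c.packed256_double * 4;
}
```

For example, 10 scalar, 5 128-bit packed, and 2 256-bit packed instructions would give an estimate of 10 + 10 + 8 = 28 FLOPs.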

Huda_I_
Beginner

Thank you for your replies!

I've updated the event numbers and umasks. Unfortunately, I still can't interpret the results. For example:

PCM::CustomCoreEventDescription events[NB_EVENTS];
double values[NB_EVENTS+1];
// FP_ARITH_INST_RETIRED.SCALAR_DOUBLE
events[0].event_number = 0xC7;
events[0].umask_value  = 0x01;
// FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE
events[1].event_number = 0xC7;
events[1].umask_value  = 0x04;
// FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE
events[2].event_number = 0xC7;
events[2].umask_value  = 0x10;

PCM * m = PCM::getInstance();
m->disableJKTWorkaround();
m->resetPMU();
PCM::ErrorCode status = m->program(PCM::CUSTOM_CORE_EVENTS, events);

double aa=1,bb=2;
SystemCounterState before_sstate = getSystemCounterState();
aa = aa + bb;
SystemCounterState after_sstate = getSystemCounterState();

for ( int i=0; i < NB_EVENTS; i++ ) {
        uint64 value = getNumberOfCustomEvents(i, before_sstate, after_sstate);
        printf("Event %0d: 0x%04x0x%04x: %lld\n", i+1, events[i].event_number, events[i].umask_value, value);
}

The output: 

Event 1: 0x00c70x0001: 4609

Event 2: 0x00c70x0004: 2236

Event 3: 0x00c70x0010: 42

 
Clearly, this doesn't reflect the number of FLOPs in (aa = aa + bb;).
 
McCalpinJohn
Honored Contributor III

I am not sure how many FLOPS you were expecting, but it is typically a good idea to use a loop with a controllable number of FLOPS so that you can compare the expected and reported values.

In a recent test, I used a version of the STREAM benchmark that I expected to generate slightly over 50.5 billion counts, and "perf stat" reported 50.875 billion counts using the 0x37 counters programmed with the "raw" events interface.   I have not tested on a wide variety of systems or with all supported instruction sets, but the hardware counting looks accurate so far.
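A loop with a controllable FLOP count, as suggested above, might look like the following minimal sketch. The function names are hypothetical and this is not the STREAM benchmark itself. Each element update performs one multiply and one add, so the expected total is 2 * n * reps. The update depends on the previous value of a[i], which should make it harder for the compiler to drop repetitions, though checking the generated code is still advisable.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical test kernel: a[i] = b[i] + scalar * a[i], repeated `reps` times.
// Each element update is one multiply plus one add, i.e. 2 FLOPs.
void triad_update(std::vector<double>& a, const std::vector<double>& b,
                  double scalar, std::size_t reps) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t r = 0; r < reps; ++r) {
        for (std::size_t i = 0; i < n; ++i) {
            a[i] = b[i] + scalar * a[i];
        }
    }
}

// Expected FLOP count for the kernel above, to compare with the counters.
std::uint64_t expected_flops(std::size_t n, std::size_t reps) {
    return 2ull * n * reps;
}
```

Reading the counters immediately before and after a call to triad_update with a large n and reps, and comparing the scaled counts against expected_flops(n, reps), gives a measurement region in which the measurement overhead is small relative to the work.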

Huda_I_
Beginner

I'm just measuring (a=a+b), where a and b are plain doubles, as shown below:

 
double aa=1,bb=2;
SystemCounterState before_sstate = getSystemCounterState();
aa = aa + bb;
SystemCounterState after_sstate = getSystemCounterState();

Hence, I'm expecting much smaller counts than what I'm getting (shown in my previous post).

 

Roman_D_Intel
Employee

Hi,

getSystemCounterState collects statistics from all cores on the system, not just from your code. PCM is a processor-centric API (there is no mapping to user threads). You can limit collection to a single logical core with getCoreCounterState, but then you need to pin your thread to that logical core (e.g. with a pthread_setaffinity_np call, or by running your program under the taskset utility on Linux). Even this does not guarantee that the OS will not interrupt your program and schedule something else on that logical core during the measurement. Another factor is that the PCM API itself might perform some FP computation internally, which adds to the measured statistics.

Following John's advice, you can significantly increase the amount of computation in the measurement region to minimize the relative side effects.
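To illustrate the pinning step, here is a minimal Linux-only sketch that uses sched_setaffinity, which with a pid of 0 applies to the calling thread, in place of pthread_setaffinity_np. The helper name is hypothetical. It picks the first CPU the thread is currently allowed to run on and returns its index, which could then be used as the core argument when reading per-core counters.

```cpp
#include <sched.h>  // sched_getaffinity, sched_setaffinity, CPU_* macros (Linux/glibc)

// Hypothetical helper: pin the calling thread to the first CPU in its
// currently allowed set. Returns the CPU index, or -1 on failure.
int pin_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            // pid 0 means "the calling thread".
            if (sched_setaffinity(0, sizeof(one), &one) != 0) {
                return -1;
            }
            return cpu;
        }
    }
    return -1;
}
```

Pinning only keeps the thread on one logical core. It does not stop the OS from scheduling other work on that core, so the caveats above still apply.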

Thanks,

Roman
