Hello,
How can I use PCM to measure the FLOPS of part of my program? I'm trying to specify custom events as below:
    PCM::CustomCoreEventDescription events[NB_EVENTS];
    double values[NB_EVENTS+1];
    events[1].event_number = 0x10;
    events[1].umask_value = 0x01;
    events[0].event_number = 0x10;
    events[0].umask_value = 0x80;
    events[2].event_number = 0x10;
    events[2].umask_value = 0x10;
    events[3].event_number = 0x11;
    events[3].umask_value = 0x02;
    PCM * m = PCM::getInstance();
    m->disableJKTWorkaround();
    m->resetPMU();
    if (m->program(PCM::CUSTOM_CORE_EVENTS,&events) != PCM::Success) return;
    SystemCounterState before_sstate = getSystemCounterState();
    compute(h, n, k, A, B, C);
    SystemCounterState after_sstate = getSystemCounterState();
    for ( int i=0; i < NB_EVENTS; i++ ) {
        uint64 value = getNumberOfCustomEvents(i, pcm_before, pcm_after);
        values[i+1] = (double) value;
        printf("Event %0d: 0x%04x0x%04x: %lld\n", i+1, events[i].event_number, events[i].umask_value, value);
    }
This results in:
Event 1: 0x00100x0080: 0
Event 2: 0x00100x0001: 0
Event 3: 0x00100x0010: 0
Event 4: 0x00110x0002: 0
Is there something wrong with the way I'm setting the events, or is it the masks? I can't find clear documentation of the event numbers and umasks for floating-point operations on Skylake (I'm using an Intel Core i7-6700HQ).
Hi,
You can use the pmu-query.py Python script to search or query the events available on your processor. You can then use the event/umask values in your code, or pass them to the PCM pcm-core.x utility as command-line parameters, to monitor the events of interest.
Thanks,
Roman
It looks like you are trying to use the old performance counter events 0x10 and 0x11 that were disabled starting with Haswell. As you can see, you are still allowed to program these event numbers, but they always return zeros.
New FP performance counters were added starting with Broadwell.
For Skylake these are documented at https://download.01.org/perfmon/SKL/Skylake_core_V24.json, with additional information on how to scale the results at https://download.01.org/perfmon/SKL/Skylake_FP_ARITH_INST_V24.json
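As a rough illustration of the scaling step, here is a minimal sketch that turns raw FP_ARITH_INST_RETIRED counts into an estimated double-precision FLOP count. It assumes the scaling factor for each sub-event is the number of double elements per operand (1 for scalar, 2 for 128-bit packed, 4 for 256-bit packed); check the JSON files above for the authoritative factors. The struct and function names are hypothetical and are not part of PCM.

```cpp
#include <cstdint>

// Hypothetical container for raw FP_ARITH_INST_RETIRED counts (event 0xC7).
struct FpArithCounts {
    std::uint64_t scalar_double;     // umask 0x01
    std::uint64_t packed128_double;  // umask 0x04
    std::uint64_t packed256_double;  // umask 0x10
};

// Estimate double-precision FLOPs by weighting each count with the
// assumed number of double elements per operation: 1, 2 and 4.
std::uint64_t doublePrecisionFlops(const FpArithCounts& c) {
    return c.scalar_double * 1
         + c.packed128_double * 2
         + c.packed256_double * 4;
}
```

For example, raw counts of 4609, 2236 and 42 would scale to 4609 + 2*2236 + 4*42 = 9249 estimated FLOPs under these assumptions.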
Thank you for your replies!
I've updated the event numbers and umasks. Unfortunately, I still can't interpret the results. For example:
    PCM::CustomCoreEventDescription events[NB_EVENTS];
    double values[NB_EVENTS+1];
    // FP_ARITH_INST_RETIRED.SCALAR_DOUBLE
    events[0].event_number = 0xC7;
    events[0].umask_value = 0x01;
    // FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE
    events[1].event_number = 0xC7;
    events[1].umask_value = 0x04;
    // FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE
    events[2].event_number = 0xC7;
    events[2].umask_value = 0x10;
    PCM * m = PCM::getInstance();
    m->disableJKTWorkaround();
    m->resetPMU();
    PCM::ErrorCode status = m->program(PCM::CUSTOM_CORE_EVENTS, events);
    double aa=1,bb=2;
    SystemCounterState before_sstate = getSystemCounterState();
    aa = aa + bb;
    SystemCounterState after_sstate = getSystemCounterState();
    for ( int i=0; i < NB_EVENTS; i++ ) {
        uint64 value = getNumberOfCustomEvents(i, before_sstate, after_sstate);
        printf("Event %0d: 0x%04x0x%04x: %lld\n", i+1, events[i].event_number, events[i].umask_value, value);
    }
The output:
Event 1: 0x00c70x0001: 4609
Event 2: 0x00c70x0004: 2236
Event 3: 0x00c70x0010: 42
I am not sure how many FLOPS you were expecting, but it is typically a good idea to have a loop with a controllable number of FLOPS so that you can look at expected vs reported values....
In a recent test, I used a version of the STREAM benchmark that I expected to generate slightly over 50.5 billion counts, and "perf stat" reported 50.875 billion counts using the 0x37 counters programmed with the "raw" events interface. I have not tested on a wide variety of systems or with all supported instruction sets, but the hardware counting looks accurate so far.
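A minimal sketch of such a controllable kernel is shown below. It performs one double-precision add per element, so the expected FLOP count equals the vector length, and a small helper compares that expectation with whatever the counters report. Both function names are hypothetical, and a compiler may vectorize the loop, in which case the adds show up in the packed events rather than the scalar one.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical test kernel: one double-precision add per element.
// Returns the number of floating-point operations it is expected to perform.
std::uint64_t addKernel(std::vector<double>& a, const std::vector<double>& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        a[i] += b[i];
    }
    return static_cast<std::uint64_t>(n);
}

// Relative difference between the expected and the reported FLOP count.
double relativeError(std::uint64_t expected, std::uint64_t reported) {
    if (expected == 0) {
        return 0.0;
    }
    const std::uint64_t diff =
        reported > expected ? reported - expected : expected - reported;
    return static_cast<double>(diff) / static_cast<double>(expected);
}
```

Running the kernel between the two counter snapshots with a large vector (millions of elements) makes the expected count dominate any background activity, so a small relative error suggests the events are programmed correctly.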
I'm just measuring (a = a + b), where a and b are plain doubles, as shown below:
    double aa=1,bb=2;
    SystemCounterState before_sstate = getSystemCounterState();
    aa = aa + bb;
    SystemCounterState after_sstate = getSystemCounterState();
Hence, I'm expecting much smaller counts than what I'm getting (shown in my previous post).
Hi,
getSystemCounterState collects statistics from all cores on the system, not just from your code. PCM is a processor-centric API (there is no mapping to user threads). You can limit collection to a certain logical core (getCoreCounterState), but then you need to pin your thread to that logical core (e.g. with a pthread_setaffinity_np call, or by running your program with the taskset utility on Linux). Even this does not guarantee that the OS will not interrupt your program and schedule something else on that logical core during the measurement. Another aspect is that the PCM API itself might perform some FP computation internally, which adds to the measured statistics.
Following John's advice, you can significantly increase the amount of computation in the measurement region to minimize the relative impact of these side effects.
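A minimal, Linux-only sketch of the pinning step could look like the following. It uses sched_setaffinity with a pid of 0, which applies to the calling thread, as a stand-in for the pthread_setaffinity_np call; the two helper names are hypothetical and are not part of PCM.

```cpp
#include <sched.h>

// Return the lowest-numbered logical core the calling thread may run on,
// or -1 if the affinity mask cannot be read.
int firstAllowedCore() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return -1;
    }
    for (int core = 0; core < CPU_SETSIZE; ++core) {
        if (CPU_ISSET(core, &set)) {
            return core;
        }
    }
    return -1;
}

// Pin the calling thread to a single logical core.
// Returns 0 on success and -1 on failure (errno is set by the kernel call).
int pinCurrentThreadToCore(int core) {
    if (core < 0 || core >= CPU_SETSIZE) {
        return -1;
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

After pinning, the per-core counters for that same logical core would be the ones to read before and after the measured region. Pinning only keeps your thread on that core; it does not keep other work off it.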
Thanks,
Roman