The underlying hardware performance counters don't provide the required information, so there is no way for PCM to compute GFLOPS.
If the new arithmetic operation counters in Broadwell work correctly, then it will be possible on that platform (and it looks like Skylake has the same support), but you will have to measure several different events, scale each result by the number of FP operations per increment, and sum the scaled values to get the total FP operation count.
There are 6 events [scalar, packed 128-bit, packed 256-bit] x [single precision, double precision]. An increment to one of these events corresponds to 1, 2, 4, or 8 FP operations, and the counters will increment twice for the fused multiply/add operations (thank goodness!).
So it is clear that counting all 6 events, scaling each by its "width" and summing the 6 scaled values will give you the total FP operation count.
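A minimal sketch of that scale-and-sum step, in Python. It assumes the six raw counts have already been read from the counters by some other tool; the function names and the sample counts are hypothetical and exist only for illustration.

```python
# FP operations represented by ONE increment of each event:
# scalar = 1, 128-bit double = 2, 128-bit single = 4,
# 256-bit double = 4, 256-bit single = 8.
FLOPS_PER_INCREMENT = {
    "SCALAR_SINGLE": 1,
    "SCALAR_DOUBLE": 1,
    "128BIT_PACKED_DOUBLE": 2,
    "128BIT_PACKED_SINGLE": 4,
    "256BIT_PACKED_DOUBLE": 4,
    "256BIT_PACKED_SINGLE": 8,
}


def total_fp_ops(counts):
    """Scale each raw event count by its width and sum the results.

    `counts` maps event name -> raw counter value; missing events count as 0.
    """
    return sum(
        width * counts.get(name, 0)
        for name, width in FLOPS_PER_INCREMENT.items()
    )


def gflops(counts, elapsed_seconds):
    """Convert the total FP operation count to GFLOPS over a measured interval."""
    return total_fp_ops(counts) / elapsed_seconds / 1e9


if __name__ == "__main__":
    # Hypothetical counter readings, made up purely for illustration.
    sample = {
        "SCALAR_DOUBLE": 1_000_000,
        "256BIT_PACKED_DOUBLE": 500_000_000,
    }
    print(total_fp_ops(sample))
    print(gflops(sample, elapsed_seconds=1.0))
```

Because the counters already increment twice for a fused multiply/add, the sketch applies no extra FMA correction.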
It is not yet clear whether it will be possible to set multiple bits in the counter mask to collect the same sum using only 4 events:
- Single FP operation per increment: SCALAR_DOUBLE + SCALAR_SINGLE
- Two FP operations per increment: 128BIT_PACKED_DOUBLE
- Four FP operations per increment: 128BIT_PACKED_SINGLE + 256BIT_PACKED_DOUBLE
- Eight FP operations per increment: 256BIT_PACKED_SINGLE
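The four groupings above can be sketched as OR-ed unit masks, one per counter. The umask bit values below are an assumption (one bit per sub-event of the Broadwell FP arithmetic event); check them against the event list before use. Whether the hardware actually sums the sub-events when several bits are set is exactly the open question, so this only shows the bookkeeping.

```python
# ASSUMED unit-mask bits for the six sub-events; verify against the
# Broadwell event documentation before relying on them.
UMASK = {
    "SCALAR_DOUBLE": 0x01,
    "SCALAR_SINGLE": 0x02,
    "128BIT_PACKED_DOUBLE": 0x04,
    "128BIT_PACKED_SINGLE": 0x08,
    "256BIT_PACKED_DOUBLE": 0x10,
    "256BIT_PACKED_SINGLE": 0x20,
}

# Sub-events grouped by FP operations per increment, as in the list above.
GROUPS = {
    1: ["SCALAR_DOUBLE", "SCALAR_SINGLE"],
    2: ["128BIT_PACKED_DOUBLE"],
    4: ["128BIT_PACKED_SINGLE", "256BIT_PACKED_DOUBLE"],
    8: ["256BIT_PACKED_SINGLE"],
}


def combined_umask(names):
    """OR together the unit-mask bits of the named sub-events."""
    mask = 0
    for name in names:
        mask |= UMASK[name]
    return mask


def total_fp_ops_from_groups(group_counts):
    """Sum the four grouped counts, each scaled by its FP ops per increment.

    `group_counts` maps width (1, 2, 4, 8) -> raw count of that group's counter.
    """
    return sum(width * group_counts.get(width, 0) for width in GROUPS)


if __name__ == "__main__":
    for width, names in GROUPS.items():
        print(f"{width} FP op(s) per increment: umask=0x{combined_umask(names):02x}")
```

If the combined masks do count correctly, the four scaled values should add up to the same total as the six-event measurement on the same workload, which makes for a simple validation run.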
This assumes that all FP operations are SSE or AVX. To count x87 floating-point operations (difficult to generate with recent compilers, but still used in some codes) you would need different counter events, and I don't see an x87 arithmetic operation event in the Broadwell counter documentation at https://download.01.org/perfmon/BDW/