I run my program on Ivy Bridge and collect the following events. Sometimes the computed FLOPS come out at more than 100% of the maximum.
Event        Value
X87          14588.000000
PackedD      21455680.000000
ScalarS      24.000000
PackedS      0.000000
ScalarD      66765430.000000
256PackedS   1247.000000
256PackedD   59014271960.000000
Max          422400000000.000000
Time1        0.500438
Time2        0.500555
Result (%)   111.723512
I use the following formula.
100 * ( (X87 + 4*256PackedS + 4*256PackedD)/Time1 + (2*PackedD + ScalarS + 2*PackedS + ScalarD)/Time2 ) / Max
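For anyone who wants to reproduce the number, here is a minimal Python sketch of that formula applied to the values above. It reads the two 256-bit terms as one 256PackedS term and one 256PackedD term, and keeps the weights exactly as posted; the function and variable names are made up for illustration.

```python
# Counter values copied from the post (event counts are integers).
counts = {
    "X87": 14588,
    "PackedD": 21455680,
    "ScalarS": 24,
    "PackedS": 0,
    "ScalarD": 66765430,
    "256PackedS": 1247,
    "256PackedD": 59014271960,
}
max_flops = 422400000000.0  # "Max" from the post
time1 = 0.500438            # "Time1" from the post
time2 = 0.500555            # "Time2" from the post


def flops_percent(c, t1, t2, peak):
    """Percentage of peak, using the weights from the posted formula."""
    # Assumption: the 256-bit part is 4*256PackedS + 4*256PackedD.
    first_group = c["X87"] + 4 * c["256PackedS"] + 4 * c["256PackedD"]
    second_group = (2 * c["PackedD"] + c["ScalarS"]
                    + 2 * c["PackedS"] + c["ScalarD"])
    return 100.0 * (first_group / t1 + second_group / t2) / peak


pct = flops_percent(counts, time1, time2, max_flops)
print(f"{pct:.2f}% of Max")
```

With these inputs the sketch gives roughly 111.72%, in line with the 111.723512 figure listed with the data. Almost all of it comes from the 256PackedD count.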
The SIMD floating-point counters on Sandy Bridge and Ivy Bridge are known to overcount. The amount of overcounting depends on how long the FP arithmetic instructions have to wait for their input arguments to be ready.
- For data in L1 cache, the overcounting is very small (~3% on DGEMM).
- For data in L2 cache, the overcounting is somewhat larger -- I seem to recall values in the 10% range.
- For data in memory, the overcounting can be very large. With the STREAM benchmark using all cores, I have seen overcounting ratios of 6x to 10x.
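One way to put a number on the overcounting is to run a kernel whose operation count is known in advance and divide the counter-derived total by it. A rough Python sketch follows; the helper names and all input values are hypothetical, and the expected counts assume the usual conventions of 2 flops per element for STREAM Triad and 2*M*N*K flops for DGEMM.

```python
def triad_expected_flops(n_elements, n_iterations):
    """STREAM Triad, a[i] = b[i] + q*c[i]: one multiply and one add per element."""
    return 2 * n_elements * n_iterations


def dgemm_expected_flops(m, n, k):
    """Conventional DGEMM operation count: 2*M*N*K."""
    return 2 * m * n * k


def overcount_ratio(counted_flops, expected_flops):
    """Ratio of counter-derived flops to the analytically expected flops."""
    if expected_flops <= 0:
        raise ValueError("expected_flops must be positive")
    return counted_flops / expected_flops


# Hypothetical numbers, for illustration only.
expected = triad_expected_flops(n_elements=10_000_000, n_iterations=10)
counted = 1.6e9  # made-up counter-derived total
ratio = overcount_ratio(counted, expected)
print(f"overcount ratio: {ratio:.1f}x")
```

A ratio near 1.0 means the counters can be trusted for that kernel; a ratio well above 1.0 means the counts are inflated.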
The floating-point counters on Broadwell and Skylake are a new implementation and don't appear to have this problem.