
Why flops more than 100%

GHui — Wed, 13 Sep 2017 09:55:18 GMT

I run my program on Ivy Bridge and collect the following events. Sometimes the computed FLOPS percentage comes out above 100%.

FP_COMP_OPS_EXE.X87
FP_COMP_OPS_EXE.SSE_FP_PACKED_DOUBLE
FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE
FP_COMP_OPS_EXE.SSE_FP_PACKED_SINGLE
FP_COMP_OPS_EXE.SSE_FP_SCALAR_DOUBLE
SIMD_FP_256.PACKED_SINGLE
SIMD_FP_256.PACKED_DOUBLE

Counter       Value
X87           14588
PackedD       21455680
ScalarS       24
PackedS       0
ScalarD       66765430
256PackedS    1247
256PackedD    59014271960
Max           422400000000
Time1         0.500438
Time2         0.500555
Result (%)    111.723512

I use the following formula.

100 * ( (X87 + 4*256PackedS + 4*256PackedD)/Time1 + (2*PackedD + ScalarS + 2*PackedS + ScalarD)/Time2 ) / Max
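To make the arithmetic easy to check, here is a minimal Python sketch that plugs the sample counts above into the formula. The function name and the dictionary layout are my own, not from any tool. The 256-bit single-precision term is taken as 4*256PackedS, which is an assumption; it is the reading that reproduces the posted result.

```python
# Sample counts from the post; keys follow the column names used there.
counts = {
    "X87": 14588,
    "PackedD": 21455680,
    "ScalarS": 24,
    "PackedS": 0,
    "ScalarD": 66765430,
    "256PackedS": 1247,
    "256PackedD": 59014271960,
}


def flops_percent(c, time1, time2, peak):
    """Percent of peak FLOPS using the weights from the posted formula.

    Assumption: the 256-bit single-precision term is 4 * 256PackedS.
    """
    # Events divided by Time1 in the posted formula.
    group1 = c["X87"] + 4 * c["256PackedS"] + 4 * c["256PackedD"]
    # Events divided by Time2 in the posted formula.
    group2 = 2 * c["PackedD"] + c["ScalarS"] + 2 * c["PackedS"] + c["ScalarD"]
    return 100.0 * (group1 / time1 + group2 / time2) / peak


pct = flops_percent(counts, time1=0.500438, time2=0.500555, peak=422400000000.0)
print(f"{pct:.2f}% of peak")
```

With these inputs the result is about 111.72%, in line with the posted value. Almost all of it comes from the 256PackedD count, so that counter is the one to look at. Separately, a 128-bit register holds 4 single-precision values and a 256-bit register holds 8, so weights of 4 and 8 for the packed-single events may be closer to what was intended; with the counts above that change would be negligible.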


McCalpinJohn — Wed, 13 Sep 2017 13:45:42 GMT

The SIMD floating-point counters on Sandy Bridge and Ivy Bridge are known to overcount. The amount of overcounting depends on how long the FP arithmetic instructions have to wait for their input arguments to be ready.

For data in L1 cache, the overcounting is very small (~3% on DGEMM).
For data in L2 cache, the overcounting is somewhat larger -- I seem to recall values in the 10% range.
For data in memory, the overcounting can be very large. With the STREAM benchmark using all cores, I have seen overcounting ratios of 6x to 10x.
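One way to estimate the overcounting on a given machine is to run a kernel whose operation count is known analytically and compare it with what the counters report. The sketch below assumes a square DGEMM of order n, which performs roughly 2*n^3 floating-point operations; the function names and the example numbers are hypothetical, not measurements.

```python
def dgemm_flops(n):
    """Approximate FLOP count of an n x n x n DGEMM (one multiply and one add per inner step)."""
    return 2.0 * n ** 3


def overcount_ratio(counted_flops, expected_flops):
    """Ratio of counter-derived FLOPs to the analytically expected FLOPs."""
    if expected_flops <= 0:
        raise ValueError("expected_flops must be positive")
    return counted_flops / expected_flops


# Hypothetical example: the counters report 3% more FLOPs than a DGEMM of order 1000 performs.
expected = dgemm_flops(1000)
counted = 1.03 * expected
ratio = overcount_ratio(counted, expected)
print(f"overcount ratio: {ratio:.2f}")
```

A ratio well above 1 on a memory-bound kernel would be consistent with the behavior described here, and it means a utilization figure derived from these counters should not be taken at face value.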

The floating-point counters on Broadwell and Skylake are a new implementation and don't appear to have this problem.