I run my program on Ivy Bridge and collect the following events. Sometimes the computed FLOPS come out at more than 100% of the peak.
FP_COMP_OPS_EXE.X87
FP_COMP_OPS_EXE.SSE_FP_PACKED_DOUBLE
FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE
FP_COMP_OPS_EXE.SSE_FP_PACKED_SINGLE
FP_COMP_OPS_EXE.SSE_FP_SCALAR_DOUBLE
SIMD_FP_256.PACKED_SINGLE
SIMD_FP_256.PACKED_DOUBLE
X87: 14588.000000
PackedD: 21455680.000000
ScalarS: 24.000000
PackedS: 0.000000
ScalarD: 66765430.000000
256PackedS: 1247.000000
256PackedD: 59014271960.000000
Max: 422400000000.000000
Time1: 0.500438
Time2: 0.500555
Computed result (%): 111.723512
I use the following formula.
100 * ( (X87 + 4*256PackedS + 4*256PackedD)/Time1 + (2*PackedD + ScalarS + 2*PackedS + ScalarD)/Time2 ) / Max
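For reference, the calculation can be reproduced with a short Python sketch. It uses the counter values listed in this post and the weights from the formula, taking the two 256-bit terms to be 256PackedS and 256PackedD. The names `counts` and `percent_of_peak` are made up for this illustration.

```python
# Counter values copied from this post (the trailing .000000 is dropped).
counts = {
    "X87": 14588,
    "PackedD": 21455680,
    "ScalarS": 24,
    "PackedS": 0,
    "ScalarD": 66765430,
    "256PackedS": 1247,
    "256PackedD": 59014271960,
}
max_flops = 422_400_000_000  # "Max" from the post
time1 = 0.500438             # "Time1" from the post
time2 = 0.500555             # "Time2" from the post


def percent_of_peak(c, t1, t2, peak):
    """Apply the formula from the post; the weights are the poster's own."""
    # Assumption: the first 256-bit term is 256PackedS, the second 256PackedD.
    group1 = c["X87"] + 4 * c["256PackedS"] + 4 * c["256PackedD"]
    group2 = 2 * c["PackedD"] + c["ScalarS"] + 2 * c["PackedS"] + c["ScalarD"]
    return 100.0 * (group1 / t1 + group2 / t2) / peak


result = percent_of_peak(counts, time1, time2, max_flops)
print(f"{result:.4f}% of Max")
```

With these inputs the result is about 111.72%, and nearly all of it comes from the 256PackedD count. Note that a 256-bit packed-single operation works on 8 floats and a 128-bit packed-single operation on 4, so those two weights may deserve a second look, although the corresponding counts here are too small to change the result.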
The SIMD floating-point counters on Sandy Bridge and Ivy Bridge are known to overcount. The amount of overcounting depends on how long the FP arithmetic instructions have to wait for their input arguments to be ready.
- For data in L1 cache, the overcounting is very small (~3% on DGEMM).
- For data in L2 cache, the overcounting is somewhat larger -- I seem to recall values in the 10% range.
- For data in memory, the overcounting can be very large. With the STREAM benchmark using all cores, I have seen overcounting ratios of 6x to 10x.
The floating-point counters on Broadwell and Skylake are a new implementation and don't appear to have this problem.
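One way to estimate the overcounting for a given kernel is to compare the counter-derived FLOP count with an analytic count. The sketch below does this for DGEMM, where an n x n by n x n multiply performs 2*n^3 floating-point operations. The function names, the matrix size, and the "measured" value are all hypothetical and only illustrate the arithmetic; they are not measurements.

```python
def dgemm_flops(n):
    """Analytic FLOP count for an n x n by n x n matrix multiply."""
    # n^3 multiply-add pairs, each counted as 2 floating-point operations.
    return 2 * n**3


def overcount_ratio(counter_flops, expected_flops):
    """Ratio of counter-derived FLOPs to the analytic count (1.0 = exact)."""
    return counter_flops / expected_flops


n = 1000                               # hypothetical matrix size
expected = dgemm_flops(n)              # 2_000_000_000 operations
counter_derived = 1.03 * expected      # hypothetical value mimicking ~3% overcount
ratio = overcount_ratio(counter_derived, expected)
print(f"overcount ratio: {ratio:.2f}")
```

A ratio well above 1.0 on a kernel with a known operation count suggests the counter values should not be trusted as an exact FLOP count on that run.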