Solved: How to measure flops on v4

GHui · ‎08-23-2016

I cannot find the FP* events for v4 in 64-ia-32-architectures-software-developer-manual-325462.pdf. Is there a manual that documents them?

McCalpinJohn · ‎08-24-2016

The events are documented at https://download.01.org/perfmon/BDW/Broadwell_core_V16.json -- look for "FP_ARITH" and you will find the various sub-events of the new 0xC7 core performance counter event.

GHui · ‎08-26-2016

I collected the following events while running xhpl as a test.

FP_ARITH_INST_RETIRED.SCALAR_DOUBLE
FP_ARITH_INST_RETIRED.SCALAR_SINGLE
FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE
FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE
FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE
FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE
FP_ARITH_INST_RETIRED.SCALAR
FP_ARITH_INST_RETIRED.PACKED
FP_ARITH_INST_RETIRED.SINGLE
FP_ARITH_INST_RETIRED.DOUBLE

Their differences over one second are "0 0 0 0 0 31684 1340700232 0.0 0.0 0.0".

I am confused about how to interpret these events: some are zero and others are not.

Also, do any of these events include the others?

GHui · ‎08-26-2016

I have run the mkl/benchmarks/linpack/runme_xeon64 program.

runme_xeon64 printed the following output:

Size LDA Align. Time(s) GFlops Residual Residual(norm) Check
1000 1000 4 0.052 12.9315 8.866796e-13 3.023805e-02 pass
1000 1000 4 0.008 82.7219 8.866796e-13 3.023805e-02 pass
1000 1000 4 0.007 93.5988 8.866796e-13 3.023805e-02 pass
1000 1000 4 0.007 92.9639 8.866796e-13 3.023805e-02 pass
2000 2000 4 0.033 164.2892 3.864797e-12 3.361900e-02 pass
2000 2000 4 0.027 200.3969 3.864797e-12 3.361900e-02 pass
5000 5008 4 0.167 499.0555 2.383066e-11 3.322993e-02 pass
5000 5008 4 0.190 438.7789 2.155309e-11 3.005404e-02 pass
10000 10000 4 0.974 685.0007 8.261911e-11 2.913233e-02 pass
10000 10000 4 0.906 736.1333 8.531753e-11 3.008383e-02 pass
15000 15000 4 2.516 894.5636 2.272723e-10 3.579576e-02 pass
15000 15000 4 2.760 815.4055 2.019905e-10 3.181385e-02 pass
18000 18008 4 4.663 834.0049 3.264814e-10 3.575372e-02 pass
18000 18008 4 4.587 847.6924 3.264814e-10 3.575372e-02 pass
20000 20016 4 5.986 891.1581 3.565633e-10 3.156367e-02 pass
20000 20016 4 6.009 887.7311 3.565633e-10 3.156367e-02 pass
22000 22008 4 7.569 938.0349 4.454127e-10 3.262473e-02 pass
22000 22008 4 7.541 941.4906 4.454127e-10 3.262473e-02 pass
25000 25000 4 10.524 989.9109 5.087659e-10 2.893169e-02 pass
25000 25000 4 10.488 993.3168 5.087659e-10 2.893169e-02 pass
26000 26000 4 11.710 1000.7430 5.944061e-10 3.125565e-02 pass
26000 26000 4 11.758 996.6501 5.944061e-10 3.125565e-02 pass
27000 27000 4 13.020 1007.9769 6.490156e-10 3.164930e-02 pass
30000 30000 1 17.293 1040.9949 8.272351e-10 3.260969e-02 pass

But when I collect these events, I get only 274.324 GFlops.
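
As a cross-check when comparing a counter-derived rate with the benchmark's own figure: the GFlops column above appears consistent with the conventional LINPACK operation count of 2/3 n^3 + 2 n^2 divided by the run time. The sketch below assumes that formula (the benchmark output does not state it), and `linpack_gflops` is a hypothetical helper name.

```python
def linpack_gflops(n, seconds):
    """Nominal LINPACK rate in GFlops for an n x n solve.

    ASSUMPTION: the conventional operation count 2/3*n^3 + 2*n^2,
    which is not stated in the benchmark output itself.
    """
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / seconds / 1e9


# Last row of the table above: n = 30000, 17.293 s, reported 1040.9949 GFlops.
print(f"{linpack_gflops(30000, 17.293):.1f} GFlops")
```

Because the Time(s) column is rounded to three decimals, the recomputed value only matches the reported one to within that rounding.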

McCalpinJohn · ‎08-26-2016

How are you collecting these counts?

These events count instructions, not operations, so the first six need to be scaled by the corresponding width if you want an operation count. The documentation pointed to above clearly explains how many operations each increment corresponds to, and points out that for Multiply-Add operations the counter is incremented twice, so that operations are counted in the expected way (Multiply-Add = 2 operations).

The scaling should be:

FP_ARITH_INST_RETIRED.SCALAR_DOUBLE 1
FP_ARITH_INST_RETIRED.SCALAR_SINGLE 1
FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE 2
FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE 4
FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE 4
FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE 8

From inspection of the Umask values, the next two events are the sum of the single and double precision operations for each case. For the PACKED case it is not possible to get an operation count, since the single packed instructions correspond to a different number of operations than the double packed instructions.

FP_ARITH_INST_RETIRED.SCALAR
FP_ARITH_INST_RETIRED.PACKED

From inspection of the Umask values, the next two events are the sum of the scalar, packed 128-bit, and packed 256-bit instructions for each precision. It is not possible to get an operation count from either of these counters, since they combine instructions of different widths.

FP_ARITH_INST_RETIRED.SINGLE
FP_ARITH_INST_RETIRED.DOUBLE
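
The inclusion relationship can be made concrete by treating each Umask as a bit set. This is a small sketch only: the Umask values below are assumptions recalled from the Broadwell event file linked earlier in the thread and should be verified against that file, and `members` is a hypothetical helper name.

```python
# Umask values for event 0xC7 (FP_ARITH_INST_RETIRED) on Broadwell.
# ASSUMPTION: recalled values; verify against Broadwell_core_V16.json.
BASE = {
    "SCALAR_DOUBLE": 0x01,
    "SCALAR_SINGLE": 0x02,
    "128B_PACKED_DOUBLE": 0x04,
    "128B_PACKED_SINGLE": 0x08,
    "256B_PACKED_DOUBLE": 0x10,
    "256B_PACKED_SINGLE": 0x20,
}
COMBINED = {"SCALAR": 0x03, "PACKED": 0x3C, "SINGLE": 0x2A, "DOUBLE": 0x15}


def members(umask):
    """Return the base sub-events whose bit is set in a combined Umask."""
    return [name for name, bit in BASE.items() if umask & bit]


for name, umask in COMBINED.items():
    print(f"{name:7s} 0x{umask:02X} = " + " + ".join(members(umask)))
```

Under those assumed values, each combined event is the OR of the base sub-events it includes, which is why it counts their sum.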

For the xHPL code running on a Xeon E5 v4, almost all of the counts should be in the FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE category. These should be multiplied by 4 to get the FP operation count.
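
The scaling described in this reply can be sketched as follows. The weights come from the list above; the function names and the sample counts are hypothetical and are not measured data.

```python
# FP operations per counter increment, from the scaling list above.
# Multiply-Add already increments the counter twice, so no extra factor.
FLOPS_PER_INCREMENT = {
    "FP_ARITH_INST_RETIRED.SCALAR_DOUBLE": 1,
    "FP_ARITH_INST_RETIRED.SCALAR_SINGLE": 1,
    "FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE": 2,
    "FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE": 4,
    "FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE": 4,
    "FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE": 8,
}


def flop_count(counts):
    """Total FP operations from a dict of event name -> counter delta."""
    unknown = set(counts) - set(FLOPS_PER_INCREMENT)
    if unknown:
        # Combined events (SCALAR, PACKED, SINGLE, DOUBLE) are rejected:
        # they are sums of the base events and would double count.
        raise ValueError(f"no operation width for: {sorted(unknown)}")
    return sum(FLOPS_PER_INCREMENT[e] * n for e, n in counts.items())


def gflops(counts, seconds):
    """FP rate in GFlops over the measurement interval."""
    return flop_count(counts) / seconds / 1e9


# Hypothetical one-second sample summed over all cores (not measured data).
sample = {
    "FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE": 250_000_000_000,
    "FP_ARITH_INST_RETIRED.SCALAR_DOUBLE": 1_000_000,
}
print(gflops(sample, 1.0))
```

To compare against a whole-machine benchmark figure, the counts need to be summed over every core that ran the job before scaling.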

GHui · ‎08-31-2016

I wrote event 0x6310C7 to evtsel 0x18A and read the count from pmc 0xc5.

But I get zero counts.
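
One way to sanity-check this programming is to decode the value with the architectural IA32_PERFEVTSELx bit layout from the Intel SDM and to work out which counter the two MSR addresses refer to (IA32_PERFEVTSEL0 is 0x186 and IA32_PMC0 is 0xC1). A minimal sketch; `decode_evtsel` is a hypothetical helper name.

```python
PERFEVTSEL0 = 0x186  # IA32_PERFEVTSEL0 MSR address
PMC0 = 0xC1          # IA32_PMC0 MSR address


def decode_evtsel(value):
    """Split an IA32_PERFEVTSELx value into its architectural fields."""
    return {
        "event": value & 0xFF,
        "umask": (value >> 8) & 0xFF,
        "usr": bool(value & (1 << 16)),
        "os": bool(value & (1 << 17)),
        "edge": bool(value & (1 << 18)),
        "pin_control": bool(value & (1 << 19)),
        "int": bool(value & (1 << 20)),
        "any_thread": bool(value & (1 << 21)),
        "enable": bool(value & (1 << 22)),
        "invert": bool(value & (1 << 23)),
        "cmask": (value >> 24) & 0xFF,
    }


fields = decode_evtsel(0x6310C7)
print(f"event=0x{fields['event']:02X} umask=0x{fields['umask']:02X}")
print({k: v for k, v in fields.items() if k not in ("event", "umask")})
print("evtsel index:", 0x18A - PERFEVTSEL0, "pmc index:", 0xC5 - PMC0)
```

The value decodes to event 0xC7 with Umask 0x10, counting in user and kernel mode with AnyThread and Enable set, programmed on counter 4. On these cores, general-purpose counters 4 to 7 are visible to a logical processor only when Hyper-Threading is disabled, so that may be worth checking; this is a possibility to rule out, not a diagnosis given in the thread.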

GHui · ‎09-05-2016

Can v3 also use these "FP_ARITH" events for counting flops?

McCalpinJohn · ‎09-06-2016

These events do not exist on Xeon E5 v3.

The 0xC7 event is not documented on Xeon E5 v3, but a quick test shows that it is counting something, and it looks like it is probably counting the 0xC7 SIMD events defined for the Nehalem/Westmere platform. These include arithmetic and non-arithmetic SIMD instructions, so they are not useful for counting FP operations.

GHui · ‎09-06-2016

How can I count FP operations on v3?

What events can I use to count FP operations?

McCalpinJohn · ‎09-07-2016

There are no counters for floating-point operations on Xeon E5 v3.

The 0x10 and 0x11 events that counted floating point operations on Xeon E5 v1 and v2 suffered from an implementation bug that could lead to serious overcounting (I have measured up to 10x over-counts), so these were disabled on the Xeon E5 v3. Unfortunately the replacement 0xC7 events were not included until Xeon E5 v4, leaving Xeon E5 v3 with nothing.