I am using VTune to profile some code on KNL with the HPC Performance Characterization analysis. VTune reports the FPU Usage Upper Bound and GFLOPS Upper Bound metrics, and I am confused by their definitions. My code achieves about 50+ GFLOPS on a single core, but VTune reports a GFLOPS upper bound of only 17.
For the FPU usage upper bound, VTune uses the equation: (0/CPU_CLK_UNHALTED.THREAD+0/CPU_CLK_UNHALTED.THREAD)/(32*2)+(UOPS_RETIRED.PACKED_SIMD*8+UOPS_RETIRED.SCALAR_SIMD)/CPU_CLK_UNHALTED.THREAD/32.
I do not quite understand the equation. Can anyone explain it in more detail? What does 32 mean? And the first term is always 0, so why is it there?
On the Summary pane, VTune shows metrics for the whole workload, including the initialization (warm-up) and finalization phases.
To measure only the computational part of your application, go to the "Bottom-Up" view, select the computational part, and choose "filter in" from the context menu.
"32" here means 8 doubles (512-bit AVX-512) * 2 (FMA) * 2 FPU units per core: the maximum number of double-precision arithmetic operations that a physical core can execute per cycle.
Since the HW counters available on KNL give no information on vector instruction size or masking, we can only calculate an "Upper Bound" that assumes full vector size and no masking.
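To make the arithmetic concrete, here is a minimal Python sketch of the non-zero part of the formula quoted in the question. The function name, the constant names, and the counter values are made up for illustration; this is not VTune code, only a restatement of the equation under the "full vector, no masking" assumption described above.

```python
# Illustrative sketch only: names and counter values are hypothetical.
DOUBLES_PER_VECTOR = 8   # 512-bit AVX-512 register / 64-bit double
FLOPS_PER_FMA = 2        # multiply + add
FPUS_PER_CORE = 2
PEAK_FLOPS_PER_CYCLE = DOUBLES_PER_VECTOR * FLOPS_PER_FMA * FPUS_PER_CORE  # 32


def fpu_usage_upper_bound(packed_simd, scalar_simd, clk_unhalted):
    """Non-zero term of the formula quoted in the question.

    Every packed uop is assumed to operate on a full, unmasked vector of
    8 doubles, which is why the result is only an upper bound.
    """
    flops_upper = packed_simd * DOUBLES_PER_VECTOR + scalar_simd
    return flops_upper / clk_unhalted / PEAK_FLOPS_PER_CYCLE


# Made-up counter values: 2 packed uops counted per unhalted cycle.
usage = fpu_usage_upper_bound(packed_simd=2_000_000,
                              scalar_simd=0,
                              clk_unhalted=1_000_000)
print(usage)  # 0.5
```

If the packed uops were actually masked or narrower than 512 bits, the real utilization would be lower than this value, which is why the metric carries "Upper Bound" in its name.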
It turned out that the events we rely on, such as UOPS_RETIRED.PACKED_SIMD, may count not only arithmetic operations but also, for example, vector stores. We are therefore redoing the FPU-related metrics in VTune 2017 U2, moving from FLOP-based to instruction-based metrics.
Oh, one more comment.
To measure FLOPs precisely on KNL you can use Intel Advisor. It uses instruction-level dynamic instrumentation to count FLOPs; this has somewhat more collection overhead, but it can take even masking into account.
Thanks & Regards, Dmitry
Thanks for your prompt reply. Will an FMA be counted twice in UOPS_RETIRED.PACKED_SIMD? And what about the 0 in the equation? It seems that the whole term with 0 as the numerator can be removed.
One more question about the performance events on KNL. On KNC there is a metric that calculates the VPU intensity, but I cannot find equivalent events to calculate the VPU intensity on KNL. The VPU utilization reported by VTune gives limited information.
By the way, do you know whether there is a document that describes the PMU on KNL in detail, like document number 327357-001 for KNC?
An FMA will increment the UOPS_RETIRED.PACKED_SIMD counter twice.
Your comment on the formula is correct.
The vector intensity metric cannot be calculated for KNL, since KNL has no counter for the number of active elements in a vector instruction (as KNC had).
Thanks & Regards, Dmitry
One more question. According to the explanation above, "32" means 8 doubles (512-bit AVX-512) * 2 (FMA) * 2 FPU units per core, and an FMA increments the UOPS_RETIRED.PACKED_SIMD counter twice.
Does this mean that each FPU can execute up to 2 PACKED_SIMD micro-operations per cycle, and that UOPS_RETIRED.PACKED_SIMD can be up to 4 times CPU_CLK_UNHALTED.THREAD?
Each FPU is only executing one PACKED_SIMD instruction per cycle, but if that instruction is an FMA this is typically counted as two Floating-Point operations -- ADD and MULTIPLY. For the Broadwell (and newer) processors, the new floating-point performance counter events increment twice for FMA instructions (https://download.01.org/perfmon/BDW/Broadwell_FP_ARITH_INST_V17.json).
On KNL, the documentation in Volume 2 of the Xeon Phi Performance Monitoring Reference Manual (document 334480) does not suggest that an FMA instruction will increment the counter twice, nor does the documentation at https://download.01.org/perfmon/KNL/KnightsLanding_core_V9.json.
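The two readings of the counter in this thread lead to different values of the metric for the same code. As a rough sketch, assume a hypothetical loop that retires two full-width FMA instructions per cycle (one per FPU) and nothing else. The function and parameter names below are invented for illustration, and which counting behaviour the hardware actually has is not settled here.

```python
# Illustrative comparison only; neither case is confirmed as KNL behaviour.
DOUBLES_PER_VECTOR = 8      # 512-bit vector of doubles
PEAK_FLOPS_PER_CYCLE = 32   # 8 doubles * 2 (FMA) * 2 FPUs


def usage_for_fma_loop(fma_per_cycle, increments_per_fma):
    """FPU usage upper bound for a pure-FMA loop.

    increments_per_fma is how many times UOPS_RETIRED.PACKED_SIMD is
    assumed to increment for a single FMA instruction.
    """
    packed_simd_per_cycle = fma_per_cycle * increments_per_fma
    return packed_simd_per_cycle * DOUBLES_PER_VECTOR / PEAK_FLOPS_PER_CYCLE


counted_once = usage_for_fma_loop(fma_per_cycle=2, increments_per_fma=1)
counted_twice = usage_for_fma_loop(fma_per_cycle=2, increments_per_fma=2)
print(counted_once, counted_twice)  # 0.5 1.0
```

Under the formula quoted in the question, the metric can only reach 1.0 for such a loop if each FMA is counted twice; if it is counted once, the same loop would be reported at 0.5.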