
- Intel Community
- Software
- Software Archive
- FPU Usage Upper Bound and GFLOPS Upper Bound


Zhen

Beginner


12-08-2016
09:00 AM

92 Views

FPU Usage Upper Bound and GFLOPS Upper Bound

I am using VTune to profile some code on KNL with the HPC Performance Characterization analysis. VTune reports the FPU Usage Upper Bound and GFLOPS Upper Bound metrics, and I am confused by their definitions. My code achieves 50+ GFLOPS on a single core, yet VTune reports a GFLOPS upper bound of only 17.

For the FPU Usage Upper Bound, VTune uses the equation: (0 / CPU_CLK_UNHALTED.THREAD + 0 / CPU_CLK_UNHALTED.THREAD) / (32 * 2) + (UOPS_RETIRED.PACKED_SIMD * 8 + UOPS_RETIRED.SCALAR_SIMD) / CPU_CLK_UNHALTED.THREAD / 32.

I do not quite understand the equation. Can anyone explain it? What does the 32 mean? And why is the first term there at all, given that its numerator is always 0?

Thanks!


8 Replies

Dmitry_P_Intel1

Employee


12-08-2016
09:52 AM


Hello,

In the Summary view, VTune shows metrics for the whole workload, including the initialization (warm-up) and finalization phases.

To measure only the computational part of your application, go to the "Bottom-Up" view, select the computational part, and "filter in" on it using the context menu.

"32" here means 8 doubles (512-bit AVX-512) * 2 (FMA) * 2 FPU units per core - the maximum number of double precision arithmetic operations that a physical core is capable to execute.

Since the HW counters available on KNL give no information on vector instruction width or masking, we can only calculate an "Upper Bound", assuming full vector width and no masking.
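A minimal sketch of that upper-bound calculation, assuming the counter names above, full 512-bit unmasked vectors (8 doubles per packed uop), and the 32 FLOP/cycle peak; the counter values below are hypothetical:

```python
# Sketch: FPU Usage Upper Bound for a KNL core, assuming full 512-bit
# unmasked vectors (8 doubles per packed uop) and a peak of 32 DP
# FLOP/cycle/core (8 doubles * 2 for FMA * 2 FPU units).

def fpu_usage_upper_bound(packed_simd, scalar_simd, unhalted_cycles):
    """Fraction of the 32 FLOP/cycle peak actually retired (upper bound)."""
    # Assume every packed uop is a full 8-wide unmasked vector operation.
    flops = packed_simd * 8 + scalar_simd
    return flops / unhalted_cycles / 32

# Hypothetical counts: 2e9 packed uops, 1e8 scalar uops, 1e9 unhalted cycles
print(fpu_usage_upper_bound(2e9, 1e8, 1e9))  # 0.503125
```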

===

It turned out that the events we rely on, such as UOPS_RETIRED.PACKED_SIMD, may count not only arithmetic operations but also, for example, vector stores, so we are reworking the FPU-related metrics in VTune 2017 U2, moving from FLOP-based to instruction-based metrics.

Dmitry_P_Intel1

Employee


12-08-2016
09:56 AM


Oh, one more comment.

To measure FLOPs precisely on KNL you can use Intel Advisor - it uses instruction-level dynamic instrumentation to calculate FLOPs. This has somewhat higher collection overhead, but it allows even masking to be taken into account.

Thanks & Regards, Dmitry

Zhen

Beginner


12-08-2016
10:29 AM


Hi dmitry-prohorov,

Thanks for your prompt reply. Will an FMA double the number of UOPS_RETIRED.PACKED_SIMD retired? And what about the 0 in the equation? It seems the whole term with 0 as the numerator could be removed.

One more question about performance events on KNL. On KNC there is a metric that calculates VPU intensity, but I cannot find equivalent events to calculate VPU intensity on KNL. The VPU utilization reported by VTune gives limited information.

By the way, do you know whether there is a document that details the PMU on KNL, like document number 327357-001 does for KNC?

Thanks!

Dmitry_P_Intel1

Employee


12-09-2016
12:44 AM


Hello Zhen,

FMA will double the UOPS_RETIRED.PACKED_SIMD counter.

Your comment on the formula is correct.

The vector intensity metric cannot be calculated for KNL, since it lacks a counter for the number of active elements in a vector instruction (which KNC had).
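For reference, the KNC-style calculation Dmitry mentions is usually expressed as a ratio of active vector lanes to VPU instructions; a hedged sketch, where the event names (VPU_ELEMENTS_ACTIVE, VPU_INSTRUCTIONS_EXECUTED) follow the KNC tuning guides and the counter values are hypothetical:

```python
# Sketch: KNC-style vectorization intensity, assuming the KNC events
# VPU_ELEMENTS_ACTIVE and VPU_INSTRUCTIONS_EXECUTED (no KNL equivalent).

def vectorization_intensity(elements_active, vpu_instructions):
    """Average active vector lanes per VPU instruction (max 8 for DP on KNC)."""
    return elements_active / vpu_instructions

# Hypothetical counts: 6.4e9 active elements over 1e9 VPU instructions
print(vectorization_intensity(6.4e9, 1e9))  # 6.4 lanes active on average
```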

Thanks & Regards, Dmitry

Zhen

Beginner


12-09-2016
11:12 AM


Thanks Dmitry!

Zhen

Beginner


12-10-2016
05:51 PM


Hello Dmitry,

One more question, regarding the explanation above: "32" means 8 doubles (512-bit AVX-512) * 2 (FMA) * 2 FPUs, and FMA will double the UOPS_RETIRED.PACKED_SIMD counter.

Does that mean each FPU can execute up to 2 PACKED_SIMD micro-operations per cycle, so UOPS_RETIRED.PACKED_SIMD can be up to 4 times CPU_CLK_UNHALTED.THREAD?

Best regards,

Zhen

McCalpinJohn

Black Belt


12-12-2016
12:04 PM


Each FPU executes only one PACKED_SIMD instruction per cycle, but if that instruction is an FMA it is typically counted as two floating-point operations -- an ADD and a MULTIPLY. For Broadwell (and newer) processors, the new floating-point performance counter events increment twice for FMA instructions (https://download.01.org/perfmon/BDW/Broadwell_FP_ARITH_INST_V17.json).
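A hedged arithmetic sketch of that counting convention (the instruction counts are hypothetical):

```python
# Sketch: DP FLOPs credited to a stream of full-width 512-bit FMA
# instructions, counting each FMA as two FLOPs (multiply + add) per lane.

def fma_flops(fma_instructions, lanes=8):
    """FLOPs for full-width DP FMAs: 2 operations per lane per instruction."""
    return fma_instructions * lanes * 2

# 1e9 full-width DP FMAs -> 16e9 FLOPs, i.e. 16 GFLOP of work
print(fma_flops(1e9))  # 16000000000.0
```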

On KNL, neither the documentation in Volume 2 of the Xeon Phi Performance Monitoring Reference Manual (document 334480) nor the event list at https://download.01.org/perfmon/KNL/KnightsLanding_core_V9.json suggests that an FMA instruction increments the counter twice.

Zhen

Beginner


12-13-2016
04:31 PM


I see. Thanks John!

