Erroneous detailed hardware metrics in VTune Amplifier XE 2013

Peter_E_ · ‎03-09-2014

Hello,

I'm having problems with my analysis results in VTune. Whenever I run an analysis, the hardware event-based metrics behave strangely. Their "aggregator" (e.g. Bad Speculation) seems to show correct values, but when I expand it, the detailed metrics (e.g. Branch Mispredict, Machine Clears) always show zero (the blue bar is missing).

To illustrate, here is a C++ code snippet that should trigger loads of L1 cache misses:

/* ... */

/* kEvilOffset is 0x1000; accessing data with this offset should result in a large number of L1D replacements (assuming a 32KB L1 data cache). */

__declspec(noinline) void DoStuff() {
  for (size_t i = 0; i < kAllocSize - kEvilOffset; ++i) {
    for (size_t j = i; j < kAllocSize; j += kEvilOffset) {
      data[j] *= 39;
    }
  }
}

/* ... */
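To make the reasoning behind the 0x1000 stride concrete, here is a minimal sketch of the set-index arithmetic. It assumes an L1 data cache of 32KB with 8 ways and 64-byte lines (an assumption, not something measured in this thread). SetIndex and LinesInSameSet are hypothetical helper names and are not part of the test program.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed L1D geometry: 32KB, 8-way set associative, 64-byte lines.
const std::size_t kLineSize = 64;
const std::size_t kWays = 8;
const std::size_t kCacheSize = 32 * 1024;
const std::size_t kNumSets = kCacheSize / (kLineSize * kWays);  // 64 sets

// Stride used in the post.
const std::size_t kEvilOffset = 0x1000;

// Set selected by a byte address under the assumed geometry.
inline std::size_t SetIndex(std::uintptr_t address) {
  return (address / kLineSize) % kNumSets;
}

// Of `count` accesses starting at `base` with the given stride, how many
// fall into the same set as `base`.
inline std::size_t LinesInSameSet(std::uintptr_t base, std::size_t stride,
                                  std::size_t count) {
  std::size_t same = 0;
  for (std::size_t k = 0; k < count; ++k) {
    if (SetIndex(base + k * stride) == SetIndex(base)) {
      ++same;
    }
  }
  return same;
}
```

Under these assumptions, 0x1000 bytes is exactly kNumSets * kLineSize, so every access in the inner loop maps to the same set, and more than 8 such lines cannot stay resident in it at once.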

After running a "General Exploration (Sandy Bridge / Ivy Bridge / Haswell)" analysis, the blue bar for "Back-End Bound" is very wide (which, by the way, should be correct, since cache misses are in that category), but when I expand that category, there are no blue bars at all for any of those events. I also attached two screenshots. Am I doing something wrong?

I'm compiling with Intel C++ 14.0, and using VTune Amplifier XE 2013 Update 15 (build 328102).

Peter_W_Intel · ‎03-09-2014

Very good question.

1. Right-click on the data and use "Show data as" to switch from Bar to Number; this shows the exact values.

2. Hover the cursor over a column (metric) to see an explanation of that metric and how its value is computed from the metric's formula.

3. Every metric, whether upper-level or lower-level, is calculated from different events.

Peter_E_ · ‎03-09-2014

The numeric representation shows 0.681 for the category "Back-End Bound", and 0.000 for every sub-category of it (it did not help :( ).

Peter_W_Intel · ‎03-09-2014

Peter E. wrote:

The numeric representation shows 0.681 for the category "Back-End Bound", and 0.000 for every sub-category of it (it did not help :( ).

My understanding is that the metrics in the sub-category (e.g. Memory Latency, Memory Replacements, Memory Reissues) affect "Back-end Bound". However, other factors used in the formula, for example IDQ UOPS not delivered and UOPS issued, each divided by CPU clocks, also affect the Back-end Bound value.
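As a rough illustration of how such a formula can combine these events, here is a sketch of the commonly published top-down level-1 breakdown for a 4-wide core. Whether VTune Amplifier XE 2013 uses exactly these formulas is an assumption; the struct and function names are hypothetical. In this breakdown, IDQ_UOPS_NOT_DELIVERED.CORE feeds the front-end term directly, and Back-end Bound is what remains after the other three categories are subtracted.

```cpp
#include <cstdint>

// Raw event counts (event names given in the comments).
struct TopDownCounts {
  std::uint64_t clocks;              // CPU_CLK_UNHALTED.THREAD
  std::uint64_t uops_not_delivered;  // IDQ_UOPS_NOT_DELIVERED.CORE
  std::uint64_t uops_issued;         // UOPS_ISSUED.ANY
  std::uint64_t uops_retired_slots;  // UOPS_RETIRED.RETIRE_SLOTS
  std::uint64_t recovery_cycles;     // INT_MISC.RECOVERY_CYCLES
};

struct TopDownLevel1 {
  double front_end_bound;
  double bad_speculation;
  double retiring;
  double back_end_bound;
};

// Assumed formulas: each category is a fraction of the issue slots,
// with 4 slots per clock.
inline TopDownLevel1 ComputeTopDown(const TopDownCounts& c) {
  const double slots = 4.0 * static_cast<double>(c.clocks);
  TopDownLevel1 r;
  r.front_end_bound = static_cast<double>(c.uops_not_delivered) / slots;
  r.bad_speculation = (static_cast<double>(c.uops_issued) -
                       static_cast<double>(c.uops_retired_slots) +
                       4.0 * static_cast<double>(c.recovery_cycles)) / slots;
  r.retiring = static_cast<double>(c.uops_retired_slots) / slots;
  // Back-end Bound is the remainder, not a sum of its own sub-metrics.
  r.back_end_bound =
      1.0 - (r.front_end_bound + r.bad_speculation + r.retiring);
  return r;
}
```

Because the back-end term is a remainder in this sketch, it can be large even when every memory-related sub-metric reads 0.000.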

David_A_Intel1 · ‎03-10-2014

Hi Peter E.:

The reality is that the events available to the VTune Amplifier XE from the processor, while vaguely indicating a back-end bound issue, do not provide enough information to pinpoint the problem. The newer processors are doing a better job. So, while VTune Amplifier is indicating a *potential* performance issue in the back-end, the problem does not fall into one of the sub-metrics defined for the back-end. That's the general answer.

Specifically, if you provide more details, we *may* be able to help. For example, what processor are you collecting this data on? And, can you share the results with us? (zip up the results directory and either attach them here or submit an issue at Intel® Premier Support).

Peter_E_ · ‎03-11-2014

I'm performing the analysis on a Core i7 2600k. Since I have a student license, I'm not eligible for Intel Premier Support (as far as I know). I attached both the analysis results and the source code of the small test program I ran it on (it needs a file named data.dat, content irrelevant).

Peter_W_Intel · ‎03-11-2014

Thanks for your result file and source. Actually, there was no memory penalty in the example, but I saw a high IDQ_UOPS_NOT_DELIVERED.CORE event count (which caused Back-end Bound to be highlighted). If you select this event in the timeline and look at the bottom-up report, you will see high IDQ_UOPS_NOT_DELIVERED.CORE around 3.3s. Select the 3.3s - 3.35s time range, zoom in, and filter on the selection to generate a new report. You will see DoStuff() with high Back-end Bound; double-click it to view the source line, which is:

Source Line 21: for (size_t j = i; j < kAllocSize; j += kEvilOffset) {

  CPU_CLK_UNHALTED.THREAD:   3,840,000,000   3,840,000,000
  INST_RETIRED.ANY:          5,268,000,000   5,268,000,000
  CPI Rate (Total, Self):    0.729           0.729
  Retiring:                  0.373           0.373
  Bad Speculation:           0.007           0.007
  Back-end Bound:            0.613           0.613
  Front-end Bound:           0.011           0.011

This line is the inner loop. It means that uops decoded into the IDQ were not delivered to the RAT (at most 4 uops can be delivered to the RAT per clock), so perhaps there is a stall in the RAT? Sometimes you can adjust the algorithm or use the Intel C/C++ Composer to optimize it.
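For reference, the CPI Rate reported for that source line can be reproduced from the two raw counts, and the four top-level categories can be summed as a sanity check. This is only a sketch of the arithmetic; CpiRate and CategorySum are hypothetical helper names.

```cpp
#include <cstdint>

// Cycles per instruction: CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY.
inline double CpiRate(std::uint64_t clocks, std::uint64_t instructions) {
  return static_cast<double>(clocks) / static_cast<double>(instructions);
}

// Sum of the four top-level pipeline-slot categories.
inline double CategorySum(double retiring, double bad_speculation,
                          double back_end_bound, double front_end_bound) {
  return retiring + bad_speculation + back_end_bound + front_end_bound;
}
```

With the values reported for line 21, CpiRate(3840000000, 5268000000) is about 0.729, matching the report, and CategorySum(0.373, 0.007, 0.613, 0.011) is about 1.004, i.e. the four categories account for roughly all pipeline slots.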

Erroneous detailed hardware metrics in VTune Amplifier XE 2013