VTune Top Down PMU Counters For Xeon Phi

CPati2 · ‎04-26-2018

Hi All,

With reference to the Top-Down approach in VTune, is there a way to identify which PMU performance counters are used to calculate the Retiring, Bad Speculation, Front-End and Back-End data? I have the formulas, but I would like to know the specific counters used on the Xeon Phi architecture.

Thanks,
Chetan Arvind Patil

Dmitry_R_Intel1 · ‎04-27-2018

The formulas are as follows:

Frontend_Bound = ( 2 * NO_ALLOC_CYCLES.NOT_DELIVERED ) / ( 2 * CPU_CLK_UNHALTED.THREAD )

Bad_Speculation = ( 2 * NO_ALLOC_CYCLES.MISPREDICTS ) / ( 2 * CPU_CLK_UNHALTED.THREAD )

Backend_Bound = 1 - ( Frontend_Bound + Bad_Speculation + Retiring )

Retiring = UOPS_RETIRED.ALL / ( 2 * CPU_CLK_UNHALTED.THREAD )
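For anyone post-processing raw counts outside VTune, here is a minimal Python sketch of the four formulas above. The function name and the `width` parameter are illustrative, not part of any tool; `width` is simply the factor of 2 that appears in the formulas, and the counts in the usage lines are made up.

```python
def level1_metrics(clk_thread, not_delivered, mispredicts, uops_retired, width=2):
    """Return the level 1 top-down fractions (0..1) from raw event counts.

    clk_thread    -> CPU_CLK_UNHALTED.THREAD
    not_delivered -> NO_ALLOC_CYCLES.NOT_DELIVERED
    mispredicts   -> NO_ALLOC_CYCLES.MISPREDICTS
    uops_retired  -> UOPS_RETIRED.ALL
    width         -> the factor of 2 from the formulas (assumed name)
    """
    total_slots = width * clk_thread
    if total_slots <= 0:
        raise ValueError("CPU_CLK_UNHALTED.THREAD must be positive")

    frontend = (width * not_delivered) / total_slots
    bad_speculation = (width * mispredicts) / total_slots
    retiring = uops_retired / total_slots
    # Backend Bound is whatever is left after the other three categories.
    backend = 1.0 - (frontend + bad_speculation + retiring)

    return {
        "Frontend_Bound": frontend,
        "Bad_Speculation": bad_speculation,
        "Retiring": retiring,
        "Backend_Bound": backend,
    }


# Made-up counts, only to show the call shape.
metrics = level1_metrics(clk_thread=1000, not_delivered=100,
                         mispredicts=50, uops_retired=400)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.1f}%")
```

Multiplying each fraction by 100 gives percentages of the kind VTune displays.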

CPati2 · ‎04-27-2018

Hi Dmitry,

Thank you.

As per the documentation (https://download.01.org/perfmon/index/silvermont.html), I see Silvermont (KNL) has CPU_CLK_UNHALTED.CORE and not CPU_CLK_UNHALTED.THREAD as listed in the formula.

Can I use the .CORE event instead of .THREAD?

Thanks.

Dmitry_R_Intel1 · ‎04-27-2018

KNL has Hyper-Threading, so CPU_CLK_UNHALTED.THREAD is the correct event.

CPati2 · ‎04-27-2018

Hi Dmitry,

Thanks.

The formulas you shared are for level 1. As per the TMA metric documentation here, there are three more levels: 2, 3 and 4. The pmu-tools source here has level 1 as you shared above, but level 2 in the same source gives formulas only for Frontend and not for the other three level 1 categories (Bad Speculation, Retiring and Backend Bound).

If you have these details, can you please share?

Also, these are sampled raw values for level 1:

Cycles: 46160407239
Front End: 2508371589
Bad Speculation: 313053734
Retiring: 322725651

Backend Bound is 93% while Retiring is just 0.34%. Isn't the cycle count too large? I am running a 16-threaded Caffe network and these are aggregate values.

Thanks.

Dmitry_R_Intel1 · ‎04-28-2018

Because the PMU on KNL is not as powerful, the metrics below level 1 are significantly poorer there than on big cores. See the full metrics table in the attachment.

The formulas should produce numbers from 0 to 1 (VTune also multiplies them by 100 and shows them as percentages). So what exactly are the numbers you posted? Could you please also show the raw event values?

Also, how do your 16 threads map to the topology: do they use 1 thread per physical core, or more?

CPati2 · ‎04-28-2018

Hi Dmitry,

I am using 16 threads in Scatter mode with 1 thread per core.

I use Linux perf to collect the counters and then post-process them to get the percentage values. The numbers I shared above are raw event counts; I then applied the formulas and converted the results to percentages.

Performance counters are as follows:

CPU_CLK_UNHALTED.THREAD: 46160407239
NO_ALLOC_CYCLES.NOT_DELIVERED: 2508371589
NO_ALLOC_CYCLES.MISPREDICTS: 313053734
UOPS_RETIRED.ALL: 322725651

TMA:

Frontend Bound = 0.0543403262 = 5.4%

Bad Speculation = 0.00678186681 = 0.6%

Retiring = 0.00349569762 = 0.3%

Backend Bound = 1 - (0.0543403262 + 0.00678186681 + 0.00349569762) = 0.935382109 = 93.5%

Questions:

- Are the above values expected? I just want to make sure my approach is correct.
- I tried toplev from pmu-tools on the system and I get similar values, where Backend Bound dominates.
- For Frontend Bound and Bad Speculation, the formulas have a factor of 2 in both the numerator and the denominator. Is there a specific reason? The two cancel out.
- For MemoryLatency and MemoryReissues, the metric file you shared has "Grid" as the option. Does that mean there is no formula for these at level 2, and I should use level 3 instead?
- What is the meaning of the last column, "Threshold"?

Thank you for sharing the document, it's helpful.

Thanks.

Dmitry_R_Intel1 · ‎04-28-2018

- I agree, such a low Retiring value looks suspicious. What is the value of the INST_RETIRED.ANY event?

- The toplev tool should use exactly the same formulas as VTune, so yes, this is expected.

- This is just to emphasize that the formulas have the structure <metric pipeline slots> / <total pipeline slots>.

- 'Grid' here means that this is just a grouping node, without any numerical value.

- The Threshold defines the criterion for when a given metric represents a potential issue that is worth a closer look. E.g. in VTune we highlight metrics that exceed their threshold and provide special tooltips with tuning advice and next steps.

CPati2 · ‎04-28-2018

Hi Dmitry,

I wasn't logging INST_RETIRED.ANY. With new runs, the values are as follows:

CPU_CLK_UNHALTED.THREAD: 33141162373
NO_ALLOC_CYCLES.NOT_DELIVERED: 1487081586
NO_ALLOC_CYCLES.MISPREDICTS: 993832863
UOPS_RETIRED.ALL: 4317014234
INST_RETIRED.ANY: 1897330410

I am sampling the workload every 1 second; the values above are from one of the samples. I have data for the full run of the workload, but I don't think aggregating the samples would lead to a drastic change compared to the 1-second sample.

If you have a KNL machine, do you see a similar trend irrespective of the workload? I can run a specific workload that you have data for, which would help cross-check.

Thanks.

CPati2 · ‎04-28-2018

Hi Dmitry,

I was totally wrong to analyze just the 1-second sample. Since I was grabbing the initial sample of the workload run, the values were skewed towards bad speculation while the workload was still being set up.

After analyzing all samples (average), I get acceptable values. Sorry for the confusion.
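To illustrate why a single startup interval can mislead, here is a small Python sketch that sums the raw counts over all intervals before applying the Retiring formula. Summing first is one possible way to "average" (it weights each interval by its cycle count) and is an assumption here, not necessarily what was done for the numbers in this thread. The helper names and the sample counts are made up.

```python
def aggregate(samples):
    """Sum raw event counts across a list of per-interval dicts."""
    totals = {}
    for sample in samples:
        for event, count in sample.items():
            totals[event] = totals.get(event, 0) + count
    return totals


def retiring(counts, width=2):
    """Retiring = UOPS_RETIRED.ALL / (2 * CPU_CLK_UNHALTED.THREAD)."""
    return counts["UOPS_RETIRED.ALL"] / (width * counts["CPU_CLK_UNHALTED.THREAD"])


# Hypothetical 1-second intervals: a slow startup interval, then steady state.
samples = [
    {"CPU_CLK_UNHALTED.THREAD": 100, "UOPS_RETIRED.ALL": 20},
    {"CPU_CLK_UNHALTED.THREAD": 100, "UOPS_RETIRED.ALL": 120},
]

first_only = retiring(samples[0])
whole_run = retiring(aggregate(samples))
print(f"first interval: {first_only:.2f}, whole run: {whole_run:.2f}")
```

With these made-up counts the first interval alone gives a much lower Retiring fraction than the whole run, which is the effect described above.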

Questions:

- Why is there no level 4 for KNL?
- LLCHitRateKNL, LLCHitKNL and LLCMissKNL are listed as using counters with "_PS" at the end, but I cannot find any events that end with "_PS" for these. For example, mem_uops_retired.l1_miss_loads is valid but mem_uops_retired.l1_miss_loads_ps is not?
- The same is true for SplitLoadsKNL and LoadsBlockedbyStoreForwardingKNL. If I remove "_PS" from the end, I can see the events in perf.
- I see "MACHINE_CLEARS.FP_ASSIST" giving a count of zero. Is that expected?
- How did you come up with these blocks for levels 2 and 3? They differ from the TMA Excel sheet here.
- How can I understand the details of each block in a level? I can refer to the TMA Excel sheet, but that is specific to Xeon architectures like Skylake.

Thanks.

CPati2 · ‎05-01-2018

Hi Dmitry,

The following "precise" events have not been patched into Linux perf (link). Can you please help me with the event codes and umasks for these?

MEM_UOPS_RETIRED.L2_HIT_LOADS_PS
MEM_UOPS_RETIRED.L2_MISS_LOADS_PS
RECYCLEQ.LD_SPLITS_PS
RECYCLEQ.LD_BLOCK_ST_FORWARD_PS

I can then add the details to these KNL JSON files to get the counter data for TMA analysis. I couldn't find the relevant details in the PMU documentation for Xeon Phi.

Thanks.

Dmitry_R_Intel1 · ‎05-10-2018

You can find KNL events here: https://download.01.org/perfmon/KNL/KnightsLanding_core_V9.json

Note that the '_PS' suffix doesn't affect the event code and umask. It is just a notation that tells the tool to configure the PEBS buffer for this event and get additional information from there (usually this is just a precise sample IP, which replaces the interrupt IP).
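As a rough sketch of how this could be used with Linux perf: strip the '_PS' suffix, look up the base event in the JSON, and build a raw event string. This assumes the JSON entries carry "EventName", "EventCode" and "UMask" fields with hex strings and a single event code per entry. The entry below is a placeholder with made-up codes, not a real KNL event, and the helper names are hypothetical. Whether the ":p" precise modifier is accepted for a given event on KNL is not something this sketch checks.

```python
import json


def strip_ps(name):
    """Drop a trailing '_PS'; the suffix does not change event code or umask."""
    return name[:-3] if name.upper().endswith("_PS") else name


def perf_raw_event(name, events):
    """Build a perf raw event string (r<umask><event>) for an event name.

    A ':p' modifier is appended when the requested name carried '_PS'.
    """
    base = strip_ps(name).upper()
    precise = base != name.upper()
    for entry in events:
        if entry["EventName"].upper() == base:
            code = int(entry["EventCode"], 16)
            umask = int(entry["UMask"], 16)
            raw = "r{:02x}{:02x}".format(umask, code)
            return raw + ":p" if precise else raw
    raise KeyError(base)


# Placeholder entry with made-up codes; real values come from the JSON file.
events = json.loads(
    '[{"EventName": "EXAMPLE.EVENT", "EventCode": "0x12", "UMask": "0x34"}]'
)

print(perf_raw_event("EXAMPLE.EVENT_PS", events))
print(perf_raw_event("EXAMPLE.EVENT", events))
```

The resulting string can be passed to `perf stat -e` or `perf record -e`.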

CPati2 · ‎05-10-2018

Hi Dmitry,

Thanks.

Should I expect the level 2 values to add up to the level 1 value, and the level 3 values to add up to level 2? That is often not the case for me.

For Xeon servers (non-KNL), Backend Bound is clearly divided at level 2 into "Memory Bound" and "Core Bound". For Xeon Phi (KNL), level 2 has "Memory Latency" and "Memory Reissues", so should I consider these as memory bound and core bound respectively?

Thanks.

Dmitry_R_Intel1 · ‎05-11-2018

No, for KNL levels 2 and 3 will not add up to the higher levels. Interpret them as weights: whichever is bigger is probably worth looking into first.

There is currently no direct way to get a Memory Bound vs Core Bound breakdown on KNL, unfortunately. Both "Memory Latency" and "Memory Reissues" are related to memory. You can only guess: if you have nothing big under them but the Back-End Bound is high, then you probably have core bound issues.
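That guess can be written down as a small heuristic. The sketch below is only an illustration: the function name and both cutoff values are hypothetical and should be replaced by whatever thresholds make sense for your workload (for example the ones in the metrics table). It takes the metrics as fractions from 0 to 1.

```python
def guess_backend_limiter(backend_bound, memory_latency, memory_reissues,
                          backend_cutoff=0.2, memory_cutoff=0.1):
    """Rough guess at what dominates the back end on KNL.

    All inputs are fractions in the range 0..1. The two cutoffs are
    hypothetical defaults, not values taken from VTune.
    """
    if backend_bound < backend_cutoff:
        return "backend not dominant"
    if max(memory_latency, memory_reissues) >= memory_cutoff:
        return "memory related"
    # High Back-End Bound with nothing big under the memory nodes.
    return "possibly core bound"


# Made-up inputs, only to show the three outcomes.
print(guess_backend_limiter(0.10, 0.05, 0.02))
print(guess_backend_limiter(0.90, 0.40, 0.05))
print(guess_backend_limiter(0.90, 0.03, 0.02))
```

Since levels 2 and 3 do not add up to level 1 on KNL, the result is a hint about where to look first, not a measurement.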

Please also check our tuning guide for KNL if you haven't done this yet: https://software.intel.com/sites/default/files/managed/1f/eb/Using_Intel_VTune_Amplifier_XE_on_Knights_Landing_1.1.pdf

CPati2 · ‎05-16-2018

Hi Dmitry,

Thank you.

Is it due to the Silvermont/KNL architecture that TMA's backend bottleneck is not specifically divided into core bound and memory bound? Why does TMA show a level 3 that is more focused on memory than on core?

I am more interested in the core bound bottleneck using TMA, specifically for Silvermont/KNL. Is there any other performance counter that I can use to get this? I have all the data I need except for the core bound bottleneck, and without that data it is difficult to reach a conclusion, whether memory bound is high or low.

Please suggest.

Thanks.