Will Frontend Bound and Branch Mispredicts overlap?

Zhiwei_C_ · ‎04-02-2018

Hi all,

If a program has many branches and speculation is very poor, will it show a large percentage in both Frontend Bound and Branch Mispredicts?

I ask because I found that Frontend Bound counts unused slots delivered to the RAT, while Bad Speculation counts recovery cycles. If a program has many branch mispredicts, that causes many unused slots and, at the same time, a large recovery-cycle count. Will the two overlap?

Thank you:)

Dmitry_R_Intel1 · ‎04-02-2018

Hi,

Yes, it is possible for the Frontend metric to be quite high for code with a lot of mispredicted branches.

Specifically, VTune has a "Branch Resteers" metric under FE Bound to account for this. Let me copy-paste the metric description:

"... Branch Resteers estimates the Frontend delay in fetching operations from corrected path, following all sorts of mispredicted branches. For example, branchy code with lots of mispredictions might get categorized under Branch Resteers...."

Zhiwei_C_ · ‎04-02-2018

As far as I know, Frontend Bound + Bad Speculation + Backend Bound + Retiring = 100%. If Frontend Bound and Bad Speculation overlap, won't that make the other percentages too small and therefore unreliable?

Frontend Bound = IDQ_UOPS_NOT_DELIVERED.CORE / (4 * CPU_CLK_UNHALTED.THREAD)

Bad Speculation = (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / SLOTS, where SLOTS = 4 * CPU_CLK_UNHALTED.THREAD

Does IDQ_UOPS_NOT_DELIVERED.CORE keep counting during the cycles counted by INT_MISC.RECOVERY_CYCLES? If yes, the two metrics must overlap.
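
To make the two formulas above concrete, here is a minimal Python sketch that evaluates them. The counter values are made up purely for illustration (they are not measurements), and the helper names are hypothetical.

```python
# Hypothetical counter readings, for illustration only (not measured data).
counters = {
    "CPU_CLK_UNHALTED.THREAD": 1_000_000,
    "IDQ_UOPS_NOT_DELIVERED.CORE": 1_200_000,
    "UOPS_ISSUED.ANY": 2_000_000,
    "UOPS_RETIRED.RETIRE_SLOTS": 1_600_000,
    "INT_MISC.RECOVERY_CYCLES": 50_000,
}

PIPELINE_WIDTH = 4  # issue slots per cycle, the "4" in the formulas above


def slots(c):
    """Total issue slots: width times unhalted thread cycles."""
    return PIPELINE_WIDTH * c["CPU_CLK_UNHALTED.THREAD"]


def frontend_bound(c):
    """Fraction of slots in which the frontend delivered no uop."""
    return c["IDQ_UOPS_NOT_DELIVERED.CORE"] / slots(c)


def bad_speculation(c):
    """Fraction of slots wasted on non-retired uops plus recovery."""
    wasted = (c["UOPS_ISSUED.ANY"]
              - c["UOPS_RETIRED.RETIRE_SLOTS"]
              + PIPELINE_WIDTH * c["INT_MISC.RECOVERY_CYCLES"])
    return wasted / slots(c)


print(f"Frontend Bound:  {frontend_bound(counters):.1%}")
print(f"Bad Speculation: {bad_speculation(counters):.1%}")
```

Whether the slots behind the `4 * INT_MISC.RECOVERY_CYCLES` term are also present in `IDQ_UOPS_NOT_DELIVERED.CORE` is exactly the open question; the sketch only evaluates the formulas as written.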

Dmitry_R_Intel1 · ‎04-03-2018

No, they should not overlap. Frontend Bound + Bad Speculation + Backend Bound + Retiring will still be 100%. Some of the stalled pipeline slots will simply be classified as Frontend Bound instead of Bad Speculation, but the "Branch Resteers" node under Frontend will hint that the real cause is branch mispredicts.
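
A rough sketch of why the four top-level categories always add up to 100%: it assumes, as in the published Top-Down formulas as far as I know them, that Retiring is UOPS_RETIRED.RETIRE_SLOTS / SLOTS and that Backend Bound is computed as the remainder. The function name and the input numbers are hypothetical.

```python
def top_level_breakdown(clk, not_delivered, issued, retired_slots,
                        recovery_cycles, width=4):
    """Return the four top-level Top-Down fractions.

    Assumption: Backend Bound is derived as the remainder, so the
    four values sum to 1.0 by construction.
    """
    slots = width * clk
    frontend = not_delivered / slots
    bad_spec = (issued - retired_slots + width * recovery_cycles) / slots
    retiring = retired_slots / slots
    backend = 1.0 - (frontend + bad_spec + retiring)
    return {
        "Frontend Bound": frontend,
        "Bad Speculation": bad_spec,
        "Retiring": retiring,
        "Backend Bound": backend,
    }


# Made-up counter values, for illustration only.
breakdown = top_level_breakdown(
    clk=1_000_000,
    not_delivered=1_200_000,
    issued=2_000_000,
    retired_slots=1_600_000,
    recovery_cycles=50_000,
)
for name, value in breakdown.items():
    print(f"{name:16s} {value:.1%}")
print(f"{'Total':16s} {sum(breakdown.values()):.1%}")
```

Under that assumption the total cannot exceed 100%. If some slots were counted under both Frontend Bound and Bad Speculation, the symptom would be an understated Backend Bound rather than a total above 100%.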

Zhiwei_C_ · ‎04-03-2018

OK, I can understand your statement that "some of the stalled pipeline slots will be classified as due to Frontend instead of Bad Speculation": the recovery must cause many cycles in which the Frontend issues 0 uops to the Backend, and that is reflected in "Branch Resteers".

But "Bad Speculation" = (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / SLOTS.

The description of INT_MISC.RECOVERY_CYCLES is "Core cycles the Resource allocator was stalled due to recovery from an earlier branch misprediction or machine clear event".

As I understand it, the "Resource allocator stalled" cycles run from the moment the mispredicted branch is flushed until instructions from the corrected path are taken into the RS, so they must include the cycles the Frontend spends resteering. I am not sure I understand this correctly.

I tested the "int_misc_recovery_cycles" and "int_misc_clear_resteer_cycles" counters in several cases and found that "int_misc_recovery_cycles" is always larger than "int_misc_clear_resteer_cycles". So I think "int_misc_recovery_cycles" overlaps with Frontend stall cycles, which means "Bad Speculation" overlaps with "Frontend Bound". Rather than "stalled pipeline slots will be classified as due to Frontend instead of Bad Speculation", those slots would be classified as both Frontend Bound and Bad Speculation.
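
To put a number on how large such an overlap could be, here is a small Python sketch of a worst-case bound. It assumes, hypothetically, that every recovery cycle also contributes a full 4 slots to IDQ_UOPS_NOT_DELIVERED.CORE; that is my suspicion, not a documented hardware behavior. The function name and inputs are made up for illustration.

```python
def max_overlap_fraction(clk, not_delivered, recovery_cycles, width=4):
    """Upper bound on the share of slots counted in both metrics.

    Assumption (unverified): each recovery cycle could also add
    `width` slots to IDQ_UOPS_NOT_DELIVERED.CORE. The overlap can
    never exceed the not-delivered slot count itself.
    """
    slots = width * clk
    overlap_slots = min(width * recovery_cycles, not_delivered)
    return overlap_slots / slots


# Made-up counter values, for illustration only.
bound = max_overlap_fraction(
    clk=1_000_000,
    not_delivered=1_200_000,
    recovery_cycles=50_000,
)
print(f"Worst-case double-counted slots: {bound:.1%}")
```

If the event is in fact not incremented during recovery, the real overlap is zero and this bound is simply not reached.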

Please correct me if I have made a mistake somewhere.

Dmitry_R_Intel1 · ‎04-03-2018

The top-level "Front-End Bound" node is based on the IDQ_UOPS_NOT_DELIVERED.CORE event. The description of this event is as follows: "Uops not delivered to Resource Allocation Table (RAT) per thread when backend of the machine is not stalled". I think the "when backend of the machine is not stalled" condition is what keeps these metrics from overlapping.

Zhiwei_C_ · ‎04-04-2018

I know that. IDQ_UOPS_NOT_DELIVERED.CORE: "Count issue pipeline slots where no uop was delivered from the front end to the back end when there is no back-end stall."

Now the question is whether a "back-end stall" includes recovery cycles.

In the "64-ia-32-architectures-optimization-manual", the Front-end Bottleneck is described as follows: "Front-end bottleneck occurs when front-end of the machine is not delivering uops to the back-end and the back-end is not stalled. Cycles where the back-end is not ready to accept micro-ops from the frontend should not be counted as front-end bottlenecks even though such back-end bottlenecks will cause allocation unit stalls, eventually forcing the front-end to wait until the back-end is ready to receive more uops."

And the paper "A Top-Down Method for Performance Analysis and Counters Architecture" mentions: "A backend-stall is a backpressure mechanism the Backend asserts upon resource unavailability (e.g. lack of load buffer entries)."

So I think that during recovery the backend can surely accept uops from the frontend. That is, IDQ_UOPS_NOT_DELIVERED.CORE will continue to count.

Dmitry_R_Intel1 · ‎04-04-2018

Well, I agree that this is ambiguous and that the event documentation is not sufficient to get a definite answer. So we probably need someone who knows the PMU internals well (I'll try to reach such people, but it may take time).

Still, I think it is quite probable that IDQ_UOPS_NOT_DELIVERED.CORE is not incremented during recovery, and the current Top-Down formulas seem to assume this.

Zhiwei_C_ · ‎04-04-2018

OK, I'll wait for your reply.

Thank you for your patient replies :)