Analyzing overhead of polymorphic code - no branching metrics available?

T_C · ‎02-23-2014

Hi,

I am analyzing the difference between two designs which process millions of messages. One design uses polymorphism and the other doesn't. In the polymorphic design, each message is represented by a polymorphic subtype.

I have profiled both designs using VTune. The High-level summary data seems to make sense- the polymorphic design has a higher "branch mispredict" rate, higher CPI and higher "ICache misses" rate than the non-polymorphic version implemented with IF statements.

The polymorphic design has a line of source code like this:

object->virtualFunction();

and this is called millions of times (where the sub type changes each time). I am expecting the polymorphic design to be slower because of branch target mispredictions/instruction misses. As said above, the VTune "summary" tab seems to confirm this. However, when I go to the metrics next to the line of source code there are absolutely no metrics except for:

Filled pipeline slots total -> Retiring -> General retirement
Filled pipeline slots self -> Retiring -> General retirement
Unfilled pipeline slots total -> Front end bound -> Front end bandwidth -> Front end bandwidth MITE
Unfilled pipeline slots self -> Front end bound -> Front end bandwidth -> Front end bandwidth MITE

None of the branch prediction columns have data, nor do the instruction cache miss columns??

Could somebody please comment on whether this seems sensible? To me it doesn't- how can there be no branch misprediction or instruction cache miss statistics for a line of polymorphic code where the branch target will constantly be changing per message?
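A minimal sketch of the two designs being compared, for readers following along. The names Message, Quote, Trade, Kind, processPolymorphic and processTagged are hypothetical and stand in for the real application code, which is not shown in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical message hierarchy: one subtype per message kind.
struct Message {
    virtual ~Message() = default;
    virtual std::uint64_t handle() const = 0;  // stands in for virtualFunction()
};
struct Quote : Message { std::uint64_t handle() const override { return 1; } };
struct Trade : Message { std::uint64_t handle() const override { return 2; } };

// Polymorphic design: one indirect call whose target can change per message.
std::uint64_t processPolymorphic(const std::vector<std::unique_ptr<Message>>& msgs) {
    std::uint64_t sum = 0;
    for (const auto& m : msgs) {
        assert(m != nullptr);
        sum += m->handle();  // the object->virtualFunction() line
    }
    return sum;
}

// Non-polymorphic design: a type tag tested with IF statements.
enum class Kind { Quote, Trade };

std::uint64_t processTagged(const std::vector<Kind>& kinds) {
    std::uint64_t sum = 0;
    for (Kind k : kinds) {
        if (k == Kind::Quote) {
            sum += 1;
        } else {
            sum += 2;
        }
    }
    return sum;
}
```

Both functions do the same work per message, so any difference a profiler reports between them should come from how the dispatch itself is performed.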

T_C · ‎03-01-2014

Peter Wang (Intel) wrote:

Column "Branch Mispredict" is shown in the bottom-up report. Column "BR_INST_RETIRED.ALL_BRANCHES" is shown in the source view report. I'm using Update 15.

The first one is a metric and the second one is an event. Sorry, I can only show you the event BR_MISP_EXEC.ANY; see the attached screenshot, with all branch mispredictions at line 65: p->f();

Hi Peter,

Do you have to configure anything to see the column BR_INST_RETIRED.ALL_BRANCHES? I don't have it in my source-view report. I only have "Branch mispredict" twice, always under one of the two "Bad Speculation" columns (see my attachment).

Peter_W_Intel · ‎03-01-2014

Yes. I get the same result (a "Branch Mispredict" column) when displaying the source view on a Sandy Bridge processor.

My earlier source view report displayed events because I was working on a Nehalem processor. It depends on your processor type: on Nehalem, the source view report displays all columns as events (instead of metrics).

T_C · ‎03-07-2014

Bit of an update- I have just borrowed a Nehalem laptop. The good news is that I can actually see the raw event metric columns. However, I am not totally convinced VTune consistently works (shock horror).

For starters, why do some event counts (BR_INST_RETIRED.NEAR_CALL) appear adjacent to the line of the function definition, yet others (BR_INST_RETIRED.ALL_BRANCHES) appear adjacent to the closing bracket of the function? The function I am talking about is a simple:

void myFunc(shared_ptr<X> x){     //BR_INST_RETIRED.NEAR_CALL is non-zero here
    mutex.lock();
    x->virtualCall();             //None of the 3x Nehalem "BR_" event counters in the GUI have values here???
    mutex.unlock();
}                                 //BR_INST_RETIRED.ALL_BRANCHES is non-zero here

It just doesn't make a lot of sense. In fact the only event counter which has a value for the virtual call line is MEM_LOAD_RETIRED.L1D_HIT.

So- even with a Nehalem processor I cannot see how VTune can allow me to determine the % of polymorphic branch target mispredictions vs predictions? :s

Peter_W_Intel · ‎03-13-2014

More comments:

1. General Exploration analysis needs the application to run longer, e.g. 3-5 minutes. I modified the original test case.

2. Using rand() instead of __rdtsc() will be much better. I checked all the TSC values in a trace I had collected locally, and every one of them had bit 0 cleared! So there may be systems where the mispredictions appear more (or less) frequently (between Nehalem and Sandy Bridge). You can try my attached test case; the results are the same for Branch Mispredict.

3. You may add more branches and enlarge their bodies (keeping the branch entries out of the same I-cache line) in main(), or adjust the sampling interval in accordance with the expected number of mispredicts (create a custom analysis to modify the SAV, the sample-after value).
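Point 2 can be illustrated with a minimal sketch. This is not the attached test case (which is not reproduced in this thread); the types Base, A, B and the function runRandomDispatch are hypothetical names chosen for illustration. The call target is chosen with rand() on every iteration, so the choice does not depend on the low bit of the timestamp counter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical two-subtype hierarchy, similar in spirit to p->f().
struct Base {
    virtual ~Base() = default;
    virtual int f() const = 0;
};
struct A : Base { int f() const override { return 1; } };
struct B : Base { int f() const override { return 2; } };

// Calls p->f() `iterations` times, picking the target with rand() each time.
// Returns how many calls went to each subtype (index 0 = A, index 1 = B).
std::vector<std::size_t> runRandomDispatch(std::size_t iterations, unsigned seed) {
    A a;
    B b;
    const Base* targets[] = { &a, &b };
    std::vector<std::size_t> hits(2, 0);
    std::srand(seed);
    for (std::size_t i = 0; i < iterations; ++i) {
        const Base* p = targets[std::rand() % 2];  // selection from rand(), not the TSC
        int r = p->f();                            // indirect call under test
        assert(r == 1 || r == 2);
        ++hits[static_cast<std::size_t>(r - 1)];
    }
    return hits;
}
```

If the selection were instead taken from bit 0 of a TSC value that is always cleared, as observed above, every call would go to the same subtype and the branch would be trivially predictable.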

Regards, Peter

Peter_W_Intel · ‎03-13-2014

If you change the SAV, you should see Branch Mispredict counts on main() when you open the result with amplxe-gui. For example:

# amplxe-cl -collect-with runsa -knob event-config=BR_MISP_EXEC.ALL_BRANCHES:sa=5000 -- ./ploymorphic_2

For using the command line with events directly, please see this article.

T_C · ‎03-14-2014

Hi- I'm a bit confused why the application needs to run between 3 and 5 minutes- surely VTune is either measuring using the event counters or it isn't?

My actual code (not the example I provided) is a fairly large application. I just ran it for 4 minutes and the object->virtualMethod() line still didn't have any branch mispredictions.

One thing which is really puzzling is why lines of code which are simply a function's closing bracket "}" have metrics at all. The value wasn't the sum of the column metrics for all lines in that function.

Peter_W_Intel · ‎03-14-2014

> I'm a bit confused why the application needs to run between 3 and 5 minutes- surely VTune is either measuring using the event counters or it isn't?

Peter: I ask for the application to run longer so that VTune can capture more samples for events, e.g. BR_MISP_EXEC.ALL_BRANCHES.

> My actual code (not the example I provided) is a fairly large application. I just ran it for 4 minutes and the object->virtualMethod() line still didn't have any branch misprediction.

Peter: the reason could be:

1. Please understand that the "Branch Mispredict" metric's formula is 20 * BR_MISP_EXEC.ALL_BRANCHES / CPU_CLK_UNHALTED.THREAD. That is, each branch mispredict is counted as a 20-cycle penalty. However, if the CPU clock count is much larger than the mispredict count, "Branch Mispredict" will still be zero.

2. General Exploration captures many events. Within one sampling session, multiplexing is used, so fewer BR_MISP_EXEC.ALL_BRANCHES samples are collected. That is why I recommend using the command line to change the SAV: a) don't capture other events; b) capture more BR_MISP_EXEC.ALL_BRANCHES event counts.

I have tested "amplxe-cl -collect-with runsa -knob event-config=BR_MISP_EXEC.ALL_BRANCHES:sa=5000 -- ./ploymorphic" with your original code, and BR_MISP_EXEC.ALL_BRANCHES counts are displayed on main() in the report. It should work with a real project if the code has more than 5000 branch mispredicts.

Again, I cannot guarantee that the "Branch Mispredict" metric will be non-zero; it depends on how much the mispredicts impact performance. Actually, you can calculate it yourself with the formula: 20 * BR_MISP_EXEC.ALL_BRANCHES / CPU_CLK_UNHALTED.THREAD

Collecting events directly never displays any metric, but you can calculate the metric yourself. The formula comes from General Exploration analysis: hover the cursor over the metric's column to display it.
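The manual calculation can be sketched as below. This is only an illustration of the formula quoted in this thread; the function names are hypothetical, and the assumption that an event count can be estimated as samples multiplied by the sample-after value is the usual event-based sampling approximation rather than anything VTune-specific shown here.

```cpp
#include <cassert>
#include <cstdint>

// Rough event count from event-based sampling: each collected sample stands
// for about `sampleAfterValue` occurrences of the event (assumption).
std::uint64_t estimateEventCount(std::uint64_t samples, std::uint64_t sampleAfterValue) {
    return samples * sampleAfterValue;
}

// "Branch Mispredict" metric as quoted in the thread:
//   20 * BR_MISP_EXEC.ALL_BRANCHES / CPU_CLK_UNHALTED.THREAD
double branchMispredictMetric(std::uint64_t mispredicts, std::uint64_t cycles) {
    assert(cycles > 0);  // the metric is undefined without any clock ticks
    return 20.0 * static_cast<double>(mispredicts) / static_cast<double>(cycles);
}
```

For example, 5,000 mispredicts over 1,000,000 cycles gives 0.1, while the same 5,000 mispredicts over 1,000,000,000 cycles gives 0.0001, which is consistent with the metric appearing as zero when the clock count dominates.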

Is anything still unclear?