Branch predictor weird behaviour - how it work?

NC_s_ · ‎04-28-2015

Hi,

I try to optimize a software that uses a lot of branches so I try to understand how the branch predictor and the BTB work.
For that, I use the performances counters(I enabled it by a kernel module).
More precisely, I use two events for the counters on an i7 CPU:
- Branch Instruction Retired ( event select : C4H, Umask : 00H)
- Branch Misses Retired (event select : C5H, Umask : 00H)
Also, I count these events only on ring 3( the OS flag is set to 0 for each IA32_PERFEVTSELx MSRs).

I'm doing two experiments.
The first one :

Between two rdpcm instruction in ring 3, I wrote this code :

mov $1,%r10
loop0:
dec %r10
jz loop0

mov $1,%r10
loop1:
dec %r10
jz loop1

mov $1,%r10
loop2:
dec %r10
jz loop2

... (the block is repeated 100000 times)

when I execute this code, performance counters give me 200293 branches and 98863 mispredicted branches

Because each branch is executed twice and because there are 100000 branches, there is about 200000 executed(retired) branches.

I explain 98863 as follow :
The first time the predictor detect the branch on one loop, it is not in the BTB so the static predictor is used and, because the target is before the branch, the branch is predicted to be taken according the Optimization Reference Manual.
In this case the prediction is correct. When the branch is executed a second time it is also predicted as taken but this time the branch is not taken and there is a misprediction.
Because it is not taken, the BTB is not updated and the same story happens to all loops : 1 misprediction over 2 : 98863/200293 is near 1/2 . So I can understand these results but please tell me if my interpretation is wrong.

the following second experiment is more strange to me:

-------

second experiment: I changed 'jz' to 'jnz'

Between two rdpcm instruction in ring 3, I wrote this code :

mov $1,%r10
loop0:
dec %r10
jnz loop0

mov $1,%r10
loop1:
dec %r10
jnz loop1

mov $1,%r10
loop2:
dec %r10
jnz loop2

... (the block is repeated 100000 times)

I expected the following behaviour :
0) on each loop, the branch is executed once so the performance counter will count about 100000 executed branches.
1) when the predictor see 'jnz', it is not in the BTB so the static predictor predicts that the branch to be taken
2) the branch is not taken in the code so a misprediction occurs and the BTB is not updated because it is not taken.
3) there is a misprediction at each loop so i expect about 100000 mispredictions.

the performance counters give me 1002929 executed branches : that's OK! but it gives me only 606 misprediction and that is what I don't understand. Why there is so few mispredictions?
I thought this is because the CPU understand that the branches are never taken, but in this case why it doesn't understand that ,on the first experiment, a branch is always "taken" then "not taken"?

Thank you !

McCalpinJohn · ‎05-21-2015

This is a bit outside of my area of expertise, but I think that a branch is only counted as "mispredicted" if there is actually a "prediction" from the BTB. In other words the static branch "predictions" don't count as predictions for the purposes of this performance counter event.

You should be able to verify this with a few additional experiments.

TimP · ‎05-21-2015

I'm less of an authority than John, but I believe many CPUs have been designed such that the static prediction of a backward branch would be for it to be taken (because the most important use is for controlling a loop of non-zero count).

With such a short loop, the BTB may have sufficient history after a few times through that it can predict the number of iterations of the short inner loops. So it need only mispredict the number of iterations a few times.

In my experience, significant misprediction rate on non-random branching happens mainly when there are too many branches for BTB capacity.

Bernard · ‎05-27-2015

As far as my understanding goes backward branches are usually predicted as a taken and forward branches are predicted to be not taken.Usually at some point BTB will need to have a history of taken/not taken forward branches and will try to predict the outcome by looking up BTB history. After 1e+5 iterations I think that BTB has enough history of such a short loop and simple backward branch that possibly it can mispredict only those few hundred branches which occurres at the begining of inner loop iterations.

McCalpinJohn · ‎05-27-2015

The code above is a very long sequence of consecutive loops that either branch backwards once (in the first case) or never (in the second case). The Branch Target Buffer tracks branch instructions by program counter. Since each of these branches is at a different address, each branch will be found in the BTB exactly once in the first case and not at all in the second case.

I recall reading somewhere that "mispredicted branches" only counted BTB predictions and not static predictions, but I can't find the reference now.