I am an undergrad working on a performance profiling project. Specifically, I am measuring branch-miss impact on a piece of code using the Amplifier XE 2013 suite (vTune). I have found out where the highest branch miss rates occur.
My current goal is to come up with some kind of confirmation that this is indeed where the misses are happening. My section of code contains 27 branch-like statements (if, else if) that are condition based. I have successfully found a way to change these conditional branches into indirect jumps.
We are doing this on i5 Sandy Bridge 2400 chips running Ubuntu 12.10. My understanding is that Intel branch prediction breaks down into 2 parts: the predictor and the target buffer. The predictor does just that: it predicts taken or not taken on conditional jumps using the branch history table and other means. The target buffer predicts WHERE that jump is going, based on where it went the last time that specific jump was encountered.
With my modified code, I have successfully bypassed the branch predictor (there is no condition to test for a jump). This is confirmed via disassembly: I see a jmpq *(register). However, my indirect jumps are still subject to misses in the branch target buffer, and I have noticed that my runtime is actually WORSE after switching my conditional branches over to indirect jumps. This does make sense, because I am calculating my jump address (hence the indirectness) and therefore the BTB has a much harder time predicting it (there is a large number of possible addresses I could be jumping to).
My question: is there a way in vTune to profile branch misses more directly? I want to know WHERE I'm missing (branch predictor or BTB). If you consider a basic conditional, there are 4 outcomes: the branch predictor predicts taken or not taken, and each of those then has a hit or a miss in the BTB. If I could show that my misses in the BTB are increasing and the misses in the branch predictor are decreasing as I move my conditionals over to indirect jumps, then I can accomplish my goal.
Any input is appreciated. Please let me know if I have left anything out. And thank you!
I posted a link, but it triggered the anti-spam filter.
The required formula for calculating the branch prediction rate is: Rate = 100 - ((Mispredicted Branches Retired / Branches Retired) * 100)
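As a minimal sketch of that formula, assuming you have already read the two retired-branch counts out of the profiler, it can be computed like this. The function name and arguments are hypothetical and are not part of vTune:

```python
def branch_prediction_rate(mispredicted_retired, branches_retired):
    """Return the branch prediction rate in percent.

    Both arguments are raw event counts taken from a profiling run.
    The names are illustrative only; they are not vTune identifiers.
    """
    if branches_retired <= 0:
        raise ValueError("branches_retired must be positive")
    # Rate = 100 - ((Mispredicted Branches Retired / Branches Retired) * 100)
    return 100.0 - (mispredicted_retired / branches_retired) * 100.0


if __name__ == "__main__":
    # Made-up counts purely for illustration.
    print(f"{branch_prediction_rate(2, 1000):.1f}%")
```

Note that this gives only the overall rate; it says nothing about which hardware unit caused the mispredictions.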
Yes, I have the misprediction formulas. I want to know where the misses are occurring. I understand that they can miss in either the branch predictor or the branch target buffer.
vTune can tell me the code location. I am looking for confirmation. I have two versions of the code: one that uses conditional branches (if statements) and one that uses indirect jumps. I want to compare where the misses are occurring in both of them. In the conditional version, misses should be occurring in the predictor and (although rarely) in the BTB. In the version that uses indirect jumps, I should have no misses in the predictor and all of the misses in the BTB. This also serves to confirm that the code vTune identifies is indeed the bottleneck. I have already identified where the misses are occurring; I want to know how.
Now I understand your question :) I cannot find in the Intel manuals any exact counter setting which will tell you whether the branch predictor or the BTB was responsible, that is, which hardware unit mispredicted the branch.
Intel doesn't comment on BTB accesses; at least they don't document them or reply in this forum about this. I built some tests to look at the performance of my CPU and found the BPUCLEAR and BACLEAR events to be of importance. There are 2 variants of BPUCLEAR, and based on my findings I suspect they delineate the L1 and L2 BTB sizes. The BACLEAR is for making predictions in decode when they find a branch that was not found in the BTB and that, from the "global" history of all branches, their predictor wants taken (that's why it's an 8 cycle penalty rather than the stiff 22-23 cycle penalty for a redirect from the ILD). Hope that helps. Below I pasted the link where I was inquiring about this:
Also, in my opinion there's no real formula for when branch mispredicts are important. 2 mispredicts per thousand instructions (pti) when you're at 1 ipc or at 4 ipc are very different from a performance lever perspective. If you ascribe a total penalty of ~50 clks to those 2 mispredicts pti, then in the 1 ipc case removing them reduces the clks pti from 1000 to 950 and makes the new ipc 1000/950 ~ 1.05, but in the 4 ipc case the clks pti go from 250 to 200 and your new ipc is 5, up from 4, which is a 25% boost. Yes, you can't get 5 ipc on SB/IB, but still, you get the point.
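The arithmetic in that example can be sketched as follows. This is only a back-of-the-envelope model under the same assumptions as the post (a fixed total penalty of ~50 clks per thousand instructions that disappears entirely once the mispredicts are removed); the function name is hypothetical:

```python
def ipc_without_mispredicts(ipc, penalty_clks_pti, instructions=1000):
    """Estimate the IPC after removing a fixed mispredict penalty.

    ipc              -- measured instructions per clock, penalty included
    penalty_clks_pti -- total clocks lost to mispredicts per `instructions`
    """
    clks = instructions / ipc            # clocks spent today
    new_clks = clks - penalty_clks_pti   # clocks once the penalty is gone
    if new_clks <= 0:
        raise ValueError("penalty exceeds the total clock budget")
    return instructions / new_clks


if __name__ == "__main__":
    for ipc in (1, 4):
        new_ipc = ipc_without_mispredicts(ipc, 50)
        gain = (new_ipc / ipc - 1) * 100
        print(f"ipc {ipc} -> {new_ipc:.2f} ({gain:.1f}% boost)")
```

The same 50 clks are a small fraction of 1000 clks but a large fraction of 250 clks, which is why the identical mispredict rate matters far more at high ipc.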
Thank you very much for the reply. I am slowly coming to the same conclusion. However, I did find some vTune events for measuring indirect and conditional branches separately (taken branches only, however).
You can get all the branch mispredict counts from Intel's events: Ret, Jcc, direct JMP, indirect jump, etc., with the number taken/not-taken and, for each of those, the number of mispredicts. Intel really has a lot of information on branch performance, because it is one of the largest drivers of performance. The 2 links below are all you need to know, but I can't attest to what Intel's tools do for you; I have my own that I wrote and use:
That's what I was looking for! It doesn't appear to be in the optimization manual.
What is the difference between those two links? The event mask is different (88 and 89), but the sub-events and their descriptions are identical.
Also, what is the difference between an indirect branch near call and indirect branch not near call?