I hope you won't mind if my answer is structured somewhat differently from your original questions.
1. Bad news first: There's no way to record every branch (even only the mis-predicted ones) without significant performance overhead. The explanation is quite simple: typical workloads execute roughly one branch per ~10 instructions, and servicing a PMI takes about ~2000 cycles (several hundred cycles to store the PEBS/BTS records, plus a couple of thousand to process the PMI itself), which is roughly the execution time of a couple of thousand arithmetic instructions. Sampling every branch would therefore cost about 200x in the worst case. Given an average mis-prediction rate of ~1%, sampling only mis-predicted branches still leaves an overhead in the range of ~2-3x.
2. It was more correct when you tried to correlate PEBS records for the precise C5 event (mis-predicted branches retired) with BTS records than when you correlated exec events with LBRs: the branches you counted with exec events were not necessarily executed to retirement, which means there may be no matching address recorded in either the LBRs or the BTS buffer. The LBRs in that case give you only logical proximity, i.e. they indicate that some number of branches were mis-predicted while executing a certain loop or call chain (which loop and which call chain is the question to be answered by binary or source-code analysis).
3. With the above said, I have two suggestions:
(a) Continue with PEBS on C5, program the PEBS buffer to overflow after collecting a single element, set bit 11 in the IA32_DEBUGCTL MSR to freeze the LBRs on counter overflow, and look for a source address matching your recorded destination in the LBR stack. You'll have to tolerate the overhead in this case.
(b) Switch to a statistical approach and perform event-based sampling every ~100 occurrences of the 8E event (all indirect mis-predicted branches), read the LBRs in your PMI handler, and try to correlate the data with your source code. You'll have to sacrifice precision in this case.
By the way, you may apply the same statistical approach to (a) as well, to ease correlation and minimize the overhead (again at the expense of precision).
Hope this helps, and sorry about the bad news,