I need some suggestions and have a few questions about performance monitoring on a Core 2 Duo 6400.
I'm trying to collect the source and target addresses of every mis-predicted indirect branch (including returns) with minimal performance overhead. I've used the PEBS mechanism (event number 0xC5) to record the information and read the opcode to identify indirect branches. However, PEBS records only the target of the mis-predicted branch instruction. Hence I've tried other ways to collect both source and target information:
1. BTS with USR only: Since the Core 2 Duo has no branch status bit in its BTS record format, I used BTS together with PEBS (event number 0xC5). It incurs a 5x slowdown on average for the SPEC CPU2006 integer benchmarks.
2. Performance counter with LBR stack: I selected events 0x8E (BR_IND_MISSP_EXEC) and 0x90 (BR_RET_MISSP_EXEC) and set the counter value to 0xffffffffff to generate a PMI for every event. In the PMI service routine, I used 3 rdmsr instructions (to read out the LBR stack) and 2 wrmsr instructions (to reset the counter and re-enable the LBR stack). It takes 1900 cycles per PMI on average on linux-2.6.27. It slows down the SPEC CPU2006 integer benchmarks by 3x to 5x.
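For reference, the preset value in item 2 follows from the counter width: a counter that overflows after N events is loaded with 2^width - N. The sketch below is only an illustration of that arithmetic. The function name counter_preset is hypothetical, and the 40-bit width is an assumption inferred from the 0xffffffffff value quoted above.

```python
def counter_preset(period, width=40):
    """Value to load into a counter so that it overflows after `period` events.

    `width` is the counter width in bits; 40 is an assumption inferred from
    the 0xffffffffff preset mentioned in the post.
    """
    if not 1 <= period <= (1 << width):
        raise ValueError("period must be between 1 and 2**width")
    mask = (1 << width) - 1
    return ((1 << width) - period) & mask


# A period of 1 raises a PMI on every event; a period of 100 samples
# one event in a hundred.
print(hex(counter_preset(1)))    # 0xffffffffff
print(hex(counter_preset(100)))  # 0xffffffff9c
```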
Here are my questions:
Is there any way to record the source instruction address in PEBS records for event 0xC5?
Is there any way, other than what I've tried, to efficiently collect the source and target addresses of retired mis-predicted indirect branches?
Does the 0x8E (BR_IND_MISSP_EXEC) event include 0x94 (BR_IND_CALL_EXEC)?
What is the difference between 0x90 (BR_RET_MISSP_EXEC) and 0x91 (BR_RET_BAC_MISSP_EXEC)? How does the Core 2 Duo predict the target of returns?
Thanks in advance
I hope you won't mind if my answer is structured somewhat differently from your original questions.
1. Bad news first: There's no way of registering every branch (even only mis-predicted ones) without significant performance overhead. The explanation is quite simple: Typical workloads execute a branch roughly every 10 instructions. Processing a PMI takes about 2000 cycles (with PEBS/BTS, several hundred cycles to store each record plus a couple thousand to process the PMI eventually), which is roughly the execution time of a couple thousand arithmetic instructions. Recording every branch would therefore cost about 2000/10 = 200x overhead in the worst case. Given that the average mis-prediction rate is about 1%, recording only mis-predicted branches still leaves an overhead in the range of ~2-3x.
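The estimate above can be written out as a small back-of-the-envelope model. This is only a sketch: the function name is made up for illustration, and it assumes the baseline runs at about one cycle per instruction, so that extra cycles per instruction read directly as an overhead factor.

```python
def overhead_factor(instr_per_branch, cycles_per_record, recorded_fraction):
    """Extra cycles spent recording, per instruction of useful work.

    Assumes roughly 1 cycle per instruction in the unmonitored run, so the
    result can be read as "N times the original run time" of added cost.
    """
    return cycles_per_record * recorded_fraction / instr_per_branch


# Every branch recorded: one branch per ~10 instructions, ~2000 cycles each.
print(overhead_factor(10, 2000, 1.0))

# Only mis-predicted branches recorded, at a ~1% mis-prediction rate.
print(overhead_factor(10, 2000, 0.01))

# Sampling one mis-predicted branch in a hundred.
print(overhead_factor(10, 2000, 0.01 / 100))
```

The first case gives the 200x worst-case figure, the second gives about 2x of added cost (roughly 3x total run time), and the third shows why sampling brings the cost down to a few percent.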
2. Your first approach, correlating PEBS records for the precise C5 event (mis-predicted branches retired) with BTS records, was more correct than correlating exec events with LBRs. The branches you counted with exec events were not necessarily executed to retirement, which means there may be no address recorded for them in either the LBRs or the BTS buffer. The LBRs in this case will just give you logical proximity, that is, indicate that so many branches were mis-predicted while executing a certain loop or call chain (which loop and call chain is the question to be answered by means of binary or source code analysis).
3. With the above said, I have two suggestions:
(a) Continue with PEBS on C5 and program the PEBS buffer to overflow after collecting a single element, set bit 11 (FREEZE_LBRS_ON_PMI) in the IA32_DEBUGCTL MSR to freeze the LBRs when the PMI is raised, and find a matching source address (for your recorded destination) in the LBR stack. You'll have to tolerate the overhead in this case.
(b) Switch to a statistical approach and perform event-based sampling, taking one sample per ~100 occurrences of event 8E (all indirect mis-predicted branches), read the LBRs in your PMI handler and try to correlate the data with your source code. You'll have to sacrifice precision in this case.
By the way, you may use the same statistical approach with (a) to enable easier correlation and minimize overhead (at the expense of data precision).
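The matching step in (a) can be sketched as follows. This is a minimal illustration, not a handler implementation: it assumes the LBR from/to MSR pairs and the top-of-stack index have already been read into plain Python values, that the top-of-stack index points at the most recent entry, and that the PEBS record's instruction pointer equals the branch destination. The names lbr_newest_first and find_source are hypothetical.

```python
def lbr_newest_first(from_ips, to_ips, tos):
    """Order the circular LBR stack as (source, destination) pairs, newest first.

    `from_ips` and `to_ips` hold the values read from the LBR from/to MSRs in
    register order; `tos` is the top-of-stack index (assumed to point at the
    most recently written entry).
    """
    n = len(from_ips)
    return [(from_ips[(tos - i) % n], to_ips[(tos - i) % n]) for i in range(n)]


def find_source(entries, pebs_ip):
    """Return the source of the most recent branch whose destination is `pebs_ip`.

    Returns None when no LBR entry matches, for example when the branch has
    already been pushed out of the stack.
    """
    for src, dst in entries:
        if dst == pebs_ip:
            return src
    return None


# Made-up addresses, for illustration only.
from_ips = [0x400100, 0x400200, 0x400300, 0x400400]
to_ips = [0x400800, 0x400900, 0x400A00, 0x400B00]
entries = lbr_newest_first(from_ips, to_ips, tos=1)
print(hex(find_source(entries, 0x400A00)))  # 0x400300
```

Searching newest first matters when the same destination appears more than once in the stack: the most recent entry is the one most likely to correspond to the PEBS record.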
Hope this helps, and sorry about the bad news,
Thanks for the detailed reply,
Now I have some questions about the precision of the LBR stack with event-based sampling, since until now I had not doubted it. Here is my reasoning and my speculation:
I've collected mis-predicted indirect branches (8E and 91 events) and the LBR stack with freeze_on_PMI set. I compared the LBR stack entries with the application binaries whenever a PMI was generated. The top LBR stack entry always contained an indirect branch. Hence I thought that, although event sampling is not precise, reading the LBR stack would give the latest source and target of a mis-predicted indirect branch. I also thought that, even though the PMI is generated a couple of instructions after the misprediction event, it would be generated before the correct target is fetched during the misprediction penalty cycles.
I'm wondering if my speculation is correct.
Thanks in advance,