I am experimenting with the BTS (Branch Trace Store) feature and observe a very large performance penalty. Several sources say this is normal. On the other hand, I have seen academic publications that use this feature and report only a very small overhead.
Therefore, I ran several experiments on different CPUs, with different DebugCtl settings and different memory caching types.
What confuses me most is that the experiments with only the TR flag enabled are *much* slower than those with both the TR flag *and* the BTS flag enabled. From my understanding, enabling TR+BTS does "more" than TR alone: the BTMs (branch trace messages) are written not only to the system bus but also to the Debug Store.
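For reference, the two configurations being compared can be sketched as below. This is only a minimal sketch: the MSR address and the bit positions of TR, BTS, and BTINT are taken from the Intel SDM description of IA32_DEBUGCTL, while the helper names are hypothetical. Actually writing the MSR requires ring 0 (for example from a kernel module) and is not shown here.

```c
#include <stdint.h>

/* IA32_DEBUGCTL MSR address and flag bits, per the Intel SDM. */
#define MSR_IA32_DEBUGCTL 0x1D9u
#define DEBUGCTL_TR       (1ull << 6)  /* emit branch trace messages (BTMs) */
#define DEBUGCTL_BTS      (1ull << 7)  /* store BTMs in the Debug Store buffer */
#define DEBUGCTL_BTINT    (1ull << 8)  /* interrupt when the BTS buffer fills */

/* Hypothetical helper: value for the "TR only" experiment.
 * Keeps unrelated bits of the current MSR value, clears BTS and BTINT. */
uint64_t debugctl_tr_only(uint64_t current)
{
    return (current & ~(DEBUGCTL_BTS | DEBUGCTL_BTINT)) | DEBUGCTL_TR;
}

/* Hypothetical helper: value for the "TR + BTS" experiment.
 * BTINT is left clear, so the BTS buffer is used as a circular buffer. */
uint64_t debugctl_tr_bts(uint64_t current)
{
    return (current & ~DEBUGCTL_BTINT) | DEBUGCTL_TR | DEBUGCTL_BTS;
}
```

Starting from a DebugCtl value of 0, the first helper yields 0x40 and the second 0xC0.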
Am I wrong? What is the reason for this "strange" observation?
Branch Trace is designed to help tools profile and diagnose code. It can capture a lot of information, and the associated cost (delay) grows with the amount and frequency of data your tool asks the hardware to capture. Infrequent sampling incurs a smaller overhead. Capturing frequently is like attaching an exhaust emission analyzer to a car's tailpipe: the car won't drive normally or get its normal gas mileage.
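To give a rough feel for how quickly the captured data grows, here is a small sketch. It assumes the 64-bit BTS record layout as I understand it from the Intel SDM (three 64-bit fields per taken branch); the struct and function names are hypothetical, and the sketch only estimates data volume, not the actual slowdown.

```c
#include <stdint.h>

/* Assumed 64-bit BTS record: one record is written per taken branch. */
struct bts_record64 {
    uint64_t from;   /* linear address of the branch instruction */
    uint64_t to;     /* linear address of the branch target */
    uint64_t flags;  /* misc. flags (assumption: includes a "predicted" bit) */
};

/* Hypothetical helper: bytes of trace data produced by a number of
 * taken branches when every branch is recorded. */
uint64_t bts_bytes_for_branches(uint64_t taken_branches)
{
    return taken_branches * sizeof(struct bts_record64);
}
```

Under this assumption, recording every one of a million taken branches produces about 24 MB of trace data, which is why recording less often costs so much less.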