Can't see my previous message, so reposting it...
I'm interested in the Branch Trace Store (BTS) feature of the Pentium 4 processor, but it seems to be very slow. On my computer (CPUID: 0F29), "loop $" executes about 380 times slower with BTS active, even when the buffers reside completely in L1 cache (interrupts disabled, Hyper-Threading turned off). Is there a way to make it faster?
For comparison, I tried to implement the same functionality using single-stepping on branches. By my measurements, it takes over 1000 clock ticks just to invoke (and return from) the debug exception handler, and about as many again for the 3 necessary MSR accesses (in total, about 4 times slower than with BTS). Why are these apparently simple operations so slow?
Other questions concerning BTS:
Does it make sense to use the WC memory type for large BTS buffers in order to avoid cache pollution?
The CPUID instruction reference mentions the CPL Qualified Debug Store feature (DS-CPL flag). Is it implemented on any processors?
Yes, we acknowledge that BTS significantly reduces the performance of the processor, and there is no meaningful way to speed it up. Implementing the feature in a speculative, out-of-order execution processor requires clearing the pipeline on every taken branch and then draining the processor's memory subsystem to ensure correct ordering of memory store events.
You can use WC memory as long as you adhere to the alignment and presence rules. You would have to ensure that the processor has been serialized prior to reading the BTS buffer in that case.
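To illustrate what "reading the BTS buffer" involves once the processor has been serialized, here is a minimal user-space sketch in C that works on a snapshot of the buffer. It assumes the 32-bit DS format (12-byte records holding the branch source, branch target, and a flags doubleword, with the "predicted" flag taken to be bit 4); check these details against the manual for your processor. The struct and function names (bts_record32, ds_bts32, bts_records_written, bts_count_target, bts_predicted) are made up for this example. It does not touch any MSRs: on real hardware, ring-0 code would first stop tracing and execute a serializing instruction such as CPUID before copying the records out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One BTS record, assuming the 32-bit DS format: three doublewords. */
struct bts_record32 {
    uint32_t from;   /* linear address of the branch instruction */
    uint32_t to;     /* linear address of the branch target */
    uint32_t flags;  /* assumed: bit 4 set if the branch was predicted */
};
static_assert(sizeof(struct bts_record32) == 12, "record must be 12 bytes");

/* BTS fields of the DS buffer management area (assumed 32-bit layout). */
struct ds_bts32 {
    uint32_t bts_base;           /* linear address of the first record */
    uint32_t bts_index;          /* linear address of the next record to write */
    uint32_t bts_abs_max;        /* linear address just past the buffer end */
    uint32_t bts_int_threshold;  /* index value at which an interrupt is raised */
};

/* Number of complete records currently in the buffer, or 0 if the
 * management fields are inconsistent. */
size_t bts_records_written(const struct ds_bts32 *ds)
{
    if (ds->bts_index < ds->bts_base || ds->bts_index > ds->bts_abs_max)
        return 0;
    return (ds->bts_index - ds->bts_base) / sizeof(struct bts_record32);
}

/* Returns 1 if the record's (assumed) "branch predicted" flag is set. */
int bts_predicted(const struct bts_record32 *rec)
{
    return (rec->flags >> 4) & 1;
}

/* Count the records in a snapshot whose branch target equals `target`. */
size_t bts_count_target(const struct bts_record32 *recs, size_t n,
                        uint32_t target)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (recs[i].to == target)
            hits++;
    }
    return hits;
}
```

For the "loop $" test, every taken iteration should show up as a record whose source and target addresses are the same, so bts_count_target called with the address of the loop instruction gives the number of traced iterations in the snapshot.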
Message Edited by intel.software.network.support on 12-02-2005 01:19 PM