<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Branch Trace Store in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Branch-Trace-Store/m-p/806099#M807</link>
    <description>Thanks, Pat. Hope you'll find somebody.&lt;BR /&gt;&lt;BR /&gt;And one more question about BTS.&lt;BR /&gt;Nehalem and newer CPUs have 16 pairs of LBR MSRs while Core 2 has only 4. Does that mean BTS performance on Nehalem will be almost 4 times higher?</description>
    <pubDate>Sat, 09 Jun 2012 01:39:39 GMT</pubDate>
    <dc:creator>q1nex</dc:creator>
    <dc:date>2012-06-09T01:39:39Z</dc:date>
    <item>
      <title>Branch Trace Store</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Branch-Trace-Store/m-p/806097#M805</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;I need some help enabling BTS.&lt;BR /&gt;My CPU is a Core 2 Duo E8400; the OS is Windows XP SP3 32-bit.&lt;BR /&gt;I've reread the manual and rechecked everything over 9000 times. Everything seems to be correct.&lt;BR /&gt;This is what I checked:&lt;BR /&gt;&lt;BR /&gt;CPUID.1:EDX[21] = 1&lt;BR /&gt;IA32_MISC_ENABLE[7] (Performance Monitoring Available) = 1: performance monitoring is enabled.&lt;BR /&gt;IA32_MISC_ENABLE[11] (Branch Trace Storage Unavailable) = 0: BTS is supported.&lt;BR /&gt;IA32_APIC_BASE[11] (APIC global enable/disable flag) = 1: APIC is enabled.&lt;BR /&gt;Spurious Interrupt Vector Register, bit 8 = 1: APIC is enabled.&lt;BR /&gt;Error Status Register = 0: there are no APIC errors.&lt;BR /&gt;&lt;BR /&gt;The DS area is created; the IA32_DS_AREA MSR, the IA32_DEBUGCTL MSR and the LVT Performance Counter Register are set; and the PMI handler is installed in the IDT. The base address of the APIC registers is also checked. For the DS area I tried both memory reserved in the driver image and memory allocated with ExAllocatePool().&lt;BR /&gt;&lt;BR /&gt;When BTS is enabled, the core on which it was enabled slows down, but no records appear in the BTS buffer.&lt;BR /&gt;Here is some code in fasm (simplified parts).&lt;BR /&gt;DS structure:&lt;BR /&gt;[plain]BTS_entries_num = 330
reserved_BTS_entries_num = 80


struct Branch_Record

Branch_From dd ?
Branch_To dd ?
Branch_Predicted dd ?

ends


struct DS
;-------BTS buffer base
BTS_buffer_base dd DS.BTS_buffer

;-------BTS index
BTS_index dd DS.BTS_buffer

;-------BTS absolute maximum
BTS_max dd BTS_entries_num *12 +DS.BTS_buffer

;-------BTS interrupt threshold
BTS_int dd (BTS_entries_num - reserved_BTS_entries_num) *12 +DS.BTS_buffer


;-------PEBS save area
dd 6 dup ?
ld dd ?

av = 128 - ((DS.ld+4) mod 128)	;bytes to align
db av dup ?			;align 128


BTS_buffer Branch_Record.dup BTS_entries_num	;BTS_entries_num of Branch_Record struct
ends[/plain] DS initialization:&lt;BR /&gt;[plain];----allocating memory
push sizeof.DS
push NonPagedPoolCacheAligned
call [ExAllocatePool]
mov [DS_addr], eax


;----DS initialization
lea ebx, [eax+DS.BTS_buffer]
mov [eax+DS.BTS_buffer_base], ebx
mov [eax+DS.BTS_index], ebx

add ebx, BTS_entries_num *sizeof.Branch_Record
mov [eax+DS.BTS_max], ebx

sub ebx, reserved_BTS_entries_num *sizeof.Branch_Record
mov [eax+DS.BTS_int], ebx


;----clearing memory
mov edi, [DS_addr]
add edi, DS.BTS_buffer
mov ecx, BTS_entries_num*sizeof.Branch_Record
xor eax, eax
cld
rep stosb[/plain] Setting LVT and IDT:&lt;BR /&gt;[bash]vec_num = 24h

;		                             fixed       edge sensitive   not masked
mov dword [0FFFE0340h], vec_num or (000b shl 8) or (0b shl 15) or (0b shl 16)


push esi
sidt [esp-2]
pop esi
add esi, vec_num*8

mov eax, IntHandler

mov word [esi], ax
bswap eax
xchg al, ah
mov word [esi+6], ax
mov ax, cs
mov word [esi+2], ax
mov byte [esi+4], 0
mov byte [esi+5], 10001111b[/bash] BTS enabling:&lt;BR /&gt;[plain]mov ecx, 600h	;IA32_DS_AREA
rdmsr
mov eax, [DS_addr]
wrmsr

mov ecx, 01D9h	;IA32_DEBUGCTL
rdmsr
;	        TR	         BTS	     BTINT	    BTS_OFF_OS   BTS_OFF_USR
mov eax, (1 shl 6) or (1 shl 7) or (1 shl 8) or (1 shl 9) or (0 shl 10)
wrmsr[/plain] &lt;BR /&gt;I also have some questions (the manual doesn't give CLEAR answers to them):&lt;BR /&gt; 1. Can the DS area be on the same page as code (if triggering self-modifying-code handling doesn't worry me)?&lt;BR /&gt;&lt;BR /&gt; 2. (from the manual) "The DS save area can be larger than a page, but the pages must be mapped to contiguous linear addresses."&lt;BR /&gt;&lt;BR /&gt;Does this mean that all 3 DS areas must be in pages that are contiguous in LINEAR address space, or that the pages holding the DS area must be MAPPED to contiguous PHYSICAL addresses? After all, pages are mapped to physical addresses rather than linear ones...&lt;BR /&gt;&lt;BR /&gt; 3. (from the manual) "In order to prevent generating an interrupt, when working with circular BTS buffer, SW need to set BTS interrupt threshold to a value greater than BTS absolute maximum (fields of the DS buffer management area). It's not enough to clear the BTINT flag itself only."&lt;BR /&gt;&lt;BR /&gt;In other words, BTINT doesn't control PMIs. So what is the purpose of BTINT?&lt;BR /&gt;&lt;BR /&gt; 4. Can APIC registers be accessed only with mov, or are other instructions (and, or, etc.) acceptable?&lt;BR /&gt;&lt;BR /&gt;P.S. Working code in any language (asm is preferred) would be useful.&lt;BR /&gt;&lt;BR /&gt;Thanks,&lt;BR /&gt;q1nex</description>
      <pubDate>Fri, 08 Jun 2012 01:27:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Branch-Trace-Store/m-p/806097#M805</guid>
      <dc:creator>q1nex</dc:creator>
      <dc:date>2012-06-08T01:27:23Z</dc:date>
    </item>
    <item>
      <title>Branch Trace Store</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Branch-Trace-Store/m-p/806098#M806</link>
      <description>Hello q1nex,&lt;BR /&gt;I'm trying to find someone who can answer your questions.&lt;BR /&gt;Pat</description>
      <pubDate>Fri, 08 Jun 2012 19:21:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Branch-Trace-Store/m-p/806098#M806</guid>
      <dc:creator>Patrick_F_Intel1</dc:creator>
      <dc:date>2012-06-08T19:21:28Z</dc:date>
    </item>
    <item>
      <title>Branch Trace Store</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Branch-Trace-Store/m-p/806099#M807</link>
      <description>Thanks, Pat. Hope you'll find somebody.&lt;BR /&gt;&lt;BR /&gt;And one more question about BTS.&lt;BR /&gt;Nehalem and newer CPUs have 16 pairs of LBR MSRs while Core 2 has only 4. Does that mean BTS performance on Nehalem will be almost 4 times higher?</description>
      <pubDate>Sat, 09 Jun 2012 01:39:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Branch-Trace-Store/m-p/806099#M807</guid>
      <dc:creator>q1nex</dc:creator>
      <dc:date>2012-06-09T01:39:39Z</dc:date>
    </item>
    <item>
      <title>Branch Trace Store</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Branch-Trace-Store/m-p/806100#M808</link>
      <description>Hello q1nex,&lt;BR /&gt;Here is the reply from our BTS guy (who was on vacation).&lt;BR /&gt;Note that the LBR facility is much faster than (and quite different from) the BTS facility.&lt;BR /&gt;&lt;BR /&gt;[bash]The problem is that all BTS structures are 64-bit (even in the 32-bit mode) starting from Merom (family 6, model 15), so all pointers in the asm control structures should be declared as DQ instead of DD:
struct Branch_Record

Branch_From dq ?
Branch_To dq ?
Branch_Predicted dq ?

ends


struct DS
;-------BTS buffer base
BTS_buffer_base dq DS.BTS_buffer

;-------BTS index
BTS_index dq DS.BTS_buffer

;-------BTS absolute maximum (each record is now 3 qwords = 24 bytes, not 12)
BTS_max dq BTS_entries_num *sizeof.Branch_Record +DS.BTS_buffer

;-------BTS interrupt threshold
BTS_int dq (BTS_entries_num - reserved_BTS_entries_num) *sizeof.Branch_Record +DS.BTS_buffer


;-------PEBS save area
dd 6 dup ?
ld dd ?

av = 128 - ((DS.ld+4) mod 128)	;bytes to align
db av dup ?			;align 128


BTS_buffer Branch_Record.dup BTS_entries_num	;BTS_entries_num of Branch_Record struct
ends
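
; A minimal sketch, assuming the 64-bit DS layout above and the DS_addr
; variable from the question: in 32-bit mode each 64-bit pointer field can
; be filled by storing the linear address in the low dword and zero in the
; high dword. Illustrative only; not tested on hardware.
mov eax, [DS_addr]
lea ebx, [eax+DS.BTS_buffer]
mov dword [eax+DS.BTS_buffer_base], ebx
mov dword [eax+DS.BTS_buffer_base+4], 0
mov dword [eax+DS.BTS_index], ebx
mov dword [eax+DS.BTS_index+4], 0

add ebx, BTS_entries_num *sizeof.Branch_Record	;24 bytes per record
mov dword [eax+DS.BTS_max], ebx
mov dword [eax+DS.BTS_max+4], 0

sub ebx, reserved_BTS_entries_num *sizeof.Branch_Record
mov dword [eax+DS.BTS_int], ebx
mov dword [eax+DS.BTS_int+4], 0
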
And to the other questions:

1.	Can the DS area be on the same page as code (if triggering self-modifying-code handling doesn't worry me)?
Never checked it, but I can see no problem here.

2.	(from the manual) "The DS save area can be larger than a page, but the pages must be mapped to contiguous linear addresses."

Does this mean that all 3 DS areas must be in pages that are contiguous in LINEAR address space, or that the pages holding the DS area must be MAPPED to contiguous PHYSICAL addresses? After all, pages are mapped to physical addresses rather than linear ones...
The pages must be contiguous in linear address space.

3.	(from the manual) "In order to prevent generating an interrupt, when working with circular BTS buffer, SW need to set BTS interrupt threshold to a value greater than BTS absolute maximum (fields of the DS buffer management area). It's not enough to clear the BTINT flag itself only."

In other words, BTINT doesn't control PMIs. So what is the purpose of BTINT?
BTINT controls interrupt generation: if it is 0, no interrupt is generated. BTINT and the threshold together determine how the buffer operates:
- BTINT = 0 and Threshold &amp;gt; max_size: the buffer is circular.
- BTINT = 1 and Threshold &amp;lt; max_size: the buffer is non-circular and generates a PMI.
- BTINT = 0 and Threshold &amp;lt; max_size: the buffer is non-circular and does not generate a PMI.

4.	Can APIC registers be accessed only with mov, or are other instructions (and, or, etc.) acceptable?
APIC registers can be accessed using any instruction, but one has to take various side effects into account. For instance, an AND instruction emits both load and store uops, whereas MOV instructions are more predictable, which is why they are recommended for use with the APIC.
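
; A minimal sketch of the circular-buffer case from answer 3, assuming the
; 64-bit DS layout and the DS_addr variable used earlier in the thread:
; move the threshold past the absolute maximum and clear BTINT.
; Illustrative only; not tested on hardware.
mov eax, [DS_addr]
mov ebx, dword [eax+DS.BTS_max]
add ebx, sizeof.Branch_Record	;any value above BTS_max
mov dword [eax+DS.BTS_int], ebx
mov dword [eax+DS.BTS_int+4], 0

mov ecx, 01D9h	;IA32_DEBUGCTL
rdmsr
and eax, not (1 shl 8)	;BTINT = 0
or eax, (1 shl 6) or (1 shl 7)	;TR = 1, BTS = 1
wrmsr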

[/bash]&lt;BR /&gt;</description>
      <pubDate>Wed, 20 Jun 2012 14:44:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Branch-Trace-Store/m-p/806100#M808</guid>
      <dc:creator>Patrick_F_Intel1</dc:creator>
      <dc:date>2012-06-20T14:44:57Z</dc:date>
    </item>
  </channel>
</rss>

