<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Yeah, so the issue is that in Analyzers</title>
    <link>https://community.intel.com/t5/Analyzers/how-can-Vtune-record-the-asm-code-s-cpu-time/m-p/943138#M7685</link>
    <description>&lt;P&gt;Yeah, so the issue is that VTune Amplifier XE doesn't actually "record the asm code's cpu time." &amp;nbsp;What it does is sample where the code is executing and, based on the number of samples, estimate the percentage of time spent on that line or instruction. &amp;nbsp;With Hotspots, a timer is used; with event-based sampling (EBS), the processor is programmed to interrupt execution and record the location of execution.&lt;/P&gt;
&lt;P&gt;You might also review the documentation for "event skid", which can affect how you read EBS results with respect to asm code.&lt;/P&gt;</description>
    <pubDate>Tue, 30 Apr 2013 23:17:03 GMT</pubDate>
    <dc:creator>David_A_Intel1</dc:creator>
    <dc:date>2013-04-30T23:17:03Z</dc:date>
    <item>
      <title>how can Vtune record the asm code's cpu time</title>
      <link>https://community.intel.com/t5/Analyzers/how-can-Vtune-record-the-asm-code-s-cpu-time/m-p/943136#M7683</link>
      <description>&lt;P&gt;Hello, everyone&lt;/P&gt;
&lt;P&gt;&amp;nbsp; Recently I have been using VTune to profile my BSDE code in Hotspots mode, and I have found some interesting things.&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;int a,a1,a2,a3;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;float trans[4];&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; _mm_store_ps(trans,a_sse);&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;Below are the four lines of code in question:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&amp;nbsp; a= (int)*(trans);&lt;/LI&gt;
&lt;LI&gt;&amp;nbsp; a1= (int)*(trans+1);&lt;/LI&gt;
&lt;LI&gt;&amp;nbsp; a2= (int)(trans[2]);&lt;/LI&gt;
&lt;LI&gt;&amp;nbsp; a3 = (int )trans[3];&amp;nbsp;&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; Compiled with gcc at -O0, the time attributed to each line is as follows:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;a= (int)*(trans); 9.319s&lt;/LI&gt;
&lt;LI&gt;a1= (int)*(trans+1); 1.970s&lt;/LI&gt;
&lt;LI&gt;a2= (int)(trans[2]); 1.020s&lt;/LI&gt;
&lt;LI&gt;a3 = (int )trans[3]; 2.130s&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;To find the hotspot, I opened the asm view. Take the asm for line 1 and line 2 as an example:&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;0x4055d2 &amp;nbsp;1 &amp;nbsp; &amp;nbsp; &amp;nbsp;movq -0xb8(%rbp), %rax 1.361s&lt;BR /&gt;0x4055d9 &amp;nbsp;1 &amp;nbsp; &amp;nbsp; &amp;nbsp;movssl (%rax), %xmm0 &lt;BR /&gt;0x4055dd &amp;nbsp;1 &amp;nbsp; &amp;nbsp; &amp;nbsp;cvttss2si %xmm0, %eax 4.238s&lt;BR /&gt;0x4055e1 &amp;nbsp;1 &amp;nbsp; &amp;nbsp; &amp;nbsp;movl %eax, -0xcc(%rbp) 3.720s&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;0x4055e7 &amp;nbsp;2 &amp;nbsp; &amp;nbsp;movq -0xb8(%rbp), %rax 1.000s&lt;BR /&gt;0x4055ee &amp;nbsp;2 &amp;nbsp; &amp;nbsp;add $0x4, %rax &lt;BR /&gt;0x4055f2 &amp;nbsp; 2 &amp;nbsp; &amp;nbsp;movssl (%rax), %xmm0 &lt;BR /&gt;0x4055f6 &amp;nbsp; 2 &amp;nbsp; &amp;nbsp;cvttss2si %xmm0, %eax 0.100s&lt;BR /&gt;0x4055fa &amp;nbsp; 2 &amp;nbsp; &amp;nbsp;movl %eax, -0xd0(%rbp) 0.870s&lt;/P&gt;
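&lt;P&gt;To make the distribution within line 1 easier to see, here is a small sketch that turns the per-instruction times from the listing above into shares of the line total. The numbers are copied from the listing; the variable names are only illustrative, and the blank time on movssl is entered as 0.0, which is an assumption about what an empty cell means.&lt;/P&gt;

```python
# (mnemonic, CPU time in seconds) for source line 1, copied from the -O0 listing.
# movssl shows no time in the listing; it is entered as 0.0 here (an assumption).
line1 = [
    ("movq", 1.361),
    ("movssl", 0.0),
    ("cvttss2si", 4.238),
    ("movl", 3.720),
]

# Total time attributed to the source line is the sum over its instructions.
total = sum(seconds for _name, seconds in line1)

# Share of the line total carried by each instruction, in percent.
shares = {name: round(100 * seconds / total, 1) for name, seconds in line1}

print(round(total, 3))
print(shares)
```

&lt;P&gt;The instruction times add up to the 9.319s reported for the source line, with cvttss2si and movl together carrying most of it.&lt;/P&gt;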
&lt;P&gt;If the time for the first asm instruction "0x4055d2 1 movq -0xb8(%rbp), %rax 1.361s" were much larger than for "0x4055e7 2 movq -0xb8(%rbp), %rax 1.000s", I could understand it: the first load takes the cache miss, which benefits the second one. But, as you can see, the main difference is in "cvttss2si" and "movl". I don't know what causes such a big difference.&lt;/P&gt;
&lt;P&gt;If I swap the order of "a=*" and "a3=*" in the C code, "a3=*" costs more time than "a=*".&lt;/P&gt;
&lt;P&gt;To make things more interesting, I compiled the code with -O3. Here is the time attributed to each line:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;a= (int)*(trans); 1.080s&lt;/LI&gt;
&lt;LI&gt;a1= (int)*(trans+1); 1.191s&lt;/LI&gt;
&lt;LI&gt;a2= (int)(trans[2]); 0.520s&lt;/LI&gt;
&lt;LI&gt;a3 = (int )trans[3]; 5.900s&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;The last line costs the most time now. I compared the asm code of the last two lines below.&lt;/P&gt;
&lt;P&gt;Address Line Assembly CPU Time&lt;BR /&gt;0x4036b0 &amp;nbsp;3 &amp;nbsp;movssl 0x8(%rcx), %xmm4 0.200s&lt;BR /&gt;0x4036b5 &amp;nbsp;4 &amp;nbsp;movssl 0xc(%rcx), %xmm5 2.900s&lt;BR /&gt;0x4036ba &amp;nbsp;3 &amp;nbsp;cvttss2si %xmm4, %edi 0.320s&lt;BR /&gt;0x4036be &amp;nbsp;4 &amp;nbsp;cvttss2si %xmm5, %r10d 3.000s&lt;/P&gt;
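&lt;P&gt;Because the instructions for lines 3 and 4 are interleaved here, the per-line figures have to be collected by source line number rather than by position. A minimal sketch of that grouping, using the four rows of the listing above (the variable names are illustrative only):&lt;/P&gt;

```python
from collections import defaultdict

# (address, source line, CPU time in seconds) rows copied from the -O3 listing.
rows = [
    (0x4036b0, 3, 0.200),
    (0x4036b5, 4, 2.900),
    (0x4036ba, 3, 0.320),
    (0x4036be, 4, 3.000),
]

# Accumulate instruction times under the source line each one belongs to.
per_line = defaultdict(float)
for _address, line, seconds in rows:
    per_line[line] += seconds

totals = {line: round(seconds, 3) for line, seconds in sorted(per_line.items())}
print(totals)
```

&lt;P&gt;The sums match the 0.520s and 5.900s shown for lines 3 and 4 in the -O3 table.&lt;/P&gt;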
&lt;P&gt;With -O3, gcc placed the asm code for lines 3 and 4 ahead of that for lines 1 and 2. Why does gcc believe this saves time? I understand that putting the two "movssl" instructions together can help the pipeline, but why do the second "movssl" and "cvttss2si" cost more time than the first? How does VTune record the asm code's CPU time? Is it correct?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Thank you for your help!&lt;/P&gt;</description>
      <pubDate>Sat, 27 Apr 2013 08:19:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/how-can-Vtune-record-the-asm-code-s-cpu-time/m-p/943136#M7683</guid>
      <dc:creator>Xia_Z_</dc:creator>
      <dc:date>2013-04-27T08:19:38Z</dc:date>
    </item>
    <item>
      <title>VTune uses kernel  mode</title>
      <link>https://community.intel.com/t5/Analyzers/how-can-Vtune-record-the-asm-code-s-cpu-time/m-p/943137#M7684</link>
      <description>&lt;P&gt;VTune uses a kernel-mode driver to read MSR registers, and maybe the HPET timer as well. As for measuring machine code execution time, maybe some kind of instrumentation code (such as rdtsc instructions) is injected into the profiled application's address space.&lt;/P&gt;</description>
      <pubDate>Mon, 29 Apr 2013 20:18:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/how-can-Vtune-record-the-asm-code-s-cpu-time/m-p/943137#M7684</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-04-29T20:18:23Z</dc:date>
    </item>
    <item>
      <title>Yeah, so the issue is that</title>
      <link>https://community.intel.com/t5/Analyzers/how-can-Vtune-record-the-asm-code-s-cpu-time/m-p/943138#M7685</link>
      <description>&lt;P&gt;Yeah, so the issue is that VTune Amplifier XE doesn't actually "record the asm code's cpu time." &amp;nbsp;What it does is sample where the code is executing and, based on the number of samples, estimate the percentage of time spent on that line or instruction. &amp;nbsp;With Hotspots, a timer is used; with event-based sampling (EBS), the processor is programmed to interrupt execution and record the location of execution.&lt;/P&gt;
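&lt;P&gt;As a rough illustration of that estimate (not VTune's actual implementation), the sketch below counts how often each address shows up in a list of sampled instruction pointers and multiplies by the sampling interval. The function name, the 10 ms interval, and the sample counts are all invented for the example; only the addresses are borrowed from the original question.&lt;/P&gt;

```python
from collections import Counter


def estimate_times(sampled_addresses, interval_s):
    """Estimate per-address CPU time from periodic instruction-pointer samples."""
    counts = Counter(sampled_addresses)
    # Each sample stands in for one full sampling interval of execution time.
    return {addr: n * interval_s for addr, n in counts.items()}


# Hypothetical run: one sample every 10 ms, with made-up sample counts.
samples = [0x4055dd] * 4 + [0x4055e1] * 3 + [0x4055d2]
times = estimate_times(samples, 0.010)

for addr in sorted(times):
    print(hex(addr), round(times[addr], 3))
```

&lt;P&gt;An instruction that never happens to be the recorded location when a sample fires gets no time in such a report, even though it did execute.&lt;/P&gt;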
&lt;P&gt;You might also review the documentation for "event skid", which can affect how you read EBS results with respect to asm code.&lt;/P&gt;</description>
      <pubDate>Tue, 30 Apr 2013 23:17:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/how-can-Vtune-record-the-asm-code-s-cpu-time/m-p/943138#M7685</guid>
      <dc:creator>David_A_Intel1</dc:creator>
      <dc:date>2013-04-30T23:17:03Z</dc:date>
    </item>
    <item>
      <title>Hi MrAnderson,</title>
      <link>https://community.intel.com/t5/Analyzers/how-can-Vtune-record-the-asm-code-s-cpu-time/m-p/943139#M7686</link>
      <description>&lt;P&gt;Hi MrAnderson,&lt;/P&gt;
&lt;P&gt;You explained in your post that VTune actually samples to estimate the percentage of time spent executing code. I suppose that for Hotspots analysis it would be possible to inject machine-code instructions such as rdtsc into the thread's address space with the CreateRemoteThread and WriteProcessMemory functions, but I have never tested it.&lt;/P&gt;</description>
      <pubDate>Wed, 01 May 2013 09:23:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/how-can-Vtune-record-the-asm-code-s-cpu-time/m-p/943139#M7686</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-05-01T09:23:40Z</dc:date>
    </item>
    <item>
      <title>I don't think Vtune use rdtsc</title>
      <link>https://community.intel.com/t5/Analyzers/how-can-Vtune-record-the-asm-code-s-cpu-time/m-p/943140#M7687</link>
      <description>&lt;P&gt;I don't think VTune uses rdtsc for this. I think it uses some registers that can be read as MSRs on Linux. You can program an event such as CPU_CLK_UNHALTED.REF, and the counter then counts reference clock cycles while the core is not halted.&lt;/P&gt;</description>
      <pubDate>Wed, 01 May 2013 12:19:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/how-can-Vtune-record-the-asm-code-s-cpu-time/m-p/943140#M7687</guid>
      <dc:creator>Xia_Z_</dc:creator>
      <dc:date>2013-05-01T12:19:29Z</dc:date>
    </item>
    <item>
      <title>Quote:Xia Z. wrote:</title>
      <link>https://community.intel.com/t5/Analyzers/how-can-Vtune-record-the-asm-code-s-cpu-time/m-p/943141#M7688</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Xia Z. wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I don't think VTune uses rdtsc for this. I think it uses some registers that can be read as MSRs on Linux. You can program an event such as CPU_CLK_UNHALTED.REF, and the counter then counts reference clock cycles while the core is not halted.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Yes, I know this. I was simply thinking about some kind of code injection into the profiled process, but it seems that there is too much programming overhead.&lt;/P&gt;
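&lt;P&gt;The counter mentioned in the quote above reports reference cycles, not seconds, so a count has to be divided by the reference clock frequency to become a time. A minimal sketch with hypothetical values (the 2.4 GHz frequency, the two counter readings, and the function name are all made up for illustration):&lt;/P&gt;

```python
def cycles_to_seconds(ref_cycles, ref_hz):
    """Convert a count of reference clock cycles into seconds."""
    return ref_cycles / ref_hz


# Hypothetical values: a 2.4 GHz reference clock and two made-up counter readings.
REF_HZ = 2_400_000_000
old_value = 1_000_000_000
new_value = 5_800_000_000

# The difference between the two readings is the cycle count for the region.
elapsed = cycles_to_seconds(new_value - old_value, REF_HZ)
print(elapsed)
```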
&lt;P&gt;The better option is to use the CPU_CLK_UNHALTED.REF count divided by the reference clock frequency to track the time spent in various portions of code.&lt;/P&gt;</description>
      <pubDate>Wed, 01 May 2013 13:00:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/how-can-Vtune-record-the-asm-code-s-cpu-time/m-p/943141#M7688</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-05-01T13:00:28Z</dc:date>
    </item>
    <item>
      <title>I can understand vtune use</title>
      <link>https://community.intel.com/t5/Analyzers/how-can-Vtune-record-the-asm-code-s-cpu-time/m-p/943142#M7689</link>
      <description>&lt;P&gt;I can understand VTune using registers to record clock cycles between some lines of code: record old_value at the beginning of the code segment, record new_value at the end, and take (new_value - old_value) as the result. But how can VTune record cycles per asm line? I am confused by it.&lt;/P&gt;</description>
      <pubDate>Sun, 05 May 2013 11:33:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/how-can-Vtune-record-the-asm-code-s-cpu-time/m-p/943142#M7689</guid>
      <dc:creator>Xia_Z_</dc:creator>
      <dc:date>2013-05-05T11:33:21Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;but how can vtune record</title>
      <link>https://community.intel.com/t5/Analyzers/how-can-Vtune-record-the-asm-code-s-cpu-time/m-p/943143#M7690</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;but how can vtune record cycles per asm line cost? I am confused by it.&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;By using CPU_CLK_UNHALTED.REF.&lt;/P&gt;</description>
      <pubDate>Mon, 06 May 2013 17:29:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/how-can-Vtune-record-the-asm-code-s-cpu-time/m-p/943143#M7690</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-05-06T17:29:20Z</dc:date>
    </item>
  </channel>
</rss>

