<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic @developer in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986379#M3272</link>
    <description>&lt;P&gt;@developer&lt;/P&gt;

&lt;P&gt;If you cannot post the disassembly, maybe you can track down the memory operations that used the r14 register before the function prologue yourself and post your findings. I suspect there could be some kind of WAR (write-after-read) issue.&lt;/P&gt;</description>
    <pubDate>Tue, 19 Nov 2013 17:37:38 GMT</pubDate>
    <dc:creator>Bernard</dc:creator>
    <dc:date>2013-11-19T17:37:38Z</dc:date>
    <item>
      <title>function prologue consumes many cycles</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986366#M3259</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I'm profiling my application (Linux x86_64, Sandy Bridge, using&amp;nbsp;perf), and one of its functions takes about 30% of the runtime. Within that function, saving registers to the stack appears to consume most of the time. This is the perf annotation of the function start:&lt;/P&gt;

&lt;P&gt;&amp;nbsp; 0.39 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;push &amp;nbsp; %r15&lt;BR /&gt;
	&amp;nbsp;95.35 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;push &amp;nbsp; %r14&lt;BR /&gt;
	&amp;nbsp; 1.27 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;push &amp;nbsp; %r13&lt;BR /&gt;
	&amp;nbsp; 0.16 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;push &amp;nbsp; %r12&lt;BR /&gt;
	&amp;nbsp; 0.15 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;mov &amp;nbsp; &amp;nbsp;%rdi,%r12&lt;BR /&gt;
	&amp;nbsp; 0.25 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;push &amp;nbsp; %rbp&lt;BR /&gt;
	&amp;nbsp; 0.14 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;mov &amp;nbsp; &amp;nbsp;%rsi,%rbp&lt;BR /&gt;
	&amp;nbsp; 0.12 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;push &amp;nbsp; %rbx&lt;BR /&gt;
	&amp;nbsp; 0.18 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;sub &amp;nbsp; &amp;nbsp;$0x8,%rsp&lt;/P&gt;

&lt;P&gt;I tried the RESOURCE_STALLS counters, and RESOURCE_STALLS.SB (stalls because no store buffer is available) is also high on this instruction. I did not see a high cache-miss count on it. What can I do to continue investigating this? Are there additional counters I should examine?&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Thanks&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 18 Nov 2013 16:18:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986366#M3259</guid>
      <dc:creator>developer1</dc:creator>
      <dc:date>2013-11-18T16:18:33Z</dc:date>
    </item>
    <item>
      <title>Can you profile with VTune or</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986367#M3260</link>
      <description>&lt;P&gt;Can you profile with VTune, or post the VTune front-end and back-end pipeline stall analysis?&lt;/P&gt;

&lt;P&gt;Can you also post the disassembly of that function? Regarding the push %r14 instruction, it would be interesting to know what is being pushed onto the stack. It seems that about 95% of the prologue's samples are spent waiting for a resource to become available.&lt;/P&gt;</description>
      <pubDate>Mon, 18 Nov 2013 17:28:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986367#M3260</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-11-18T17:28:18Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986368#M3261</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I've uploaded the VTune sampling results. I hope these are the correct counters; I selected General Exploration. I see there are first-level TLB misses, but other places in the code have even more dTLB misses and take far fewer cycles.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;--Yossi&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 18 Nov 2013 18:20:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986368#M3261</guid>
      <dc:creator>developer1</dc:creator>
      <dc:date>2013-11-18T18:20:20Z</dc:date>
    </item>
    <item>
      <title>In the absence of further</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986369#M3262</link>
      <description>&lt;P&gt;In the absence of further evidence, I'd guess the time may be associated with allocating a fill buffer, due to previous code having left the buffers full of data pending flush to L1, as might happen when writing to multiple cache lines in a loop.&lt;/P&gt;

&lt;P&gt;If that is so, forcing the function to be inlined, or at least letting it benefit from some interprocedural optimization (IPO), can sometimes alleviate it.&lt;/P&gt;</description>
      <pubDate>Mon, 18 Nov 2013 19:41:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986369#M3262</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2013-11-18T19:41:30Z</dc:date>
    </item>
    <item>
      <title>thanks for the reply.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986370#M3263</link>
      <description>Thanks for the reply.
Is there a performance counter that could indicate this situation, or show which code stores to too many cache lines in L1?</description>
      <pubDate>Mon, 18 Nov 2013 20:18:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986370#M3263</guid>
      <dc:creator>developer1</dc:creator>
      <dc:date>2013-11-18T20:18:39Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...In the function, looks</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986371#M3264</link>
      <description>&amp;gt;&amp;gt;...In the function, looks like saving registers to the stack consumes the most time...

Try passing a structure of parameters to the function; in that case only one parameter will be saved on the stack.

I have no idea what is wrong with your code, but it is possible that there is a problem with the stack alignment.</description>
      <pubDate>Tue, 19 Nov 2013 02:42:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986371#M3264</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-11-19T02:42:24Z</dc:date>
    </item>
    <item>
      <title>Can you post functions call</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986372#M3265</link>
      <description>&lt;P&gt;Can you post the function's call stack? Does your function receive any input? Such a great number of RESOURCE_STALLS.ANY could indicate that the pipeline is stalled, for example by earlier dependent store instructions or by a branch misprediction. Can you perform a front-end pipeline stall analysis and post the results?&lt;/P&gt;</description>
      <pubDate>Tue, 19 Nov 2013 09:46:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986372#M3265</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-11-19T09:46:59Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986373#M3266</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;The function has no input, and it is called from main() after other functions have been called.&lt;/P&gt;

&lt;P&gt;Front-end analysis showed no branch mispredictions (BR_MISP_RETIRED.* and BACLEARS.ANY were 0). However, I see a very large count of LD_BLOCKS_PARTIAL.ADDRESS_ALIAS, about 45% of the total program count (see attachment). Does it mean that another instruction is trying to load/store from the same page offset? How can I find and eliminate the conflict?&lt;/P&gt;
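&lt;P&gt;For reference, my reading of "address alias" is the usual 4 KB aliasing case: a load and an earlier store that go to different addresses sharing the same offset within a 4 KB page. Below is a minimal sketch of that condition, assuming this interpretation is right. The name same_page_offset is made up for illustration, and real hardware also takes the access sizes into account, which this ignores:&lt;/P&gt;

&lt;PRE&gt;
```c
/* Simplified 4 KB aliasing check (hypothetical helper, not from any
 * library): two different addresses alias when they have the same
 * offset inside a 4096-byte page, i.e. the same low 12 bits.
 * Access sizes and partial overlaps are deliberately ignored. */
static int same_page_offset(unsigned long store_addr, unsigned long load_addr)
{
    if (store_addr == load_addr)
        return 0;   /* same address: a true dependency, not an alias */
    return store_addr % 4096UL == load_addr % 4096UL;
}
```
&lt;/PRE&gt;

&lt;P&gt;With this, a stack slot at 0x7ffd1230 and a buffer at 0x601230 would be flagged, since both end in 0x230.&lt;/P&gt;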

&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Tue, 19 Nov 2013 13:36:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986373#M3266</guid>
      <dc:creator>developer1</dc:creator>
      <dc:date>2013-11-19T13:36:39Z</dc:date>
    </item>
    <item>
      <title>And this is the file..</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986374#M3267</link>
      <description>&lt;P&gt;And this is the file..&lt;/P&gt;</description>
      <pubDate>Tue, 19 Nov 2013 13:38:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986374#M3267</guid>
      <dc:creator>developer1</dc:creator>
      <dc:date>2013-11-19T13:38:47Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...Does it mean that</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986375#M3268</link>
      <description>&amp;gt;&amp;gt;...Does it mean that another instruction is trying to load/store from the same page offset?..

It is not clear, and so far I can only say that it is a really strange problem. Could you create a simple reproducer?</description>
      <pubDate>Tue, 19 Nov 2013 14:44:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986375#M3268</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-11-19T14:44:12Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...I'm profiling my</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986376#M3269</link>
      <description>&amp;gt;&amp;gt;...I'm profiling my application ( linux &lt;STRONG&gt;x86_64&lt;/STRONG&gt;...

I just realized that you have an interesting system. Is that a &lt;STRONG&gt;32-bit&lt;/STRONG&gt; operating system with &lt;STRONG&gt;64-bit&lt;/STRONG&gt; memory address space extensions?</description>
      <pubDate>Tue, 19 Nov 2013 14:58:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986376#M3269</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-11-19T14:58:03Z</dc:date>
    </item>
    <item>
      <title>By looking at description of</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986377#M3270</link>
      <description>&lt;P&gt;Judging by the description of the LD_BLOCKS_PARTIAL.ADDRESS_ALIAS event, it seems to measure false dependencies in the Memory Order Buffer. I suppose it could be related to the reordering of loads and stores. I think address aliasing has been detected in the store buffer, which prevents further in-order memory operations. It could also be what you are suggesting. It would be interesting if you could post the full disassembly; I would like to look at the previous operations that involved r14.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Nov 2013 15:18:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986377#M3270</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-11-19T15:18:08Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986378#M3271</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;Thanks for the suggestions. Unfortunately, I can't post the full disassembly due to legal issues.&lt;/P&gt;

&lt;P&gt;So I tried using gdb to track down an address that might alias my stack, but had no luck. Then I took the advice of TimP and ilyapolak above and used "mfence" to narrow down the problem. Finally, I found a piece of code that performs a small PCI write followed by an sfence, and that runs before the function above is called:&lt;/P&gt;

&lt;P&gt;[cpp]&lt;/P&gt;

&lt;P&gt;write_pci(&amp;amp;pci_address, data);&lt;/P&gt;

&lt;P&gt;asm volatile ("sfence":::"memory");&lt;/P&gt;

&lt;P&gt;[/cpp]&lt;/P&gt;

&lt;P&gt;Putting an mfence *before* this sfence had almost no effect. However, putting the mfence right *after* this sfence makes the "mfence" itself consume many cycles, instead of the "push r14" that comes some time after it. Also, removing all memory serialization (sfence and mfence) improved the performance of "push r14" and of the whole application.&lt;/P&gt;
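&lt;P&gt;The kind of bisection described above can be sketched roughly as below. This is only an illustration, not the real code: full_fence_probe, suspect_code and fake_device_register are made-up names, the volatile variable merely stands in for the PCI write, and GCC-style inline assembly on x86_64 is assumed.&lt;/P&gt;

&lt;PRE&gt;
```c
/* Hypothetical stand-in for the device register behind the PCI write. */
static volatile int fake_device_register;

/* Full fence: later loads and stores wait until all earlier ones are
 * globally visible, so a stall caused by earlier stores should be
 * charged to this instruction rather than to some later store. */
static inline void full_fence_probe(void)
{
    __asm__ __volatile__ ("mfence" ::: "memory");
}

/* Stand-in for the code under suspicion: a write followed by sfence. */
static void suspect_code(int data)
{
    fake_device_register = data;
    __asm__ __volatile__ ("sfence" ::: "memory");
}

/* Runs the suspect code with the probe right after it and returns the
 * value that was written. Moving the probe call earlier or later is
 * what narrows down which code the stall belongs to. */
static int run_with_probe(int data)
{
    suspect_code(data);
    full_fence_probe();
    return fake_device_register;
}
```
&lt;/PRE&gt;

&lt;P&gt;In a perf annotation, the samples would then be expected to land on the mfence inside full_fence_probe instead of on a later push.&lt;/P&gt;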

&lt;P&gt;So my conclusion was that the PCI write is slow, and the sfence made subsequent writes wait for its completion. The unlucky instruction that filled the store buffer took the hit and actually waited for this PCI write to complete. Does that sound reasonable?&lt;/P&gt;

&lt;P&gt;Also, it looks like all performance counters related to memory stores are asynchronous, in the sense that they indicate a problem some time after its root cause, so one needs to use mfence to narrow it down. It also looks like mfence is "synchronous" while sfence/lfence are not. Is this true?&lt;/P&gt;</description>
      <pubDate>Tue, 19 Nov 2013 17:28:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986378#M3271</guid>
      <dc:creator>developer1</dc:creator>
      <dc:date>2013-11-19T17:28:00Z</dc:date>
    </item>
    <item>
      <title>@developer</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986379#M3272</link>
      <description>&lt;P&gt;@developer&lt;/P&gt;

&lt;P&gt;If you cannot post the disassembly, maybe you can track down the memory operations that used the r14 register before the function prologue yourself and post your findings. I suspect there could be some kind of WAR (write-after-read) issue.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Nov 2013 17:37:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986379#M3272</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-11-19T17:37:38Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;Also - looks like all</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986380#M3273</link>
      <description>&lt;P&gt;&lt;SPAN style="color: rgb(83, 87, 94); font-family: Arial, Helvetica, sans-serif; font-size: 12px; line-height: 14.399999618530273px; background-color: rgb(255, 255, 255);"&gt;&amp;gt;&amp;gt;&amp;gt;Also - looks like all performance counters related to memory stores as asynchronous &amp;gt;&amp;gt;&amp;gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;I think that, internally, the performance counter for a specific event is probably incremented when the microcode and internal logic detect an occurrence of that event.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Nov 2013 17:42:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986380#M3273</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-11-19T17:42:05Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...Also, removing all</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986381#M3274</link>
      <description>&amp;gt;&amp;gt;...Also, &lt;STRONG&gt;removing all memory serializations (sfence and mfence) improved the performance&lt;/STRONG&gt; of "push r14" and the whole application.

It is a very interesting result, because in several linear algebra algorithms I've implemented, using &lt;STRONG&gt;SFENCE improves&lt;/STRONG&gt; processing performance by ~5% on 32-bit and 64-bit Windows systems with &lt;STRONG&gt;Pentium 4&lt;/STRONG&gt;, &lt;STRONG&gt;Atom&lt;/STRONG&gt; and &lt;STRONG&gt;Ivy Bridge&lt;/STRONG&gt; processors. That is, we have completely different results.

Also, I've experimented with &lt;STRONG&gt;SFENCE&lt;/STRONG&gt; and figured out that it needs to be placed at the right point in the processing, and that point depends on the algorithm.</description>
      <pubDate>Wed, 20 Nov 2013 05:22:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986381#M3274</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-11-20T05:22:00Z</dc:date>
    </item>
    <item>
      <title>@Sergey</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986382#M3275</link>
      <description>&lt;P&gt;@Sergey&lt;/P&gt;

&lt;P&gt;I guess the major difference is that I have a slow PCI write, while your algorithms probably have only RAM accesses. So you are saying SFENCE provided some kind of hint to the CPU to do better memory ordering for your algorithm?&lt;/P&gt;</description>
      <pubDate>Wed, 20 Nov 2013 16:59:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986382#M3275</guid>
      <dc:creator>developer1</dc:creator>
      <dc:date>2013-11-20T16:59:58Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...So you are saying SFENCE</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986383#M3276</link>
      <description>&amp;gt;&amp;gt;...So you are saying SFENCE provided some kind of hint to the CPU to do better memory ordering for your algorithm?..

Yes, exactly!

If you're interested in completing the experiment, try a 3-loop matrix multiplication algorithm (classic form or transposed form) to verify whether it works on your computer.</description>
      <pubDate>Thu, 21 Nov 2013 05:45:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/function-prolugue-consumes-many-cycles/m-p/986383#M3276</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-11-21T05:45:37Z</dc:date>
    </item>
  </channel>
</rss>

