<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic &amp;gt;&amp;gt;...Is  that option used to in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921374#M1248</link>
    <description>&amp;gt;&amp;gt;...Is  that option used to load ebp register with arbitrary data?

This is what &lt;STRONG&gt;MSDN&lt;/STRONG&gt; says about it:

...
This option &lt;STRONG&gt;speeds function calls&lt;/STRONG&gt;, because no frame pointers need to be set up and removed. It also frees one more register, (EBP on the Intel 386 or later) for storing frequently used variables and sub-expressions.

&lt;STRONG&gt;/Oy&lt;/STRONG&gt; is only available in x86 compilers.
...</description>
    <pubDate>Thu, 20 Jun 2013 04:56:39 GMT</pubDate>
    <dc:creator>SergeyKostrov</dc:creator>
    <dc:date>2013-06-20T04:56:39Z</dc:date>
    <item>
      <title>Notes about Loop-Blocking Optimization Technique to increase performance of processing</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921361#M1235</link>
      <description>&lt;P&gt;&lt;STRONG&gt;[ Note 1 ]&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Loop-Blocking Optimization Technique&lt;/STRONG&gt; is well described in the Intel Software Development Manual and the Intel C++ compiler User and Reference Guides. After extensive testing I can say that it is very important to select the right &lt;STRONG&gt;Block Size&lt;/STRONG&gt; for the last &lt;STRONG&gt;for-loop&lt;/STRONG&gt;, and that its optimal size depends on the size of the &lt;STRONG&gt;L1&lt;/STRONG&gt; cache line of the CPU.&lt;/P&gt;
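A minimal sketch of how such a blocked loop could look for the array-addition case, assuming single-precision data. The function name add_blocked and its parameters are hypothetical ( the test code itself is not shown in this thread ), and block is assumed to be positive. With block = 16 the last for-loop covers 16 x 4 = 64 bytes per pass.

```c
/* Hypothetical sketch: a[i] += b[i], processed block by block.
   Assumes block is positive; n may be any non-negative length. */
void add_blocked(float *a, const float *b, int n, int block)
{
    for (int start = 0; n > start; start += block) {
        /* clamp the final block when n is not a multiple of block */
        int end = (n - start > block) ? start + block : n;

        /* the last for-loop: walks the elements of one block */
        for (int i = start; i != end; i++)
            a[i] += b[i];
    }
}
```

Calling add_blocked( a, b, n, 16 ) would correspond to the 16-element ( 64-byte ) case.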
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 14 Jun 2013 05:06:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921361#M1235</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-14T05:06:42Z</dc:date>
    </item>
    <item>
      <title>[ Note 1 - part 2 ]</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921362#M1236</link>
      <description>&lt;STRONG&gt;[ Note 1 - part 2 ]&lt;/STRONG&gt;

Here are some performance numbers:

...
Sub-Test - Adds array B to array A - Loop-Blocking Optimization Technique ( Single-Precision Floating-Point type )
...
&lt;STRONG&gt;[ Block size 16 elements ( 64 bytes ) ]&lt;/STRONG&gt;
...
ICC - Block Size: 16 - Sub-Test completed in 1516 ticks ( T1 )
MSC - Block Size: 16 - Sub-Test completed in 1531 ticks ( T2 )
MGW - Block Size: 16 - Sub-Test completed in 1546 ticks ( T3 )
...
&lt;STRONG&gt;[ Block size 32 elements ( 128 bytes ) ]&lt;/STRONG&gt;
...
ICC - Block Size: 32 - Sub-Test completed in 6422 ticks
MSC - Block Size: 32 - Sub-Test completed in 6671 ticks
MGW - Block Size: 32 - Sub-Test completed in 6734 ticks
...
&lt;STRONG&gt;[ Block size 64 elements ( 256 bytes ) ]&lt;/STRONG&gt;
...
ICC - Block Size: 64 - Sub-Test completed in 6593 ticks
MSC - Block Size: 64 - Sub-Test completed in 6735 ticks
MGW - Block Size: 64 - Sub-Test completed in 6765 ticks
...</description>
      <pubDate>Fri, 14 Jun 2013 05:07:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921362#M1236</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-14T05:07:59Z</dc:date>
    </item>
    <item>
      <title>[ Note 2 ]</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921363#M1237</link>
      <description>&lt;STRONG&gt;[ Note 2 ]&lt;/STRONG&gt;

Unrolling, or Vectorization, of the last &lt;STRONG&gt;for-loop&lt;/STRONG&gt; as 4-in-1 improves performance by ~2.3% ( the average for the three C++ compilers I tested ):

...
&lt;STRONG&gt;[ Block size 16 elements ( 64 bytes ) ]&lt;/STRONG&gt;
...
ICC - Block Size: 16 - Sub-Test completed in 1500 ticks &lt;STRONG&gt;Note:&lt;/STRONG&gt; Faster by ~1% compared to T1 ( see previous post for all Tx values )
MSC - Block Size: 16 - Sub-Test completed in 1516 ticks &lt;STRONG&gt;Note:&lt;/STRONG&gt; Faster by ~1% compared to T2
MGW - Block Size: 16 - Sub-Test completed in 1469 ticks &lt;STRONG&gt;Note:&lt;/STRONG&gt; Faster by ~5% compared to T3
...
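
A minimal sketch of what a 4-in-1 unrolled last for-loop could look like for the array-addition case. The name add_blocked_unrolled4 and its parameters are hypothetical ( the tested code is not shown in this thread ), and block is assumed to be positive.

```c
/* Hypothetical sketch: blocked a[i] += b[i] with the last for-loop
   unrolled 4-in-1. Assumes block is positive. */
void add_blocked_unrolled4(float *a, const float *b, int n, int block)
{
    for (int start = 0; n > start; start += block) {
        /* clamp the final block when n is not a multiple of block */
        int end = (n - start > block) ? start + block : n;
        int i = start;

        /* main part: four additions per iteration */
        for (; end - i >= 4; i += 4) {
            a[i]     += b[i];
            a[i + 1] += b[i + 1];
            a[i + 2] += b[i + 2];
            a[i + 3] += b[i + 3];
        }

        /* remainder when the block length is not a multiple of 4 */
        for (; i != end; i++)
            a[i] += b[i];
    }
}
```

With /O2 a compiler may already unroll or vectorize the plain loop on its own, which would be consistent with the small gains listed above.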

&lt;STRONG&gt;[ Note 3 ]&lt;/STRONG&gt;

It is possible that the &lt;STRONG&gt;Loop-Blocking Optimization Technique&lt;/STRONG&gt; won't improve performance of some processing for very large data sets, that is, when they take up gigabytes once loaded into memory and Virtual Memory ( paging file ) is used.

&lt;STRONG&gt;[ Note 4 ]&lt;/STRONG&gt;

ICC - Intel C++ compiler
MSC - Microsoft C++ compiler
MGW - MinGW C++ compiler

&lt;STRONG&gt;[ Note 5 ]&lt;/STRONG&gt;

Optimizations for speed ( /O2, or -O2 for MinGW ) were used for all C++ compilers.</description>
      <pubDate>Fri, 14 Jun 2013 05:10:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921363#M1237</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-14T05:10:34Z</dc:date>
    </item>
    <item>
      <title>[ Note 6 ]</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921364#M1238</link>
      <description>&lt;STRONG&gt;[ Note 6 ]&lt;/STRONG&gt;

There is a very good article about the &lt;STRONG&gt;Loop-Blocking Optimization Technique&lt;/STRONG&gt; at:
&lt;A href="http://software.intel.com/en-us/articles/performance-tools-for-software-developers-loop-blocking" target="_blank"&gt;http://software.intel.com/en-us/articles/performance-tools-for-software-developers-loop-blocking&lt;/A&gt;

&lt;STRONG&gt;[ Note 7 ]&lt;/STRONG&gt;

Here is another set of performance numbers ( with Intel C++ compiler ):

Sub-Test - MatrixA + MatrixB - Loop-Blocking Optimization Technique
Matrix Size: 4096x4096
...
Block Size :   2 elements (   8 bytes ) - Sub-Test completed in 1265 ticks
Block Size :   4 elements (  16 bytes ) - Sub-Test completed in  625 ticks
Block Size :   8 elements (  32 bytes ) - Sub-Test completed in  485 ticks
&lt;STRONG&gt;Block Size&lt;/STRONG&gt; :  16 elements (  &lt;STRONG&gt;64&lt;/STRONG&gt; bytes ) - Sub-Test completed in  &lt;STRONG&gt;469&lt;/STRONG&gt; ticks &amp;lt;- &lt;STRONG&gt;Best Time&lt;/STRONG&gt;
Block Size :  32 elements ( 128 bytes ) - Sub-Test completed in 1678 ticks
Block Size :  64 elements ( 256 bytes ) - Sub-Test completed in 1682 ticks
Block Size : 128 elements ( 512 bytes ) - Sub-Test completed in 1687 ticks
...</description>
      <pubDate>Sun, 16 Jun 2013 06:15:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921364#M1238</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-16T06:15:00Z</dc:date>
    </item>
    <item>
      <title>By Loop - Blocking</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921365#M1239</link>
      <description>&lt;P&gt;By Loop-Blocking Optimization Technique, do you mean dividing the data into blocks one cache line (32 bytes) long, with the inner loop iterating over every double or float value inside the cache line?&lt;/P&gt;
&lt;P&gt;Here is an example:&lt;/P&gt;
&lt;P&gt;void arrayAdditionTest2(double (*input)[MAX_SIZE],double (*output)[MAX_SIZE]){&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; /* static storage: a MAX_SIZE x MAX_SIZE local array can overflow the stack */&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; static double result[MAX_SIZE][MAX_SIZE];&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; double (*res)[MAX_SIZE] = result;&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; if(input == NULL || output == NULL)&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; return;&lt;BR /&gt;
&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; for(int i = 0;i &amp;lt; MAX_SIZE;i++){&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; for(int j = 0;j &amp;lt; MAX_SIZE;j++){&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; printf("array input[] = %.17f \n",*(*(input+i)+j));&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR /&gt;
&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; /* blocked addition; assumes MAX_SIZE is a multiple of CACHE_LINE */&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; for(int i = 0;i &amp;lt; MAX_SIZE;i+=CACHE_LINE){&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; for(int j = 0;j &amp;lt; MAX_SIZE;j+=CACHE_LINE){&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; for(int ii = i;ii &amp;lt; i + CACHE_LINE;ii++){&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; for(int jj = j;jj &amp;lt; j + CACHE_LINE;jj++){&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; *(*(res+ii)+jj) = *(*(output+ii)+jj) + *(*(input+ii)+jj);&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; printf("Loop Blocking test2 = %.17f \n",*(*(res+ii)+jj));&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR /&gt;
}&lt;/P&gt;</description>
      <pubDate>Sun, 16 Jun 2013 09:16:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921365#M1239</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-06-16T09:16:15Z</dc:date>
    </item>
    <item>
      <title>See Note 1 and Note 6.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921366#M1240</link>
      <description>See &lt;STRONG&gt;Note 1&lt;/STRONG&gt; and &lt;STRONG&gt;Note 6&lt;/STRONG&gt;.</description>
      <pubDate>Sun, 16 Jun 2013 18:35:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921366#M1240</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-16T18:35:34Z</dc:date>
    </item>
    <item>
      <title>Thanks.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921367#M1241</link>
      <description>&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Mon, 17 Jun 2013 05:11:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921367#M1241</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-06-17T05:11:13Z</dc:date>
    </item>
    <item>
      <title>[ Note 8 ]</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921368#M1242</link>
      <description>&lt;STRONG&gt;[ Note 8 ]&lt;/STRONG&gt;

There are some recommendations to use the

-fomit-frame-pointer
-fprefetch-loop-arrays

command line options of the MinGW C++ compiler in order to improve performance of processing. However, I did not see any performance gains ( especially for -fprefetch-loop-arrays ) when I tried both options.</description>
      <pubDate>Tue, 18 Jun 2013 13:38:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921368#M1242</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-18T13:38:02Z</dc:date>
    </item>
    <item>
      <title>According to gcc docs, -fomit</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921369#M1243</link>
      <description>&lt;P&gt;According to gcc docs, -fomit-frame-pointer is implied by -O, for cases where it is possible (-g would turn it off).&amp;nbsp; It seems it would be important mainly for 32-bit mode.&lt;/P&gt;
&lt;P&gt;IIRC -fprefetch-loop-arrays was designed for AMD athlon-32 CPUs.&amp;nbsp; On any current CPU, it could be useful only for specialized cases, such as where the limit on hardware prefetched streams is exceeded, or DTLB misses can be mitigated without premature cache eviction.&lt;/P&gt;
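For context, a hand-written approximation of what prefetching for a loop over arrays means, using the GCC builtin __builtin_prefetch. This is only a sketch: the function name add_with_prefetch, the AHEAD distance of 16 floats ( one 64-byte line ) and the locality hints are assumptions, and it is not what -fprefetch-loop-arrays actually emits.

```c
/* Hypothetical illustration: a[i] += b[i] with explicit software prefetch.
   AHEAD is an assumed distance in elements, not a tuned value. */
enum { AHEAD = 16 };

void add_with_prefetch(float *a, const float *b, int n)
{
    for (int i = 0; i != n; i++) {
        /* only prefetch addresses that are still inside the arrays */
        if (n - i > AHEAD) {
            __builtin_prefetch(a + i + AHEAD, 1, 3);  /* will be written */
            __builtin_prefetch(b + i + AHEAD, 0, 3);  /* will be read    */
        }
        a[i] += b[i];
    }
}
```

For a simple sequential stream like this one, the hardware prefetcher on a current CPU would normally be expected to cover the access pattern already, which matches the specialized-cases remark above.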
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 18 Jun 2013 15:16:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921369#M1243</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2013-06-18T15:16:33Z</dc:date>
    </item>
    <item>
      <title>Thanks, Tim for these</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921370#M1244</link>
      <description>Thanks, Tim, for these comments.

&amp;gt;&amp;gt;According to gcc docs, &lt;STRONG&gt;-fomit-frame-pointer&lt;/STRONG&gt; is implied by -O, for cases where it is possible (&lt;STRONG&gt;-g&lt;/STRONG&gt; would turn it off).
&amp;gt;&amp;gt;It seems it would be important mainly for 32-bit mode.

I tested it with a test application compiled in the &lt;STRONG&gt;Release&lt;/STRONG&gt; configuration; here is part of the command line options for the compiler:

...-O2 -m32 -ffast-math -fomit-frame-pointer...

and I use &lt;STRONG&gt;-g&lt;/STRONG&gt; option for &lt;STRONG&gt;Debug&lt;/STRONG&gt; configuration only:

...-O0 -m32 -g...

&amp;gt;&amp;gt;IIRC &lt;STRONG&gt;-fprefetch-loop-arrays&lt;/STRONG&gt; was designed for AMD athlon-32 CPUs...

That's good to know and I'll check docs as well.</description>
      <pubDate>Tue, 18 Jun 2013 23:57:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921370#M1244</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-18T23:57:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;fomit-frame-pointer&gt;&gt;&gt;</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921371#M1245</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;fomit-frame-pointer&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;Is that option used to load ebp register with arbitrary data? So are call stack frames then accessed with the esp register?&lt;/P&gt;</description>
      <pubDate>Wed, 19 Jun 2013 05:22:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921371#M1245</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-06-19T05:22:21Z</dc:date>
    </item>
    <item>
      <title>[ Note 9 ]</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921372#M1246</link>
      <description>&lt;STRONG&gt;[ Note 9 ]&lt;/STRONG&gt;

Here are results of tests on:

Dell Precision Mobile M4700
Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 )
Size of L3 Cache = 8MB   ( shared between all cores for data &amp;amp; instructions )
Size of L2 Cache = 1MB   ( 256KB per core / shared for data &amp;amp; instructions )
Size of L1 Cache = 256KB ( 32KB per core for data &amp;amp; 32KB per core for instructions )
Windows 7 Professional 64-bit

Two versions of the &lt;STRONG&gt;Classic&lt;/STRONG&gt; matrix multiplication algorithm were tested:

- &lt;STRONG&gt;Transposed&lt;/STRONG&gt; Based
- &lt;STRONG&gt;Loop-Blocking Optimization&lt;/STRONG&gt; Based
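
A minimal sketch of the two approaches, assuming square n x n single-precision matrices stored in row-major order. The names matmul_transposed and matmul_blocked are hypothetical, the sources of MatMulV2 and MatMulV3 are not shown in this thread, and this is one common formulation rather than the tested code. block is assumed to be positive.

```c
/* Transposed-based: bt holds the transpose of B, so the innermost loop
   reads both operands with unit stride. */
void matmul_transposed(const float *a, const float *bt, float *c, int n)
{
    for (int i = 0; i != n; i++) {
        for (int j = 0; j != n; j++) {
            float sum = 0.0f;
            for (int k = 0; k != n; k++)
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = sum;
        }
    }
}

/* Loop-blocking-based: the i, k and j ranges are split into tiles of
   `block` elements and C is accumulated tile by tile. */
void matmul_blocked(const float *a, const float *b, float *c, int n, int block)
{
    for (int i = 0; i != n * n; i++)
        c[i] = 0.0f;

    for (int ii = 0; n > ii; ii += block) {
        int i_end = (n - ii > block) ? ii + block : n;
        for (int kk = 0; n > kk; kk += block) {
            int k_end = (n - kk > block) ? kk + block : n;
            for (int jj = 0; n > jj; jj += block) {
                int j_end = (n - jj > block) ? jj + block : n;
                for (int i = ii; i != i_end; i++) {
                    for (int k = kk; k != k_end; k++) {
                        float aik = a[i * n + k];
                        for (int j = jj; j != j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Both functions compute the same product; they differ only in the order in which memory is touched.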

&lt;STRONG&gt;Matrix Size&lt;/STRONG&gt;: 2048x2048

Block Size : &lt;STRONG&gt;2&lt;/STRONG&gt; elements ( 8 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in  1325 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in 29781 ticks

Block Size : &lt;STRONG&gt;4&lt;/STRONG&gt; elements ( 16 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in  1326 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in  9142 ticks

Block Size : &lt;STRONG&gt;8&lt;/STRONG&gt; elements ( 32 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in  1339 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in  3978 ticks

Block Size : &lt;STRONG&gt;16&lt;/STRONG&gt; elements ( 64 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in  1329 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in  2637 ticks

Block Size : &lt;STRONG&gt;32&lt;/STRONG&gt; elements ( 128 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in  1336 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in  2585 ticks

Block Size : &lt;STRONG&gt;64&lt;/STRONG&gt; elements ( 256 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in  1321 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in  2543 ticks

Block Size : &lt;STRONG&gt;128&lt;/STRONG&gt; elements ( 512 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in  1323 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in  2465 ticks

Block Size : &lt;STRONG&gt;256&lt;/STRONG&gt; elements ( 1024 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in  1326 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in  2418 ticks

Block Size : &lt;STRONG&gt;512&lt;/STRONG&gt; elements ( 2048 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in  1326 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in  2371 ticks

Block Size : &lt;STRONG&gt;1024&lt;/STRONG&gt; elements ( 4096 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in  1321 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in  2372 ticks

Block Size : &lt;STRONG&gt;2048&lt;/STRONG&gt; elements ( 8192 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in  1326 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in  2356 ticks

&lt;STRONG&gt;1&lt;/STRONG&gt;. Since a cache line ( 64 bytes on this CPU ) is larger than the smaller values of the &lt;STRONG&gt;Block Size&lt;/STRONG&gt; parameter ( 2, 4 and 8 elements ), those smaller sizes don't increase performance of calculations

&lt;STRONG&gt;2&lt;/STRONG&gt;. It is important to note that in all of these tests the &lt;STRONG&gt;Classic&lt;/STRONG&gt; matrix multiplication &lt;STRONG&gt;Transposed&lt;/STRONG&gt; Based algorithm outperforms the &lt;STRONG&gt;Classic&lt;/STRONG&gt; matrix multiplication &lt;STRONG&gt;Loop-Blocking Optimization&lt;/STRONG&gt; Based algorithm</description>
      <pubDate>Thu, 20 Jun 2013 04:00:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921372#M1246</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-20T04:00:00Z</dc:date>
    </item>
    <item>
      <title>[ Summary for Loop-Blocking</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921373#M1247</link>
      <description>&lt;STRONG&gt;[ Summary for Loop-Blocking Optimization Technique on Ivy Bridge ]&lt;/STRONG&gt;

&lt;STRONG&gt;Matrix Size&lt;/STRONG&gt;: 2048x2048

Block Size : ....&lt;STRONG&gt;2&lt;/STRONG&gt; elements ( ......8 bytes ) - Completed in 29781 ticks
Block Size : ....&lt;STRONG&gt;4&lt;/STRONG&gt; elements ( .....16 bytes ) - Completed in  9142 ticks
Block Size : ....&lt;STRONG&gt;8&lt;/STRONG&gt; elements ( .....32 bytes ) - Completed in  3978 ticks
Block Size : ...&lt;STRONG&gt;16&lt;/STRONG&gt; elements ( ....64 bytes ) - Completed in  2637 ticks
Block Size : ...&lt;STRONG&gt;32&lt;/STRONG&gt; elements ( ..128 bytes ) - Completed in  2585 ticks
Block Size : ...&lt;STRONG&gt;64&lt;/STRONG&gt; elements ( ..256 bytes ) - Completed in  2543 ticks
Block Size : ..&lt;STRONG&gt;128&lt;/STRONG&gt; elements ( ..512 bytes ) - Completed in  2465 ticks
Block Size : ..&lt;STRONG&gt;256&lt;/STRONG&gt; elements ( 1024 bytes ) - Completed in  2418 ticks
Block Size : ..&lt;STRONG&gt;512&lt;/STRONG&gt; elements ( 2048 bytes ) - Completed in  2371 ticks
Block Size : &lt;STRONG&gt;1024&lt;/STRONG&gt; elements ( 4096 bytes ) - Completed in  2372 ticks
Block Size : &lt;STRONG&gt;2048&lt;/STRONG&gt; elements ( 8192 bytes ) - Completed in  2356 ticks</description>
      <pubDate>Thu, 20 Jun 2013 04:12:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921373#M1247</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-20T04:12:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...Is  that option used to</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921374#M1248</link>
      <description>&amp;gt;&amp;gt;...Is  that option used to load ebp register with arbitrary data?

This is what &lt;STRONG&gt;MSDN&lt;/STRONG&gt; says about it:

...
This option &lt;STRONG&gt;speeds function calls&lt;/STRONG&gt;, because no frame pointers need to be set up and removed. It also frees one more register, (EBP on the Intel 386 or later) for storing frequently used variables and sub-expressions.

&lt;STRONG&gt;/Oy&lt;/STRONG&gt; is only available in x86 compilers.
...</description>
      <pubDate>Thu, 20 Jun 2013 04:56:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921374#M1248</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-20T04:56:39Z</dc:date>
    </item>
    <item>
      <title>Yep, that's true, but FPO</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921375#M1249</link>
      <description>&lt;P&gt;Yep, that's true, but FPO complicates debugging.&lt;/P&gt;</description>
      <pubDate>Thu, 20 Jun 2013 05:34:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921375#M1249</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-06-20T05:34:35Z</dc:date>
    </item>
    <item>
      <title>I don't understand the note</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921376#M1250</link>
      <description>I don't understand the note about Debugging and it is Not relevant to the subject of the thread.</description>
      <pubDate>Thu, 20 Jun 2013 11:58:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921376#M1250</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-20T11:58:48Z</dc:date>
    </item>
    <item>
      <title>FPO = frame pointer omission.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921377#M1251</link>
      <description>&lt;P&gt;FPO = frame pointer omission.&lt;/P&gt;
&lt;P&gt;Sorry for offtopic post.&lt;/P&gt;</description>
      <pubDate>Thu, 20 Jun 2013 15:56:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Notes-about-Loop-Blocking-Optimization-Technique-to-increase/m-p/921377#M1251</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-06-20T15:56:51Z</dc:date>
    </item>
  </channel>
</rss>

