<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Strange behavior for matrix multiplication in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898561#M4120</link>
    <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/99850"&gt;jimdempseyatthecove&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt; &lt;BR /&gt;Is matrix a 2D array or a 1D array of pointers to 1D array?&lt;BR /&gt;&lt;BR /&gt;The idea of the code I sent was to isolate the output of rows into a local array, then in seperate loop block copy the row. The purpose of this was an attempt to reduce evictions due to false sharing (on writes). This may also have improved on SSE small vector usage.&lt;BR /&gt;&lt;BR /&gt;Collapsing the loop would not improve upon this.&lt;BR /&gt;&lt;BR /&gt;Try reducing the temp array to size3/2+1 then use two sectons&lt;BR /&gt;{ loop to produce local copy of first half of products, copy local copy to first half of output }&lt;BR /&gt;{ loop to produce local copy of second half of products, copy local copy to second half of output } &lt;BR /&gt;&lt;BR /&gt;Or if you envision larger arrays,peel-off1000 elements at a time (vary according to cache size). Your system has 4MB L2&lt;BR /&gt;&lt;BR /&gt;Jim&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;Thank you for your advice, I will get to implementing this soon.&lt;BR /&gt;Any idea why this bad performance only happens for precisely 1024 x 1024 matrices?&lt;BR /&gt;</description>
    <pubDate>Thu, 29 Oct 2009 14:21:45 GMT</pubDate>
    <dc:creator>Tudor</dc:creator>
    <dc:date>2009-10-29T14:21:45Z</dc:date>
    <item>
      <title>Strange behavior for matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898557#M4116</link>
      <description>Hi there,&lt;BR /&gt;&lt;BR /&gt;I wrote this simple algorithm to multiply then used openMP to run it in parallel. When I tested it on my core2quad Q8200 with Windows Server 2008 and 4gb ram, the behavior for two 1024 x 1024 matrices is very strange. &lt;BR /&gt;The time for 1023 x 1023 and 1025 x 1025 is much smaller both sequentially and in parallel, but the time for 1024 x 1024 is huge. Also, the scaling in this case is horrible on 4 cores, barely 33% increased performance. I suspect it has something to do with the cache, but I don't know enough about this subject. &lt;BR /&gt;&lt;BR /&gt;Here is the code:&lt;BR /&gt;
&lt;PRE&gt;[cpp]void OpenMPMatrixMultiply()
{
	int i, j, k;

#pragma omp parallel for private(j, k)
    for (i = 0; i &amp;lt; size1; i++)
    {
        for (j = 0; j &amp;lt; size3; j++)
        {
            int partial = 0;
            for (k = 0; k &amp;lt; size2; k++)
            {
                partial += matrix1&lt;I&gt;&lt;K&gt; * matrix2&lt;K&gt;&lt;J&gt;;
            }
            result1&lt;I&gt;&lt;J&gt; += partial;
        }
    }
}[/cpp]&lt;/J&gt;&lt;/I&gt;&lt;/J&gt;&lt;/K&gt;&lt;/K&gt;&lt;/I&gt;&lt;/PRE&gt;
&lt;BR /&gt;Any help would be appreciated.</description>
      <pubDate>Thu, 29 Oct 2009 11:01:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898557#M4116</guid>
      <dc:creator>Tudor</dc:creator>
      <dc:date>2009-10-29T11:01:19Z</dc:date>
    </item>
    <item>
      <title>Re: Strange behavior for matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898558#M4117</link>
      <description>&lt;DIV style="margin: 0px; height: auto;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;I see everything is not proceeding as you have forseen.&lt;BR /&gt;&lt;BR /&gt;Try buffering your writes. Something along the line of&lt;BR /&gt;&lt;BR /&gt;
&lt;PRE&gt;[cpp]01.void OpenMPMatrixMultiply()   
02.{   
03.    int i, j, k;   
04.  
05.#pragma omp parallel for private(j, k)   
06.    for (i = 0; i &amp;lt; size1; i++)   
07.    {&lt;BR /&gt;           __declspec(align(64))   
&lt;BR /&gt;           int result1Temp[size3];
08.        for (j = 0; j &amp;lt; size3; j++)   
09.        {   
10.            int partial = 0;   
11.            for (k = 0; k &amp;lt; size2; k++)   
12.            {   
13.                partial += matrix1&lt;I&gt;&lt;K&gt; * matrix2&lt;K&gt;&lt;J&gt;;   
14.            }   
15.            result1Temp&lt;J&gt; = partial;   
16.        }   
           for (j = 0; j &amp;lt; size3; j++)   
           {   
             result1&lt;I&gt;&lt;J&gt; += result1Temp&lt;J&gt;;   
           }   
17.    }   
18.}  
[/cpp]&lt;/J&gt;&lt;/J&gt;&lt;/I&gt;&lt;/J&gt;&lt;/J&gt;&lt;/K&gt;&lt;/K&gt;&lt;/I&gt;&lt;/PRE&gt;
&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Thu, 29 Oct 2009 12:30:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898558#M4117</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2009-10-29T12:30:36Z</dc:date>
    </item>
    <item>
      <title>Re: Strange behavior for matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898559#M4118</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/99850"&gt;jimdempseyatthecove&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt; &lt;BR /&gt;I see everything is not proceeding as you have forseen.&lt;BR /&gt;&lt;BR /&gt;Try buffering your writes. Something along the line of&lt;BR /&gt;&lt;BR /&gt;
&lt;/EM&gt;&lt;PRE&gt;&lt;EM&gt;[cpp]01.void OpenMPMatrixMultiply()   &lt;BR /&gt;02.{   &lt;BR /&gt;03.    int i, j, k;   &lt;BR /&gt;04.  &lt;BR /&gt;05.#pragma omp parallel for private(j, k)   &lt;BR /&gt;06.    for (i = 0; i &amp;lt; size1; i++)   &lt;BR /&gt;07.    {&lt;BR /&gt;           __declspec(align(64))   &lt;BR /&gt;&lt;BR /&gt;           int result1Temp[size3];&lt;BR /&gt;08.        for (j = 0; j &amp;lt; size3; j++)   &lt;BR /&gt;09.        {   &lt;BR /&gt;10.            int partial = 0;   &lt;BR /&gt;11.            for (k = 0; k &amp;lt; size2; k++)   &lt;BR /&gt;12.            {   &lt;BR /&gt;13.                partial += matrix1&lt;I&gt;&lt;K&gt; * matrix2&lt;K&gt;&lt;J&gt;;   &lt;BR /&gt;14.            }   &lt;BR /&gt;15.            result1Temp&lt;J&gt; = partial;   &lt;BR /&gt;16.        }   &lt;BR /&gt;           for (j = 0; j &amp;lt; size3; j++)   &lt;BR /&gt;           {   &lt;BR /&gt;             result1&lt;I&gt;&lt;J&gt; += result1Temp&lt;J&gt;;   &lt;BR /&gt;           }   &lt;BR /&gt;17.    }   &lt;BR /&gt;18.}  &lt;BR /&gt;[/cpp]&lt;/J&gt;&lt;/J&gt;&lt;/I&gt;&lt;/J&gt;&lt;/J&gt;&lt;/K&gt;&lt;/K&gt;&lt;/I&gt;&lt;/EM&gt;&lt;/PRE&gt;
&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;&lt;BR /&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;How about merging the 1st and 3rd for statements like this:&lt;BR /&gt;&lt;BR /&gt;
&lt;PRE&gt;[cpp]void OpenMPMatrixMultiply2()&lt;BR /&gt;{&lt;BR /&gt;	int i, k, line, column;&lt;BR /&gt;#pragma omp parallel for private(i, line, column)&lt;BR /&gt;	for(k = 0; k &amp;lt; size1 * size3; k++)&lt;BR /&gt;	{&lt;BR /&gt;		line = k / size3;&lt;BR /&gt;		column = k % size3;&lt;BR /&gt;		int partial = 0;&lt;BR /&gt;		for(i = 0; i &amp;lt; size2; i++)&lt;BR /&gt;		{&lt;BR /&gt;			partial += matrix1[line]&lt;I&gt; * matrix2&lt;I&gt;[column];&lt;BR /&gt;		}&lt;BR /&gt;		result[line][column] = partial;&lt;BR /&gt;	}&lt;BR /&gt;}[/cpp]&lt;/I&gt;&lt;/I&gt;&lt;/PRE&gt;
&lt;BR /&gt;Sadly, none of these two modifications seems to produce any difference in the run time.&lt;BR /&gt;</description>
      <pubDate>Thu, 29 Oct 2009 12:39:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898559#M4118</guid>
      <dc:creator>Tudor</dc:creator>
      <dc:date>2009-10-29T12:39:58Z</dc:date>
    </item>
    <item>
      <title>Re: Strange behavior for matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898560#M4119</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;Is matrix a 2D array or a 1D array of pointers to 1D array?&lt;BR /&gt;&lt;BR /&gt;The idea of the code I sent was to isolate the output of rows into a local array, then in seperate loop block copy the row. The purpose of this was an attempt to reduce evictions due to false sharing (on writes). This may also have improved on SSE small vector usage.&lt;BR /&gt;&lt;BR /&gt;Collapsing the loop would not improve upon this.&lt;BR /&gt;&lt;BR /&gt;Try reducing the temp array to size3/2+1 then use two sectons&lt;BR /&gt;{ loop to produce local copy of first half of products, copy local copy to first half of output }&lt;BR /&gt;{ loop to produce local copy of second half of products, copy local copy to second half of output } &lt;BR /&gt;&lt;BR /&gt;Or if you envision larger arrays,peel-off1000 elements at a time (vary according to cache size). Your system has 4MB L2&lt;BR /&gt;&lt;BR /&gt;Jim&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Thu, 29 Oct 2009 13:05:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898560#M4119</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2009-10-29T13:05:53Z</dc:date>
    </item>
    <item>
      <title>Re: Strange behavior for matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898561#M4120</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/99850"&gt;jimdempseyatthecove&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt; &lt;BR /&gt;Is matrix a 2D array or a 1D array of pointers to 1D array?&lt;BR /&gt;&lt;BR /&gt;The idea of the code I sent was to isolate the output of rows into a local array, then in seperate loop block copy the row. The purpose of this was an attempt to reduce evictions due to false sharing (on writes). This may also have improved on SSE small vector usage.&lt;BR /&gt;&lt;BR /&gt;Collapsing the loop would not improve upon this.&lt;BR /&gt;&lt;BR /&gt;Try reducing the temp array to size3/2+1 then use two sectons&lt;BR /&gt;{ loop to produce local copy of first half of products, copy local copy to first half of output }&lt;BR /&gt;{ loop to produce local copy of second half of products, copy local copy to second half of output } &lt;BR /&gt;&lt;BR /&gt;Or if you envision larger arrays,peel-off1000 elements at a time (vary according to cache size). Your system has 4MB L2&lt;BR /&gt;&lt;BR /&gt;Jim&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;Thank you for your advice, I will get to implementing this soon.&lt;BR /&gt;Any idea why this bad performance only happens for precisely 1024 x 1024 matrices?&lt;BR /&gt;</description>
      <pubDate>Thu, 29 Oct 2009 14:21:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898561#M4120</guid>
      <dc:creator>Tudor</dc:creator>
      <dc:date>2009-10-29T14:21:45Z</dc:date>
    </item>
    <item>
      <title>Re: Strange behavior for matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898562#M4121</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;Sounds like a false sharing cache problem.&lt;BR /&gt;&lt;BR /&gt;1024x1024x4 is a multiple of 4MB, so is 1024x1024x8&lt;BR /&gt;&lt;BR /&gt;Try the following:&lt;BR /&gt;&lt;BR /&gt;Get the size of the L2 cache (4MB I think)&lt;BR /&gt;divide it by the number of threads to get byte size of cache&lt;BR /&gt;When you allocate the thread private arrays (assuming allocated ~together)&lt;BR /&gt;&lt;BR /&gt; char* x = new char[cacheSize/nThreads]; // padd&lt;BR /&gt; double* Array = new double[ArraySize]; // real array&lt;BR /&gt; delete []x; // release padd&lt;BR /&gt;&lt;BR /&gt;Now the beginning of the arrays will not be cache aligned.&lt;BR /&gt;(adjust the above if you are using alligned_malloc)&lt;BR /&gt;&lt;BR /&gt;You should have the idea. You can complicate the technique later (i.e. small arrays do not require padd when all arrays fit inside cache).&lt;BR /&gt;&lt;BR /&gt;Jim&lt;BR /&gt;</description>
      <pubDate>Thu, 29 Oct 2009 20:48:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898562#M4121</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2009-10-29T20:48:59Z</dc:date>
    </item>
    <item>
      <title>Re: Strange behavior for matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898563#M4122</link>
      <description>&lt;DIV style="margin: 0px; height: auto;"&gt;&lt;/DIV&gt;
These CPUs are known for an aliasing problem in cache mapping, where cache lines spaced by 8MB (4MB on some steppings, 64KB on early P4) would evict each other. That's not what is normally known as false sharing, but the effect is likely to be similar, or worse, if the conflicts occur between threads on the same cache.&lt;BR /&gt;The quoted (pseudo?) code seems reasonably well organized to avoid standard false sharing. The evident problem is in each value of column referring to a different cache line, and the likelihood of thrashing the local TLB.&lt;BR /&gt;Without a data type specification, as touched upon earlier in the thread, I don't see how this can be viewed as more than pseudo-code, although I guess it's reasonable to assume plain ints.&lt;BR /&gt;Performance library implementations of matrix multiplication are ubiquitous, and would include one or another method to improve cache locality, of course with vectorization if the data type matches the parallel arithmetic instructions of the CPU.&lt;BR /&gt;</description>
      <pubDate>Thu, 29 Oct 2009 22:16:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898563#M4122</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2009-10-29T22:16:17Z</dc:date>
    </item>
    <item>
      <title>Re: Strange behavior for matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898564#M4123</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/435330"&gt;Tudor&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
Thank you for your advice, I will get to implementing this soon.&lt;BR /&gt;Any idea why this bad performance only happens for precisely 1024 x 1024 matrices?&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;&lt;STRONG&gt;Probably&lt;/STRONG&gt; it has something to do with limited associativity of lower levels of cache.&lt;BR /&gt;For example, if you have 32k L1D$ with 2-way associativity, then you can hold in the cache at most 2 memory locations which address p is (p % 32k) == C.&lt;BR /&gt;Addresses of values M[i, j] and M[i + 1, j] are separated by exactly 4k (8k), when matrix is of size 1024. So you can hold in the cache at most 16 (8) locations from single matrix column.&lt;BR /&gt;While if matrix size is 1023 or 1025, then values from single column do not conflict in the "(p % 32k) == C" sense, so you can hold in the cache the whole matrix column.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 30 Oct 2009 07:01:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898564#M4123</guid>
      <dc:creator>Dmitry_Vyukov</dc:creator>
      <dc:date>2009-10-30T07:01:29Z</dc:date>
    </item>
    <item>
      <title>Re: Strange behavior for matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898565#M4124</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="margin-top: 5px; width: 100%;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/347331"&gt;Dmitriy Vyukov&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;&lt;STRONG&gt;Probably&lt;/STRONG&gt; it has something to do with limited associativity of lower levels of cache.&lt;BR /&gt;For example, if you have 32k L1D$ with 2-way associativity, then you can hold in the cache at most 2 memory locations which address p is (p % 32k) == C.&lt;BR /&gt;Addresses of values M[i, j] and M[i + 1, j] are separated by exactly 4k (8k), when matrix is of size 1024. So you can hold in the cache at most 16 (8) locations from single matrix column.&lt;BR /&gt;While if matrix size is 1023 or 1025, then values from single column do not conflict in the "(p % 32k) == C" sense, so you can hold in the cache the whole matrix column.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;Tudor,&lt;BR /&gt;&lt;BR /&gt;Can you allocate larger than what is required?&lt;BR /&gt;IOW Set the row length (number of columns allocated, not referenced) to an odd count (or multiple of 3, 5, 7).&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;Addresses of values M[i, j] and M[i + 1, j] &lt;BR /&gt;&lt;/EM&gt;&lt;BR /&gt;Would then always have a TLB skew (could be stairstep, but skewed none the less)&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;</description>
      <pubDate>Fri, 30 Oct 2009 13:58:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898565#M4124</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2009-10-30T13:58:10Z</dc:date>
    </item>
    <item>
      <title>Re: Strange behavior for matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898566#M4125</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/99850"&gt;jimdempseyatthecove&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt; &lt;BR /&gt;Tudor,&lt;BR /&gt;&lt;BR /&gt;Can you allocate larger than what is required?&lt;BR /&gt;IOW Set the row length (number of columns allocated, not referenced) to an odd count (or multiple of 3, 5, 7).&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;Addresses of values M[i, j] and M[i + 1, j] &lt;BR /&gt;&lt;/EM&gt;&lt;BR /&gt;Would then always have a TLB skew (could be stairstep, but skewed none the less)&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;Indeed, it seems that simply allocating an odd number, like size + (size % 2 + 1), solves the problem.&lt;BR /&gt;Also, my original program had bidimensional static allocation of int matrix[1024][1024]. If I change it to array of pointers, then it works ok for the mentioned values, but it obviously yields worse performance.&lt;BR /&gt;Thank you for the cache associativity lesson, that was very informative. It's funny how in college they tell you that the programmer need not worry with the low level issues of hardware implementation, yet even a simple problem such as matrix multiplication can cause such problems.&lt;BR /&gt;</description>
      <pubDate>Fri, 30 Oct 2009 16:03:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898566#M4125</guid>
      <dc:creator>Tudor</dc:creator>
      <dc:date>2009-10-30T16:03:52Z</dc:date>
    </item>
    <item>
      <title>Re: Strange behavior for matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898567#M4126</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;BR /&gt;&lt;EM&gt;&amp;gt;&amp;gt;It's funny how in college they tell you that the programmer need not worry with the low level issues of hardware implementation, yet even a simple problem such as matrix multiplication can cause such problems.&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;This is the difference between theory and practice. QED (Quite Easily Doh!)&lt;BR /&gt;&lt;BR /&gt;In 10 years this may not make a difference (there will be enough circuit space to remove the n-way associativity from the cache) - all "RAM" will be as fast as L1 cache. (IOW no cache on systems).&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;</description>
      <pubDate>Fri, 30 Oct 2009 19:43:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898567#M4126</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2009-10-30T19:43:39Z</dc:date>
    </item>
    <item>
      <title>Re: Strange behavior for matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898568#M4127</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/347331"&gt;Dmitriy Vyukov&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt; &lt;STRONG&gt;Probably&lt;/STRONG&gt; it has something to do with limited associativity of lower levels of cache.&lt;BR /&gt;For example, if you have 32k L1D$ with 2-way associativity, then you can hold in the cache at most 2 memory locations which address p is (p % 32k) == C.&lt;BR /&gt;Addresses of values M[i, j] and M[i + 1, j] are separated by exactly 4k (8k), when matrix is of size 1024. So you can hold in the cache at most 16 (8) locations from single matrix column.&lt;BR /&gt;While if matrix size is 1023 or 1025, then values from single column do not conflict in the "(p % 32k) == C" sense, so you can hold in the cache the whole matrix column.&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;There are indeed some non-linear effects. You may experiment with the following program to investigate them. It accesses N memory locations in a cycle, memory locations are separated by M bytes. I will provide my output for N=1/2/4/8/16/32/64, M=32768/16384/8192/4096/2048/1024 and 32896/16448/8224/4112/2056/1028.&lt;BR /&gt;&lt;BR /&gt;
&lt;PRE&gt;[cpp]#include "stdafx.h"

#include &lt;WINDOWS.H&gt;
#include &lt;INTRIN.H&gt;
#include &lt;STDIO.H&gt;

template&lt;SIZE_T block_size=""&gt;
struct iter
{
    __forceinline static void run(char volatile* mem)
    {
        char val = mem[idx * block_size]; // read
        //mem[idx * block_size] = 1; // write
        //mem[idx * block_size] += 1; // read-write
        iter&lt;BLOCK_SIZE&gt;::run(mem);
    }
};

template&lt;SIZE_T block_size=""&gt;
struct iter&lt;BLOCK_SIZE&gt;
{
    __forceinline static void run(char volatile* mem)
    {
    }
};

template&lt;SIZE_T block_size=""&gt;
struct test
{
    static size_t const iter_count = 50000000;

    static void run(char* mem)
    {
        iter&lt;BLOCK_SIZE&gt;::run(mem);
        unsigned __int64 t0 = __rdtsc();
        body(mem);
        unsigned __int64 time = __rdtsc() - t0;
        unsigned cost = (unsigned)(time / iter_count);
        printf("location count #%u: %un", count, cost);
        test&lt;BLOCK_SIZE&gt;::run(mem);
    }

    __declspec(noinline) static void body(char volatile* mem)
    {
        for (size_t i = 0; i != iter_count / count / 4; i += 1)
        {
            iter&lt;BLOCK_SIZE&gt;::run(mem);
            iter&lt;BLOCK_SIZE&gt;::run(mem);
            iter&lt;BLOCK_SIZE&gt;::run(mem);
            iter&lt;BLOCK_SIZE&gt;::run(mem);
        }
    }
};

template&lt;SIZE_T block_size=""&gt;
struct test&lt;BLOCK_SIZE&gt;
{
    static void run(char* mem)
    {
    }
};

template&lt;SIZE_T block_size=""&gt;
struct suite
{
    static void run(char* mem)
    {
        printf("nblock_size=%un", block_size);
        test&lt;BLOCK_SIZE&gt;::run(mem);
        suite&amp;lt;(block_size / 2), count, (block_size / 2 &amp;lt; 1024)&amp;gt;::run(mem);
    }
};

template&lt;SIZE_T block_size=""&gt;
struct suite&lt;BLOCK_SIZE&gt;
{
    static void run(char* mem)
    {
    }
};

int main()
{
    size_t const block_size1 = 32*1024;
    size_t const block_size2 = 32*1024 + 128;
    size_t const test_count = 64;

    char* mem = (char*)VirtualAlloc(0, test_count * block_size2, MEM_LARGE_PAGES | MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (mem == 0)
    {
        printf("failed to allocate large pagesn");
        mem = (char*)VirtualAlloc(0, test_count * block_size2, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        if (mem == 0)
            return printf("failed to allocate small pages toon");
    }

    suite&lt;BLOCK_SIZE1&gt;::run(mem);
    suite&lt;BLOCK_SIZE2&gt;::run(mem);
}

[/cpp]&lt;/BLOCK_SIZE2&gt;&lt;/BLOCK_SIZE1&gt;&lt;/BLOCK_SIZE&gt;&lt;/SIZE_T&gt;&lt;/BLOCK_SIZE&gt;&lt;/SIZE_T&gt;&lt;/BLOCK_SIZE&gt;&lt;/SIZE_T&gt;&lt;/BLOCK_SIZE&gt;&lt;/BLOCK_SIZE&gt;&lt;/BLOCK_SIZE&gt;&lt;/BLOCK_SIZE&gt;&lt;/BLOCK_SIZE&gt;&lt;/BLOCK_SIZE&gt;&lt;/SIZE_T&gt;&lt;/BLOCK_SIZE&gt;&lt;/SIZE_T&gt;&lt;/BLOCK_SIZE&gt;&lt;/SIZE_T&gt;&lt;/STDIO.H&gt;&lt;/INTRIN.H&gt;&lt;/WINDOWS.H&gt;&lt;/PRE&gt;
&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 02 Nov 2009 08:16:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898568#M4127</guid>
      <dc:creator>Dmitry_Vyukov</dc:creator>
      <dc:date>2009-11-02T08:16:54Z</dc:date>
    </item>
    <item>
      <title>Re: Strange behavior for matrix multiplication</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898569#M4128</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/347331"&gt;Dmitriy Vyukov&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt; There are indeed some non-linear effects. You may experiment with the following program to investigate them. It accesses N memory locations in a cycle, memory locations are separated by M bytes. I will provide my output for N=1/2/4/8/16/32/64, M=32768/16384/8192/4096/2048/1024 and 32896/16448/8224/4112/2056/1028.&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;BR /&gt;And here is my output:&lt;BR /&gt;&lt;BR /&gt;block_size=32768&lt;BR /&gt;location count #64: 10&lt;BR /&gt;location count #32: 4&lt;BR /&gt;location count #16: 1&lt;BR /&gt;location count #8: 1&lt;BR /&gt;location count #4: 1&lt;BR /&gt;location count #2: 1&lt;BR /&gt;location count #1: 1&lt;BR /&gt;&lt;BR /&gt;block_size=16384&lt;BR /&gt;location count #64: 4&lt;BR /&gt;location count #32: 4&lt;BR /&gt;location count #16: 1&lt;BR /&gt;location count #8: 1&lt;BR /&gt;location count #4: 1&lt;BR /&gt;location count #2: 1&lt;BR /&gt;location count #1: 1&lt;BR /&gt;&lt;BR /&gt;block_size=8192&lt;BR /&gt;location count #64: 5&lt;BR /&gt;location count #32: 4&lt;BR /&gt;location count #16: 1&lt;BR /&gt;location count #8: 1&lt;BR /&gt;location count #4: 1&lt;BR /&gt;location count #2: 1&lt;BR /&gt;location count #1: 1&lt;BR /&gt;&lt;BR /&gt;block_size=4096&lt;BR /&gt;location count #64: 5&lt;BR /&gt;location count #32: 4&lt;BR /&gt;location count #16: 1&lt;BR /&gt;location count #8: 1&lt;BR /&gt;location count #4: 1&lt;BR /&gt;location count #2: 1&lt;BR /&gt;location count #1: 1&lt;BR /&gt;&lt;BR /&gt;block_size=2048&lt;BR /&gt;location count #64: 6&lt;BR /&gt;location count #32: 3&lt;BR /&gt;location count #16: 1&lt;BR /&gt;location count #8: 1&lt;BR /&gt;location count #4: 1&lt;BR /&gt;location count #2: 1&lt;BR /&gt;location count #1: 1&lt;BR /&gt;&lt;BR /&gt;block_size=1024&lt;BR /&gt;location count #64: 4&lt;BR /&gt;location count #32: 1&lt;BR /&gt;location count #16: 1&lt;BR /&gt;location count #8: 1&lt;BR /&gt;location count #4: 1&lt;BR /&gt;location count #2: 1&lt;BR /&gt;location count #1: 1&lt;BR /&gt;&lt;BR /&gt;block_size=32896&lt;BR /&gt;location count #64: 1&lt;BR /&gt;location count #32: 1&lt;BR /&gt;location count #16: 1&lt;BR /&gt;location count #8: 1&lt;BR /&gt;location count #4: 1&lt;BR /&gt;location count #2: 1&lt;BR /&gt;location count #1: 1&lt;BR /&gt;&lt;BR /&gt;block_size=16448&lt;BR /&gt;location count #64: 1&lt;BR /&gt;location count #32: 1&lt;BR /&gt;location count #16: 1&lt;BR /&gt;location count #8: 1&lt;BR /&gt;location count #4: 1&lt;BR /&gt;location count #2: 1&lt;BR /&gt;location count #1: 1&lt;BR /&gt;&lt;BR /&gt;block_size=8224&lt;BR /&gt;location count #64: 1&lt;BR /&gt;location count #32: 1&lt;BR /&gt;location count #16: 1&lt;BR /&gt;location count #8: 1&lt;BR /&gt;location count #4: 1&lt;BR /&gt;location count #2: 1&lt;BR /&gt;location count #1: 1&lt;BR /&gt;&lt;BR /&gt;block_size=4112&lt;BR /&gt;location count #64: 1&lt;BR /&gt;location count #32: 1&lt;BR /&gt;location count #16: 1&lt;BR /&gt;location count #8: 1&lt;BR /&gt;location count #4: 1&lt;BR /&gt;location count #2: 1&lt;BR /&gt;location count #1: 1&lt;BR /&gt;&lt;BR /&gt;block_size=2056&lt;BR /&gt;location count #64: 1&lt;BR /&gt;location count #32: 1&lt;BR /&gt;location count #16: 1&lt;BR /&gt;location count #8: 1&lt;BR /&gt;location count #4: 1&lt;BR /&gt;location count #2: 1&lt;BR /&gt;location count #1: 1&lt;BR /&gt;&lt;BR /&gt;block_size=1028&lt;BR /&gt;location count #64: 1&lt;BR /&gt;location count #32: 1&lt;BR /&gt;location count #16: 1&lt;BR /&gt;location count #8: 1&lt;BR /&gt;location count #4: 1&lt;BR /&gt;location count #2: 1&lt;BR /&gt;location count #1: 1&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 02 Nov 2009 08:17:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Strange-behavior-for-matrix-multiplication/m-p/898569#M4128</guid>
      <dc:creator>Dmitry_Vyukov</dc:creator>
      <dc:date>2009-11-02T08:17:50Z</dc:date>
    </item>
  </channel>
</rss>

