<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic I think you misunderstood in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983398#M17582</link>
    <description>&lt;P&gt;I think you misunderstood what I meant with matrix offsets. The data for each image is in a single aligned array (e.g. 500x500 doubles aligned on 16/32 byte boundary, along with sizeX, sizeY, stride), but my calculation occasionally requires me to shift the data.&lt;/P&gt;
&lt;P&gt;For example, the normal matrix addition case is A'[x,y] = A[x,y] + B[x,y]. Here, alignment is fine, also since the strides of both matrices match and the elements between [sizeX ... stride] are unused, I can use vector addition to compute this.&lt;/P&gt;
&lt;P&gt;However, if I am shifting the data by a column, this becomes A'[x,y] = A[x,y] + B[x+1, y].&amp;nbsp;This calculation can be simplified to a matrix addition of two 499x499 matrices, by shifting the start offset of B' by one element, while keeping the stride the same. Now I have an aligned matrix A and an unaligned matrix B. Also, I can no longer just use vector addition because this would corrupt the last column of A (in this example, for the last column, A'[x,y] would be A[x,y] + B[0, y+1]).&lt;/P&gt;</description>
    <pubDate>Mon, 10 Jun 2013 10:39:04 GMT</pubDate>
    <dc:creator>Henrik_A_</dc:creator>
    <dc:date>2013-06-10T10:39:04Z</dc:date>
    <item>
      <title>Best function for inplace matrix addition (w. stride)</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983391#M17575</link>
      <description>&lt;P&gt;I often need to calculate the sum of a set of matrices or submatrices of a dataset. Unfortunately the two matrices do not always have the same stride, when I am selectively using a subset of a large dataset, which means I have to resort to calculating the sum by hand (alternatively, I could call vkadd or similar once per row, I'm not sure how much overhead this implies when calling vkadd 500 or 1000 times for a 500x500 matrix).&lt;/P&gt;
&lt;P&gt;I am aware of the mkl_?omatadd function, but the documentation states that the input and output arrays cannot overlap, which means I would need an extra temporary matrix. While I would assume calculating A = A + m * B works inplace when not transposing matrices, unless this can be guaranteed for all future versions I cannot use that approach.&lt;/P&gt;
&lt;P&gt;Are there any other functions which could be used for this calculation I have missed?&lt;/P&gt;</description>
      <pubDate>Thu, 06 Jun 2013 11:15:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983391#M17575</guid>
      <dc:creator>Henrik_A_</dc:creator>
      <dc:date>2013-06-06T11:15:32Z</dc:date>
    </item>
    <item>
      <title>Hi Henrik,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983392#M17576</link>
      <description>&lt;P&gt;Hi Henrik,&lt;/P&gt;
&lt;P&gt;The BLAS level 1 functions ?axpy may help you, as they perform an in-place operation on vectors: y = a*x + y. When applied row-by-row (or column-by-column) in a loop, this operation can accommodate any combination of strides. The loop may be sped up by parallelization with '#pragma omp parallel for'.&lt;/P&gt;
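<![CDATA[<P>A minimal sketch of the row-by-row approach, assuming row-major storage and strides measured in elements. The names axpy_row, add_strided and demo_add_strided are hypothetical; axpy_row is a plain C++ stand-in so the example builds without MKL, and with MKL linked the inner call would be cblas_daxpy(n, a, x, 1, y, 1):</P>
<PRE>
```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the BLAS level 1 routine: y[i] = a * x[i] + y[i] for n
// contiguous elements. With MKL this would be cblas_daxpy(n, a, x, 1, y, 1).
static void axpy_row(std::size_t n, double a, const double* x, double* y)
{
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

// In-place A = A + a * B over a rows x cols region of two row-major
// matrices whose strides (row pitches, in elements) may differ.
void add_strided(double* A, std::size_t strideA,
                 const double* B, std::size_t strideB,
                 std::size_t rows, std::size_t cols, double a)
{
    assert(strideA >= cols && strideB >= cols);
    for (std::size_t r = 0; r < rows; ++r)
        axpy_row(cols, a, B + r * strideB, A + r * strideA);
}

// Hypothetical usage: add a 2x3 matrix with stride 3 into a 2x3 region
// of a buffer with stride 5; the padding columns must stay untouched.
std::vector<double> demo_add_strided()
{
    std::vector<double> A(10, 1.0);              // 2 rows, stride 5
    std::vector<double> B = {1, 2, 3, 4, 5, 6};  // 2 rows, stride 3
    add_strided(A.data(), 5, B.data(), 3, 2, 3, 1.0);
    return A;
}
```
</PRE>
<P>Only the first cols elements of each row are touched, so the padding between the row width and the stride is left alone even when the two strides differ.</P>]]>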
&lt;P&gt;Dima&lt;/P&gt;</description>
      <pubDate>Fri, 07 Jun 2013 03:34:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983392#M17576</guid>
      <dc:creator>Dmitry_B_Intel</dc:creator>
      <dc:date>2013-06-07T03:34:47Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;I often need to calculate</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983393#M17577</link>
      <description>&amp;gt;&amp;gt;I often need to calculate the sum of a set of matrices or submatrices of a dataset...

Matrix additions and subtractions are at the core of Strassen's algorithm for matrix multiplication. I've spent a significant amount of time on implementation ( 4 different versions ) and optimization of these algorithms. I'd like to give you two really small examples:

&lt;STRONG&gt;[ Version 1 - Template based compiled with /O2 or /O3 ( aggressive ) optimizations ]&lt;/STRONG&gt;
...
	inline RTvoid &lt;STRONG&gt;Add&lt;/STRONG&gt;( ... )
	{
		#ifdef _RTMATRIXSET_DIAGNOSTICS
		RTuint64 uiClockS = CrtRdtsc();
		#endif

		RTuint i;
		RTuint j;

		for( i = 0; i &amp;lt; uiSize; i++ )
		{
			// Unrolled by 4: assumes uiSize is a multiple of 4
			for( j = 0; j &amp;lt; uiSize; j += 4 )
			{
				register T tS0 = tA[i][j  ] + tB[i][j  ];
				register T tS1 = tA[i][j+1] + tB[i][j+1];
				register T tS2 = tA[i][j+2] + tB[i][j+2];
				register T tS3 = tA[i][j+3] + tB[i][j+3];
				tC[i][j  ] = tS0;
				tC[i][j+1] = tS1;
				tC[i][j+2] = tS2;
				tC[i][j+3] = tS3;
			}
		}

		#ifdef _RTMATRIXSET_DIAGNOSTICS
		RTuint64 uiClockE = CrtRdtsc();
		CrtPrintf( RTU("Add - Completed in %.3f ms\n"), ( RTfloat )( uiClockE - uiClockS ) / 1000000.0f );
		#endif
	};
...

and

&lt;STRONG&gt;[ Version 2 - IPP based with ippsAdd_32f function ]&lt;/STRONG&gt;
...
	inline RTvoid &lt;STRONG&gt;Add&lt;/STRONG&gt;( ... )
	{
		#ifdef _RTMATRIXSET_DIAGNOSTICS
		RTuint64 uiClockS = CrtRdtsc();
		#endif

		RTuint i;

		for( i = 0; i &amp;lt; uiSize; i++ )
		{
			::&lt;STRONG&gt;ippsAdd_32f&lt;/STRONG&gt;( ( const float * )&amp;amp;tA[i][0], ( const float * )&amp;amp;tB[i][0], ( float * )&amp;amp;tC[i][0], ( RTint )uiSize );
		}

		#ifdef _RTMATRIXSET_DIAGNOSTICS
		RTuint64 uiClockE = CrtRdtsc();
		CrtPrintf( RTU("Add - Completed in %.3f ms\n"), ( RTfloat )( uiClockE - uiClockS ) / 1000000.0f );
		#endif
	};
...

After extensive testing on several hardware platforms, such as Ivy Bridge, Atom and Pentium 4, &lt;STRONG&gt;I didn't see a significant difference&lt;/STRONG&gt; in performance between these two very simple functions. I could easily post real performance numbers for any of the platforms mentioned ( if you need them, of course ).

&amp;gt;&amp;gt;...I am aware of the &lt;STRONG&gt;mkl_?omatadd&lt;/STRONG&gt; function, but the documentation states that the input and output arrays
&amp;gt;&amp;gt;cannot overlap...

I'm considering trying that function ( for tests only ).</description>
      <pubDate>Fri, 07 Jun 2013 05:09:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983393#M17577</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-07T05:09:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...The loop may be sped up</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983394#M17578</link>
      <description>&amp;gt;&amp;gt;...The loop may be sped up by &lt;STRONG&gt;parallelization with '#pragma omp parallel for'&lt;/STRONG&gt;...

It is very useful for large matrices, but there is a question here: does it make sense to do it for two 500x500 matrices?

Tomorrow I'll post performance numbers for the addition of two 512x512 matrices ( without OpenMP ) on an Ivy Bridge system.</description>
      <pubDate>Fri, 07 Jun 2013 05:24:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983394#M17578</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-07T05:24:52Z</dc:date>
    </item>
    <item>
      <title>Thanks for the replies.</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983395#M17579</link>
      <description>&lt;P&gt;Thanks for the replies.&lt;/P&gt;
&lt;P&gt;Dmitry: I think that would be almost identical to using vkadd, the blas function has the additional scaling factor but I am assuming it also contains an optimized case for unscaled addition.&lt;/P&gt;
&lt;P&gt;Sergey: That code actually looks very similar to my current approach. I have a function which does addition of double vectors using unrolled SSE intrinsics, and I am calling that function on a row-by-row basis. Assuming sufficient compiler optimization, the resulting asm of your first function should look very similar (ignoring the missing special cases for lengths != 4 * N). My main problem is when I have to offset one of the matrices by an odd number of columns and the other by an even number of columns; then the data alignment can't be matched and I have to fall back to slower code.&lt;/P&gt;
&lt;P&gt;I must admit I haven't tested multithreading yet. I have been working under the assumption that the overhead for spinning up/switching to threads is larger than the savings for these small matrix sizes.&lt;/P&gt;</description>
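<![CDATA[<P>To make the mismatched-alignment case concrete: one option is to keep aligned loads and stores for the destination row and use unaligned loads only for the shifted source row. This is a minimal sketch under stated assumptions, not the code discussed in this thread: it assumes SSE2 and a 16-byte-aligned destination, falls back to a scalar loop elsewhere, and add_row_unaligned_src and demo_shifted_sum are hypothetical names rather than MKL or IPP functions:</P>
<PRE>
```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

// a[i] += b[i] for n doubles.
// Precondition: `a` is 16-byte aligned; `b` may have any alignment.
void add_row_unaligned_src(double* a, const double* b, std::size_t n)
{
    std::size_t i = 0;
#ifdef HAVE_SSE2
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_load_pd(a + i);    // aligned load from the destination
        __m128d vb = _mm_loadu_pd(b + i);   // unaligned load from the shifted source
        _mm_store_pd(a + i, _mm_add_pd(va, vb));
    }
#endif
    for (; i < n; ++i)                      // scalar tail, or full fallback without SSE2
        a[i] += b[i];
}

// Hypothetical usage: the source is shifted by one column, so b + 1 is
// only 8-byte aligned while a stays 16-byte aligned.
double demo_shifted_sum()
{
    alignas(16) double a[8] = {0, 0, 0, 0, 0, 0, 0, 0};
    alignas(16) double b[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    add_row_unaligned_src(a, b + 1, 7);
    return a[0] + a[6];                     // b[1] + b[7]
}
```
</PRE>
<P>How much the unaligned loads cost compared with a fully aligned loop depends on the CPU, so this is worth measuring on the target hardware rather than assuming.</P>]]>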
      <pubDate>Fri, 07 Jun 2013 15:05:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983395#M17579</guid>
      <dc:creator>Henrik_A_</dc:creator>
      <dc:date>2013-06-07T15:05:35Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...My main problem is when</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983396#M17580</link>
      <description>&amp;gt;&amp;gt;...My main problem is when I have to offset one of the matrices by an odd number of columns and the other by an even number of
&amp;gt;&amp;gt;columns, &lt;STRONG&gt;then the data alignment can't be matched&lt;/STRONG&gt; and I have to fall back to slower code...

It is not clear how you've created your matrices. I'd like to share some experience; I use two different versions:

&lt;STRONG&gt;[ Version 1 ]&lt;/STRONG&gt;

&lt;STRONG&gt;Some template class for a Matrix&lt;/STRONG&gt;
{
...
	_RTALIGN32 T *m_ptData1D;
	_RTALIGN32 T **m_ptData2D;
...
};

&lt;STRONG&gt;Some method for initialization&lt;/STRONG&gt;
...
		m_ptData1D = ( T * )CrtMalloc( m_uiSize * sizeof( T ) );
		if( m_ptData1D == RTnull )
			return ( RTbool )RTfalse;

		m_ptData2D = ( T ** )CrtMalloc( m_uiRows * sizeof( T * ) );
		if( m_ptData2D == RTnull )
			return ( RTbool )RTfalse;

		T *ptData = m_ptData1D;

		// Point each row at its slice of the contiguous 1D block
		for( RTuint i = 0; i &amp;lt; m_uiRows; i++ )
		{
			m_ptData2D[i] = ptData;
			ptData += m_uiCols;
		}
...

As you can see, there are two pointers, m_ptData1D and m_ptData2D, and the 1D array underlying the 2D array is contiguous and always aligned.

&lt;STRONG&gt;[ Version 2 ]&lt;/STRONG&gt;

&lt;STRONG&gt;Some template class for a Data set of two matrices&lt;/STRONG&gt;
{
...
	_RTALIGN32 T **Tmp[2];
...
};

&lt;STRONG&gt;Some method for initialization&lt;/STRONG&gt;
...
			for( i = 0; i &amp;lt; 2; i++ )
			{
				Tmp[i] = ( T ** )CrtCalloc( uiSize, sizeof( T * ) );
				if( Tmp[i] == RTnull )
					return ( RTbool )RTfalse;
				for( j = 0; j &amp;lt; uiSize; j++ )
				{
					Tmp[i][j] = ( T * )CrtCalloc( uiSize, sizeof( T ) );
					if( Tmp[i][j] == RTnull )
						return ( RTbool )RTfalse;
				}
			}
...

As you can see, during initialization an array of row pointers is allocated first, and then every row of uiSize elements of type T is allocated separately ( this represents a 2-D matrix, a 2-D data set, or a 2-D image ).

In both cases all pointers are aligned, and with aggressive optimizations by C++ compilers ( any! ) the speed-ups are significant (!). I remember that the non-optimized, non-aligned versions ran for about 29 minutes in some cases. With all optimizations on, the same code runs in less than 3 minutes.</description>
      <pubDate>Fri, 07 Jun 2013 23:01:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983396#M17580</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-07T23:01:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;My main problem is when I</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983397#M17581</link>
      <description>&amp;gt;&amp;gt;My main problem is when I have to offset one of the matrices by an odd number of columns and the other by an even
&amp;gt;&amp;gt;number of columns, then the data alignment can't be matched and I have to fall back to slower code...

Henrik,

Let me know if you need a demo ( a small test case ) that demonstrates how to use the &lt;STRONG&gt;Version 1&lt;/STRONG&gt; technique, that is, where the 1D array underlying the 2D array is contiguous and always aligned.</description>
      <pubDate>Sat, 08 Jun 2013 01:08:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983397#M17581</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-08T01:08:21Z</dc:date>
    </item>
    <item>
      <title>I think you misunderstood</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983398#M17582</link>
      <description>&lt;P&gt;I think you misunderstood what I meant with matrix offsets. The data for each image is in a single aligned array (e.g. 500x500 doubles aligned on 16/32 byte boundary, along with sizeX, sizeY, stride), but my calculation occasionally requires me to shift the data.&lt;/P&gt;
&lt;P&gt;For example, the normal matrix addition case is A'[x,y] = A[x,y] + B[x,y]. Here, alignment is fine, also since the strides of both matrices match and the elements between [sizeX ... stride] are unused, I can use vector addition to compute this.&lt;/P&gt;
&lt;P&gt;However, if I am shifting the data by a column, this becomes A'[x,y] = A[x,y] + B[x+1, y].&amp;nbsp;This calculation can be simplified to a matrix addition of two 499x499 matrices, by shifting the start offset of B' by one element, while keeping the stride the same. Now I have an aligned matrix A and an unaligned matrix B. Also, I can no longer just use vector addition because this would corrupt the last column of A (in this example, for the last column, A'[x,y] would be A[x,y] + B[0, y+1]).&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jun 2013 10:39:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983398#M17582</guid>
      <dc:creator>Henrik_A_</dc:creator>
      <dc:date>2013-06-10T10:39:04Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...However, if I am</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983399#M17583</link>
      <description>&amp;gt;&amp;gt;...However, if I am shifting the data by a column, this becomes A'[x,y] = A[x,y] + B[x+1, y]. This calculation can be
&amp;gt;&amp;gt;simplified to a matrix addition of two 499x499 matrices, by shifting the start offset of B' by one element, while keeping
&amp;gt;&amp;gt;the stride the same. Now I have an aligned matrix A and an unaligned matrix B...

Would you be able to create a generic reproducer of the problem?</description>
      <pubDate>Tue, 11 Jun 2013 01:27:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983399#M17583</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-11T01:27:05Z</dc:date>
    </item>
    <item>
      <title>Sure, pseudo C++</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983400#M17584</link>
      <description>&lt;P&gt;Sure, pseudo C++&lt;/P&gt;
&lt;P&gt;struct Matrix&lt;BR /&gt;{&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; int width, height, stride;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; double *data;&lt;BR /&gt;};&lt;/P&gt;
&lt;P&gt;void AddToMatrix(Matrix *destMatrix, Matrix *sourceMatrix, long offsetX, long offsetY)&lt;BR /&gt;{&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; // Skip parameter / size verification&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; if (offsetX == 0 &amp;amp;&amp;amp; offsetY == 0)&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (unsigned long y=0;y&amp;lt;sourceMatrix-&amp;gt;height;++y)&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (unsigned long x=0;x&amp;lt;sourceMatrix-&amp;gt;width;++x)&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; destMatrix-&amp;gt;data[y*destMatrix-&amp;gt;stride+x] = destMatrix-&amp;gt;data[y*destMatrix-&amp;gt;stride+x] + sourceMatrix-&amp;gt;data[y*sourceMatrix-&amp;gt;stride+x];&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; return;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; // Copies of the structs only; the data pointers still refer to the original buffers&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Matrix clippedDestMatrix = *destMatrix;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Matrix clippedSourceMatrix = *sourceMatrix;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; if (offsetX != 0)&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; clippedDestMatrix.width -= abs(offsetX);&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; clippedSourceMatrix.width -= abs(offsetX);&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; if (offsetX &amp;lt; 0)&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; clippedSourceMatrix.data = clippedSourceMatrix.data + (-offsetX);&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; else&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; clippedDestMatrix.data = clippedDestMatrix.data + offsetX;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; // ditto for Y&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; // Offsets are now folded into the clipped views, so add with zero offsets&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; AddToMatrix(&amp;amp;clippedDestMatrix, &amp;amp;clippedSourceMatrix, 0, 0);&lt;BR /&gt;}&lt;/P&gt;
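<![CDATA[<P>For reference, a compact runnable variant of the pseudo code above. This is a sketch with a few assumptions that the pseudo code leaves open: both matrices have the same size, the Y offset is handled the same way as X but scaled by the stride, and the helper names AddRows and demo_shift are hypothetical:</P>
<PRE>
```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

struct Matrix {
    int width, height, stride;   // stride = row pitch in elements
    double* data;                // not owned; a view into some buffer
};

// dest[x, y] += source[x, y] over the source's width x height region.
static void AddRows(const Matrix& dest, const Matrix& source)
{
    for (int y = 0; y < source.height; ++y)
        for (int x = 0; x < source.width; ++x)
            dest.data[y * dest.stride + x] += source.data[y * source.stride + x];
}

// Clips both views by |offset| and advances one base pointer, so the
// row-wise add itself never needs to know about the offset.
void AddToMatrix(const Matrix& dest, const Matrix& source, long offsetX, long offsetY)
{
    Matrix d = dest;             // copies the descriptor, not the data
    Matrix s = source;
    const int ax = static_cast<int>(std::labs(offsetX));
    const int ay = static_cast<int>(std::labs(offsetY));
    assert(ax <= s.width && ay <= s.height);
    d.width -= ax;  s.width -= ax;
    d.height -= ay; s.height -= ay;
    if (offsetX < 0) s.data += ax; else d.data += ax;
    // Assumption: Y mirrors X, with the shift measured in whole rows.
    if (offsetY < 0) s.data += static_cast<long>(ay) * s.stride;
    else             d.data += static_cast<long>(ay) * d.stride;
    AddRows(d, s);
}

// Hypothetical usage: A[x, y] += B[x+1, y] on 3x3 matrices (offsetX = -1).
std::vector<double> demo_shift()
{
    std::vector<double> a(9, 0.0);
    std::vector<double> b = {0, 1, 2, 3, 4, 5, 6, 7, 8};
    Matrix A = {3, 3, 3, a.data()};
    Matrix B = {3, 3, 3, b.data()};
    AddToMatrix(A, B, -1, 0);
    return a;
}
```
</PRE>
<P>Because the add runs row by row over the clipped width, the last column of A is left untouched instead of picking up the first element of the next row of B.</P>]]>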
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 13 Jun 2013 16:22:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983400#M17584</guid>
      <dc:creator>Henrik_A_</dc:creator>
      <dc:date>2013-06-13T16:22:03Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983401#M17585</link>
      <description>&amp;gt;&amp;gt;...
&amp;gt;&amp;gt;    Matrix clippedDestMatrix = *destMatrix;
&amp;gt;&amp;gt;    Matrix clippedSourceMatrix = *sourceMatrix;
&amp;gt;&amp;gt;...

You're creating &lt;STRONG&gt;local copies&lt;/STRONG&gt; of both matrices to do the rest of the processing, and of course that takes some time ( especially when the matrices are 8Kx8K or larger ). Why not have an additional member &lt;STRONG&gt;offset&lt;/STRONG&gt; in your base &lt;STRONG&gt;Matrix&lt;/STRONG&gt; struct?
...
struct &lt;STRONG&gt;Matrix&lt;/STRONG&gt;
{
    int width, height, stride, &lt;STRONG&gt;offset&lt;/STRONG&gt;;
    double *data;
};
...</description>
      <pubDate>Fri, 14 Jun 2013 01:07:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983401#M17585</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-14T01:07:47Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...especially when</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983402#M17586</link>
      <description>&amp;gt;&amp;gt;...especially when the matrices are 8Kx8K or larger...

This is just an example; I remember that your matrices are smaller ( ~0.5Kx0.5K ).</description>
      <pubDate>Fri, 14 Jun 2013 01:10:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983402#M17586</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-14T01:10:28Z</dc:date>
    </item>
    <item>
      <title>I'm not. The matrix struct</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983403#M17587</link>
      <description>&lt;P&gt;I'm not. The matrix struct just contains a pointer to the data, when I duplicate the matrix struct I just duplicate the pointer, not the memory containing the data itself. Your offset variable is equivalent to what I am doing when I modify the pointer in the matrix struct, except your method means every matrix manipulation function I write would have to know about the offset, my method means the functions don't know anything about the offset, they just are passed matrix structs with a modified width / base data pointer, and unusual stride.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Jun 2013 08:20:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983403#M17587</guid>
      <dc:creator>Henrik_A_</dc:creator>
      <dc:date>2013-06-14T08:20:13Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...My main problem is when</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983404#M17588</link>
      <description>&amp;gt;&amp;gt;...My main problem is when I have to offset one of the matrices by an odd number of columns and the other by
&amp;gt;&amp;gt;an even number of columns, then the data alignment can't be matched and I have to fall back to slower code...

Try adding some timing functions, like &lt;STRONG&gt;_rdtsc&lt;/STRONG&gt; ( intrinsic ), or &lt;STRONG&gt;GetTickCount&lt;/STRONG&gt; in the case of a Windows OS, to your code and compare the outputs in order to understand which part is responsible for the performance decrease. Since you have two cases, it won't be difficult to detect which part causes the problem.</description>
      <pubDate>Fri, 14 Jun 2013 14:10:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Best-function-for-inplace-matrix-addition-w-stride/m-p/983404#M17588</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-14T14:10:03Z</dc:date>
    </item>
  </channel>
</rss>

