Henrik_A_

Beginner

06-06-2013
04:15 AM

212 Views

Best function for inplace matrix addition (w. stride)

I often need to calculate the sum of a set of matrices or submatrices of a dataset. Unfortunately, the two matrices do not always have the same stride, because I am selectively using a subset of a large dataset. This means I have to resort to calculating the sum by hand. Alternatively, I could call vkadd or similar once per row, but I'm not sure how much overhead is involved in calling vkadd 500 or 1000 times for a 500x500 matrix.

I am aware of the mkl_?omatadd function, but the documentation states that the input and output arrays cannot overlap, which means I would need an extra temporary matrix. I would assume that calculating A = A + m * B works in place when the matrices are not transposed, but unless this can be guaranteed for all future versions I cannot use that approach.

Are there any other functions I have missed that could be used for this calculation?

13 Replies

Dmitry_B_Intel

Employee

06-06-2013
08:34 PM

Hi Henrik,

The BLAS level 1 functions ?axpy may help you, as they perform an in-place operation on vectors: y = a*x + y. When applied row by row (or column by column) in a loop, this operation can accommodate any combination of strides. The loop may be sped up by parallelizing it with '#pragma omp parallel for'.

Dima
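A minimal sketch of the row-by-row idea, assuming row-major storage and strides counted in elements. The names `axpyRow` and `addScaledRows` are hypothetical; `axpyRow` is a plain C++ stand-in with the same semantics as ?axpy for unit increments, so that the example compiles without MKL. With MKL, the inner call could be replaced by `cblas_daxpy(n, a, x, 1, y, 1)`.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for BLAS ?axpy with unit increments: y[i] += a * x[i].
// Hypothetical helper, not an MKL function.
void axpyRow(std::size_t n, double a, const double *x, double *y)
{
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

// In-place A = A + m * B for row-major matrices whose row strides may differ.
// Strides are counted in elements, not bytes.
void addScaledRows(double *A, std::size_t strideA,
                   const double *B, std::size_t strideB,
                   std::size_t rows, std::size_t cols, double m)
{
    assert(strideA >= cols && strideB >= cols);
    // Rows are independent, so an OpenMP '#pragma omp parallel for'
    // could be placed on this loop.
    for (std::size_t r = 0; r < rows; ++r)
        axpyRow(cols, m, B + r * strideB, A + r * strideA);
}
```

Because each row is addressed through its own base pointer, the padding elements between `cols` and the stride are never read or written.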

SergeyKostrov

Valued Contributor II

06-06-2013
10:09 PM

SergeyKostrov

Valued Contributor II

06-06-2013
10:24 PM

Henrik_A_

Beginner

06-07-2013
08:05 AM

Thanks for the replies.

Dmitry: I think that would be almost identical to using vkadd. The BLAS function has the additional scaling factor, but I am assuming it also contains an optimized case for unscaled addition.

Sergey: That code actually looks very similar to my current approach. I have a function which adds double vectors using unrolled SSE intrinsics, and I call that function row by row. Assuming sufficient compiler optimization, the resulting asm of your first function should look very similar (ignoring the missing special cases for lengths != 4 * N). My main problem arises when I have to offset one of the matrices by an odd number of columns and the other by an even number of columns: then the data alignment can't be matched and I have to fall back to slower code.
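The alignment problem can be illustrated with a small sketch. It assumes 8-byte doubles and 16-byte SSE alignment; the helper names `isAligned16`, `canCoAlign` and `addRow` are hypothetical, and the unrolled loop is plain C++ rather than the intrinsics mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if p sits on a 16-byte boundary, as aligned SSE loads/stores require.
bool isAligned16(const double *p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Two row pointers can both be brought onto 16-byte boundaries (by peeling
// the same number of leading elements) only if they share the same phase.
bool canCoAlign(const double *a, const double *b)
{
    return reinterpret_cast<std::uintptr_t>(a) % 16 ==
           reinterpret_cast<std::uintptr_t>(b) % 16;
}

// dst[i] += src[i], unrolled by 4, with a scalar tail for n % 4 != 0.
void addRow(double *dst, const double *src, std::size_t n)
{
    assert(dst != nullptr && src != nullptr);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        dst[i]     += src[i];
        dst[i + 1] += src[i + 1];
        dst[i + 2] += src[i + 2];
        dst[i + 3] += src[i + 3];
    }
    for (; i < n; ++i)  // remaining 0..3 elements
        dst[i] += src[i];
}
```

With 16-byte-aligned base arrays, an offset of one double moves a pointer by 8 bytes, so an odd offset on one matrix and an even offset on the other leaves the two pointers in different phases, and `canCoAlign` returns false.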

I must admit I haven't tested multithreading yet. I have been working under the assumption that the overhead of spinning up or switching to threads is larger than the savings for these small matrix sizes.

SergeyKostrov

Valued Contributor II

06-07-2013
04:01 PM

SergeyKostrov

Valued Contributor II

06-07-2013
06:08 PM

Henrik_A_

Beginner

06-10-2013
03:39 AM

I think you misunderstood what I meant by matrix offsets. The data for each image is in a single aligned array (e.g. 500x500 doubles aligned on a 16/32-byte boundary, along with sizeX, sizeY, stride), but my calculation occasionally requires me to shift the data.

For example, the normal matrix addition case is A'[x,y] = A[x,y] + B[x,y]. Here, alignment is fine. Also, since the strides of both matrices match and the elements between [sizeX ... stride] are unused, I can use a single vector addition to compute this.

However, if I am shifting the data by a column, this becomes A'[x,y] = A[x,y] + B[x+1, y]. This calculation can be simplified to a matrix addition of two 499x499 matrices, by shifting the start offset of B' by one element while keeping the stride the same. Now I have an aligned matrix A and an unaligned matrix B. Also, I can no longer just use a single vector addition, because this would corrupt the last column of A (in this example, for the last column, A'[x,y] would be A[x,y] + B[0, y+1]).
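The shifted case described above can be sketched as follows, assuming row-major storage, a shared stride counted in elements, and a non-negative column shift. The function name `addShiftedColumns` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Computes A'[x,y] = A[x,y] + B[x+shift, y] for x in [0, width - shift).
// Both matrices are row-major and share the same stride (in elements).
void addShiftedColumns(double *A, const double *B,
                       std::size_t width, std::size_t height,
                       std::size_t stride, std::size_t shift)
{
    assert(shift <= width && width <= stride);
    const double *shiftedB = B + shift;             // B' starts `shift` elements later
    const std::size_t clippedWidth = width - shift; // e.g. 499 for width 500, shift 1
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < clippedWidth; ++x)
            A[y * stride + x] += shiftedB[y * stride + x];
}
```

Looping per row over the clipped width leaves the last `shift` columns of A untouched. A single flat addition over the whole buffer would instead add the first elements of the next row of B into those columns.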

SergeyKostrov

Valued Contributor II

06-10-2013
06:27 PM

Henrik_A_

Beginner

06-13-2013
09:22 AM

Sure, pseudo C++

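#include <cstdlib>  // std::abs

struct Matrix
{
    int width, height, stride;  // stride: elements between the starts of consecutive rows
    double *data;
};

// Adds sourceMatrix into destMatrix in place, shifted by (offsetX, offsetY):
// dest[x + offsetX, y + offsetY] += source[x, y] over the overlapping region.
void AddToMatrix(Matrix *destMatrix, Matrix *sourceMatrix, long offsetX, long offsetY)
{
    // Skip parameter / size verification
    if (offsetX == 0 && offsetY == 0)
    {
        for (int y = 0; y < sourceMatrix->height; ++y)
            for (int x = 0; x < sourceMatrix->width; ++x)
                destMatrix->data[y*destMatrix->stride+x] += sourceMatrix->data[y*sourceMatrix->stride+x];
        return;
    }
    Matrix clippedDestMatrix = *destMatrix;
    Matrix clippedSourceMatrix = *sourceMatrix;
    if (offsetX != 0)
    {
        // Shrink both views to the overlap and advance the start of one of them
        clippedDestMatrix.width -= std::abs(offsetX);
        clippedSourceMatrix.width -= std::abs(offsetX);
        if (offsetX < 0)
            clippedSourceMatrix.data += -offsetX;
        else
            clippedDestMatrix.data += offsetX;
    }
    if (offsetY != 0)
    {
        // Ditto for Y, except that the start moves by whole rows (stride elements each)
        clippedDestMatrix.height -= std::abs(offsetY);
        clippedSourceMatrix.height -= std::abs(offsetY);
        if (offsetY < 0)
            clippedSourceMatrix.data += -offsetY * clippedSourceMatrix.stride;
        else
            clippedDestMatrix.data += offsetY * clippedDestMatrix.stride;
    }
    // The offsets are now folded into the data pointers, so add with zero offsets
    AddToMatrix(&clippedDestMatrix, &clippedSourceMatrix, 0, 0);
}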

SergeyKostrov

Valued Contributor II

06-13-2013
06:07 PM

SergeyKostrov

Valued Contributor II

06-13-2013
06:10 PM

Henrik_A_

Beginner

06-14-2013
01:20 AM

SergeyKostrov

Valued Contributor II

06-14-2013
07:10 AM
