Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.
Announcements
FPGA community forums and blogs have moved to the Altera Community. Existing Intel Community members can sign in with their current credentials.

Best function for inplace matrix addition (w. stride)

Henrik_A_
Beginner
1,783 Views

I often need to calculate the sum of a set of matrices or submatrices of a dataset. Unfortunately the two matrices do not always have the same stride, when I am selectively using a subset of a large dataset, which means I have to resort to calculating the sum by hand (alternatively, I could call vkadd or similar once per row, I'm not sure how much overhead this implies when calling vkadd 500 or 1000 times for a 500x500 matrix).

I am aware of the mkl_?omatadd function, but the documentation states that the input and output arrays cannot overlap, which means I would need an extra temporary matrix. While I would assume calculating A = A + m * B works inplace when not transposing matrices, unless this can be guaranteed for all future versions I cannot use that approach.

Are there any other functions which could be used for this calculation I have missed?

0 Kudos
13 Replies
Dmitry_B_Intel
Employee
1,783 Views

Hi Henrik,

BLAS level 1 functions ?axpy may help you, as they do in-place operation on vectors: y=a*x + y. When applied row-by-row (or col-by-col) in a loop, this operation can accomodate any combination of strides. The loop may be sped up by parallelization with '#pragma omp parallel for'.

Dima

0 Kudos
SergeyKostrov
Valued Contributor II
1,783 Views
>>I often need to calculate the sum of a set of matrices or submatrices of a dataset... Matrix additions and subtractions are at the core of Strassen's algorithm for matrix multiplication. I've spent a significant amount of time on implementation ( 4 different versions ) and optimization of these algorithms. I'd like to give you two really small examples: [ Version 1 - Template based compiled with /O2 or /O3 ( aggressive ) optimizations ] ... inline RTvoid Add( ... ) { #ifdef _RTMATRIXSET_DIAGNOSTICS RTuint64 uiClockS = CrtRdtsc(); #endif RTuint i; RTuint j; for( i = 0; i < uiSize; i++ ) { for( j = 0; j < uiSize; j += 4 ) { register T tS0 = tA[j ] + tB[j ]; register T tS1 = tA[j+1] + tB[j+1]; register T tS2 = tA[j+2] + tB[j+2]; register T tS3 = tA[j+3] + tB[j+3]; tC[j ] = tS0; tC[j+1] = tS1; tC[j+2] = tS2; tC[j+3] = tS3; } } #ifdef _RTMATRIXSET_DIAGNOSTICS RTuint64 uiClockE = CrtRdtsc(); CrtPrintf( RTU("Add - Completed in %.3f ms\n"), ( RTfloat )( uiClockE - uiClockS ) / 1000000.0f ); #endif }; ... and [ Version 2 - IPP based with ippsAdd_32f function ] ... inline RTvoid Add( ... ) { #ifdef _RTMATRIXSET_DIAGNOSTICS RTuint64 uiClockS = CrtRdtsc(); #endif RTuint i; for( i = 0; i < uiSize; i++ ) { ::ippsAdd_32f( ( const float * )&tA[0], ( const float * )&tB[0], ( float * )&tC[0], ( RTint )uiSize ); } #ifdef _RTMATRIXSET_DIAGNOSTICS RTuint64 uiClockE = CrtRdtsc(); CrtPrintf( RTU("Add - Completed in %.3f ms\n"), ( RTfloat )( uiClockE - uiClockS ) / 1000000.0f ); #endif }; ... After extensive testing on several hardware platforms, like Ivy Bridge, Atom, Pentium 4, I din't see a significant difference in performance between these two very simple functions. I could easily post real performance numbers for any mentioned platforms ( if you need, of course ). >>...I am aware of the mkl_?omatadd function, but the documentation states that the input and output arrays >>cannot overlap... I'm considering to try that function ( for tests only ).
0 Kudos
SergeyKostrov
Valued Contributor II
1,783 Views
>>...The loop may be sped up by parallelization with '#pragma omp parallel for'... It is a very useful for large matricies but there is a question here: Does it make sence to do it for two 500x500 matricies? I'll post tomorrow performance numbers for addition of two 512x512 matricies ( without OpenMP ) on Ivy Bridge system.
0 Kudos
Henrik_A_
Beginner
1,783 Views

Thanks for the replies.

Dmitry: I think that would be almost identical to using vkadd, the blas function has the additional scaling factor but I am assuming it also contains an optimized case for unscaled addition.

Sergey: That code actually looks very similar to my current approach - I have a function which does addition of double vectors using unrolled SSE intrinsics, and am calling that function on a row by row basis. Assuming sufficient compiler optimization the resulting asm of your first function should look very similar. (Ignoring the missing special cases for lengths != 4 * N). My main problem is when I have to offset one of the matrices by an odd number of columns and the other by an even number of columns, then the data alignment can'\t be matched and I have to fall back to slower code.

I must admit I havn't tested multithreading yet, I have been working under the assumption that the overhead for spinning up/switching to threads is larger than the savings for these small matrix sizes.

0 Kudos
SergeyKostrov
Valued Contributor II
1,783 Views
>>...My main problem is when I have to offset one of the matrices by an odd number of columns and the other by an even number of >>columns, then the data alignment can't be matched and I have to fall back to slower code... It is Not clear how you've created your matricies. I'd like to share some experience and two different versions are used: [ Version 1 ] Some template class for a Matrix { ... _RTALIGN32 T *m_ptData1D; _RTALIGN32 T **m_ptData2D; ... }; Some method for initialization ... m_ptData1D = ( T * )CrtMalloc( m_uiSize * sizeof( T ) ); if( m_ptData1D == RTnull ) return ( RTbool )RTfalse; m_ptData2D = ( T ** )CrtMalloc( m_uiRows * sizeof( T * ) ); if( m_ptData2D == RTnull ) return ( RTbool )RTfalse; T *ptData = m_ptData1D; for( RTuint i = 0; i < m_uiRows; i++ ) { m_ptData2D = ptData; ptData += m_uiCols; } ... As you can see there are two pointers, ptData1D and ptData2D, and the underlying 1D array for a 2D array is a Contiguos and Always Aligned. [ Version 2 ] Some template class for a Data set of two matricies { ... _RTALIGN32 T **Tmp[2]; ... }; Some method for initialization ... for( i = 0; i < 2; i++ ) { Tmp = ( T ** )CrtCalloc( uiSize, sizeof( T * ) ); if( Tmp == RTnull ) return ( RTbool )RTfalse; for( j = 0; j < uiSize; j++ ) { Tmp = ( T * )CrtCalloc( uiSize, sizeof( T ) ); if( Tmp == RTnull ) return ( RTbool )RTfalse; } } ... As you can see during initialization an array of pointers for rows is allocated and then every row of size uiSize with a number of elements of type T is allocated ( represents a 2-D matrix, or a 2-D data set, or a 2-D image ). In both cases all pointers are alligned and with agressive optimizations by C++ compilers ( any! ) speed ups are significant (!). I remember that Not optimized and Not alligned versions worked for about 29 minutes in some cases. When all optimizations are On the same code works in less then 3 minutes.
0 Kudos
SergeyKostrov
Valued Contributor II
1,783 Views
>>My main problem is when I have to offset one of the matrices by an odd number of columns and the other by an even >>number of columns, then the data alignment can'\t be matched and I have to fall back to slower code... Henrik, Let me know if you need a demo ( small test case ) that demonstrates how to use the Version 1 technique. That is, underlying 1D array for a 2D array is a Contiguos and Always Aligned.
0 Kudos
Henrik_A_
Beginner
1,783 Views

I think you misunderstood what I meant with matrix offsets. The data for each image is in a single aligned array (e.g. 500x500 doubles aligned on 16/32 byte boundary, along with sizeX, sizeY, stride), but my calculation occasionally requires me to shift the data.

For example, the normal matrix addition case is A'[x,y] = A[x,y] + B[x,y]. Here, alignment is fine, also since the strides of both matrices match and the elements between [sizeX ... stride] are unused, I can use vector addition to compute this.

However, if I am shifting the data by a column, this becomes A'[x,y] = A[x,y] + B[x+1, y]. This calculation can be simplified to a matrix addition of two 499x499 matrices, by shifting the start offset of B' by one element, while keeping the stride the same. Now I have an aligned matrix A and an unaligned matrix B. Also, I can no longer just use vector addition because this would corrupt the last column of A (In this example, A'[x,y] would be A[x,y] + B[0, y+1].

0 Kudos
SergeyKostrov
Valued Contributor II
1,783 Views
>>...However, if I am shifting the data by a column, this becomes A'[x,y] = A[x,y] + B[x+1, y]. This calculation can be >>simplified to a matrix addition of two 499x499 matrices, by shifting the start offset of B' by one element, while keeping >>the stride the same. Now I have an aligned matrix A and an unaligned matrix B... Would you be able to create a generic reproducer of the problem?
0 Kudos
Henrik_A_
Beginner
1,783 Views

Sure, pseudo C++

struct Matrix
{
    int width, height, stride;
    double *data;
};

void AddToMatrix(Matrix *destMatrix, Matrix *sourceMatrix, long offsetX, long offsetY)
{
    // Skip parameter / size verification
    if (offsetX == 0 && offsetY == 0)
    {
        for (unsigned long y=0;y<sourcematrix->height;++y)
            for (unsigned long x=0;x<sourcematrix->width;++x)
                destMatrix->data[y*destMatrix->stride+x] = destMatrix->data[y*destMatrix->stride+x] + sourceMatrix->data[y*sourceMatrix->stride+x];
        return;
    }
    Matrix clippedDestMatrix = *destMatrix;
    Matrix clippedSourceMatrix = *sourceMatrix;
    if (offsetX != 0)
    {
        clippedDestMatrix.width -= abs(offsetX);
    clippedSourceMatrix.width -= abs(offsetX);
        if (offsetX < 0)
        {
             clippedSourceMatrix.data = clippedSourceMatrix.data + (-offsetX);
        }
        else
        {
             clippedDestMatrix.data = clippedDestMatrix.data + offsetX;
        }
    }
    // ditto for Y
    AddToMatrix(&clippedDestMatrix, &clippedSourceMatrix);
}

0 Kudos
SergeyKostrov
Valued Contributor II
1,783 Views
>>... >> Matrix clippedDestMatrix = *destMatrix; >> Matrix clippedSourceMatrix = *sourceMatrix; >>... You're creating local copies for both matricies to do all the rest processing and of course it takes some time ( especially when matricies are 8Kx8K or larger ). Why wouldn't you have additional member offset in your base Matrix struct? ... struct Matrix { int width, height, stride, offset; double *data; }; ...
0 Kudos
SergeyKostrov
Valued Contributor II
1,783 Views
>>...especially when matricies are 8Kx8K or larger... This is just for example and I remember that your matricies are smaller ( ~0.5Kx0.5K ).
0 Kudos
Henrik_A_
Beginner
1,783 Views

I'm not. The matrix struct just contains a pointer to the data, when I duplicate the matrix struct I just duplicate the pointer, not the memory containing the data itself. Your offset variable is equivalent to what I am doing when I modify the pointer in the matrix struct, except your method means every matrix manipulation function I write would have to know about the offset, my method means the functions don't know anything about the offset, they just are passed matrix structs with a modified width / base data pointer, and unusual stride.

0 Kudos
SergeyKostrov
Valued Contributor II
1,783 Views
>>...My main problem is when I have to offset one of the matrices by an odd number of columns and the other by >>an even number of columns, then the data alignment can'\t be matched and I have to fall back to slower code... Try to add some timing functions, like _rdtsc ( intrinsic ), or GetTickCount in case of a Windows OS, in your codes and compare outputs in order to understand which part is responsible for a performace decrease. Since you have two cases it won't be difficult to detect which part causes that problem.
0 Kudos
Reply