topic I think you misunderstood in Intel® oneAPI Math Kernel Library

Best function for inplace matrix addition (w. stride)

Henrik_A_ — Thu, 06 Jun 2013 11:15:32 GMT

I often need to calculate the sum of a set of matrices or submatrices of a dataset. Unfortunately the two matrices do not always have the same stride, when I am selectively using a subset of a large dataset, which means I have to resort to calculating the sum by hand (alternatively, I could call vkadd or similar once per row, I'm not sure how much overhead this implies when calling vkadd 500 or 1000 times for a 500x500 matrix).

I am aware of the mkl_?omatadd function, but the documentation states that the input and output arrays cannot overlap, which means I would need an extra temporary matrix. While I would assume calculating A = A + m * B works inplace when not transposing matrices, unless this can be guaranteed for all future versions I cannot use that approach.

Are there any other functions which could be used for this calculation I have missed?

Hi Henrik,

Dmitry_B_Intel — Fri, 07 Jun 2013 03:34:47 GMT

Hi Henrik,

BLAS level 1 functions ?axpy may help you, as they do in-place operation on vectors: y=a*x + y. When applied row-by-row (or col-by-col) in a loop, this operation can accomodate any combination of strides. The loop may be sped up by parallelization with '#pragma omp parallel for'.

Dima

>>I often need to calculate

SergeyKostrov — Fri, 07 Jun 2013 05:09:00 GMT

>>I often need to calculate the sum of a set of matrices or submatrices of a dataset... Matrix additions and subtractions are at the core of Strassen's algorithm for matrix multiplication. I've spent a significant amount of time on implementation ( 4 different versions ) and optimization of these algorithms. I'd like to give you two really small examples: [ Version 1 - Template based compiled with /O2 or /O3 ( aggressive ) optimizations ] ... inline RTvoid Add( ... ) { #ifdef _RTMATRIXSET_DIAGNOSTICS RTuint64 uiClockS = CrtRdtsc(); #endif RTuint i; RTuint j; for( i = 0; i < uiSize; i++ ) { for( j = 0; j < uiSize; j += 4 ) { register T tS0 = tA[j ] + tB[j ]; register T tS1 = tA[j+1] + tB[j+1]; register T tS2 = tA[j+2] + tB[j+2]; register T tS3 = tA[j+3] + tB[j+3]; tC[j ] = tS0; tC[j+1] = tS1; tC[j+2] = tS2; tC[j+3] = tS3; } } #ifdef _RTMATRIXSET_DIAGNOSTICS RTuint64 uiClockE = CrtRdtsc(); CrtPrintf( RTU("Add - Completed in %.3f ms\n"), ( RTfloat )( uiClockE - uiClockS ) / 1000000.0f ); #endif }; ... and [ Version 2 - IPP based with ippsAdd_32f function ] ... inline RTvoid Add( ... ) { #ifdef _RTMATRIXSET_DIAGNOSTICS RTuint64 uiClockS = CrtRdtsc(); #endif RTuint i; for( i = 0; i < uiSize; i++ ) { ::ippsAdd_32f( ( const float * )&tA[0], ( const float * )&tB[0], ( float * )&tC[0], ( RTint )uiSize ); } #ifdef _RTMATRIXSET_DIAGNOSTICS RTuint64 uiClockE = CrtRdtsc(); CrtPrintf( RTU("Add - Completed in %.3f ms\n"), ( RTfloat )( uiClockE - uiClockS ) / 1000000.0f ); #endif }; ... After extensive testing on several hardware platforms, like Ivy Bridge, Atom, Pentium 4, I din't see a significant difference in performance between these two very simple functions. I could easily post real performance numbers for any mentioned platforms ( if you need, of course ). >>...I am aware of the mkl_?omatadd function, but the documentation states that the input and output arrays >>cannot overlap... I'm considering to try that function ( for tests only ).

>>...The loop may be sped up

SergeyKostrov — Fri, 07 Jun 2013 05:24:52 GMT

>>...The loop may be sped up by parallelization with '#pragma omp parallel for'... It is a very useful for large matricies but there is a question here: Does it make sence to do it for two 500x500 matricies? I'll post tomorrow performance numbers for addition of two 512x512 matricies ( without OpenMP ) on Ivy Bridge system.

Thanks for the replies.

Henrik_A_ — Fri, 07 Jun 2013 15:05:35 GMT

Thanks for the replies.

Dmitry: I think that would be almost identical to using vkadd, the blas function has the additional scaling factor but I am assuming it also contains an optimized case for unscaled addition.

Sergey: That code actually looks very similar to my current approach - I have a function which does addition of double vectors using unrolled SSE intrinsics, and am calling that function on a row by row basis. Assuming sufficient compiler optimization the resulting asm of your first function should look very similar. (Ignoring the missing special cases for lengths != 4 * N). My main problem is when I have to offset one of the matrices by an odd number of columns and the other by an even number of columns, then the data alignment can'\t be matched and I have to fall back to slower code.

I must admit I havn't tested multithreading yet, I have been working under the assumption that the overhead for spinning up/switching to threads is larger than the savings for these small matrix sizes.

>>...My main problem is when

SergeyKostrov — Fri, 07 Jun 2013 23:01:00 GMT

>>...My main problem is when I have to offset one of the matrices by an odd number of columns and the other by an even number of >>columns, then the data alignment can't be matched and I have to fall back to slower code... It is Not clear how you've created your matricies. I'd like to share some experience and two different versions are used: [ Version 1 ] Some template class for a Matrix { ... _RTALIGN32 T *m_ptData1D; _RTALIGN32 T **m_ptData2D; ... }; Some method for initialization ... m_ptData1D = ( T * )CrtMalloc( m_uiSize * sizeof( T ) ); if( m_ptData1D == RTnull ) return ( RTbool )RTfalse; m_ptData2D = ( T ** )CrtMalloc( m_uiRows * sizeof( T * ) ); if( m_ptData2D == RTnull ) return ( RTbool )RTfalse; T *ptData = m_ptData1D; for( RTuint i = 0; i < m_uiRows; i++ ) { m_ptData2D = ptData; ptData += m_uiCols; } ... As you can see there are two pointers, ptData1D and ptData2D, and the underlying 1D array for a 2D array is a Contiguos and Always Aligned. [ Version 2 ] Some template class for a Data set of two matricies { ... _RTALIGN32 T **Tmp[2]; ... }; Some method for initialization ... for( i = 0; i < 2; i++ ) { Tmp = ( T ** )CrtCalloc( uiSize, sizeof( T * ) ); if( Tmp == RTnull ) return ( RTbool )RTfalse; for( j = 0; j < uiSize; j++ ) { Tmp = ( T * )CrtCalloc( uiSize, sizeof( T ) ); if( Tmp == RTnull ) return ( RTbool )RTfalse; } } ... As you can see during initialization an array of pointers for rows is allocated and then every row of size uiSize with a number of elements of type T is allocated ( represents a 2-D matrix, or a 2-D data set, or a 2-D image ). In both cases all pointers are alligned and with agressive optimizations by C++ compilers ( any! ) speed ups are significant (!). I remember that Not optimized and Not alligned versions worked for about 29 minutes in some cases. When all optimizations are On the same code works in less then 3 minutes.

>>My main problem is when I

SergeyKostrov — Sat, 08 Jun 2013 01:08:21 GMT

>>My main problem is when I have to offset one of the matrices by an odd number of columns and the other by an even >>number of columns, then the data alignment can'\t be matched and I have to fall back to slower code... Henrik, Let me know if you need a demo ( small test case ) that demonstrates how to use the Version 1 technique. That is, underlying 1D array for a 2D array is a Contiguos and Always Aligned.

I think you misunderstood

Henrik_A_ — Mon, 10 Jun 2013 10:39:04 GMT

I think you misunderstood what I meant with matrix offsets. The data for each image is in a single aligned array (e.g. 500x500 doubles aligned on 16/32 byte boundary, along with sizeX, sizeY, stride), but my calculation occasionally requires me to shift the data.

For example, the normal matrix addition case is A'[x,y] = A[x,y] + B[x,y]. Here, alignment is fine, also since the strides of both matrices match and the elements between [sizeX ... stride] are unused, I can use vector addition to compute this.

However, if I am shifting the data by a column, this becomes A'[x,y] = A[x,y] + B[x+1, y]. This calculation can be simplified to a matrix addition of two 499x499 matrices, by shifting the start offset of B' by one element, while keeping the stride the same. Now I have an aligned matrix A and an unaligned matrix B. Also, I can no longer just use vector addition because this would corrupt the last column of A (In this example, A'[x,y] would be A[x,y] + B[0, y+1].

>>...However, if I am

SergeyKostrov — Tue, 11 Jun 2013 01:27:05 GMT

>>...However, if I am shifting the data by a column, this becomes A'[x,y] = A[x,y] + B[x+1, y]. This calculation can be >>simplified to a matrix addition of two 499x499 matrices, by shifting the start offset of B' by one element, while keeping >>the stride the same. Now I have an aligned matrix A and an unaligned matrix B... Would you be able to create a generic reproducer of the problem?

Sure, pseudo C++

Henrik_A_ — Thu, 13 Jun 2013 16:22:03 GMT

Sure, pseudo C++

struct Matrix
{
int width, height, stride;
double *data;
};

void AddToMatrix(Matrix *destMatrix, Matrix *sourceMatrix, long offsetX, long offsetY)
{
    // Skip parameter / size verification
    if (offsetX == 0 && offsetY == 0)
    {
        for (unsigned long y=0;y<sourcematrix->height;++y)
            for (unsigned long x=0;x<sourcematrix->width;++x)
                destMatrix->data[y*destMatrix->stride+x] = destMatrix->data[y*destMatrix->stride+x] + sourceMatrix->data[y*sourceMatrix->stride+x];
        return;
    }
    Matrix clippedDestMatrix = *destMatrix;
    Matrix clippedSourceMatrix = *sourceMatrix;
    if (offsetX != 0)
    {
        clippedDestMatrix.width -= abs(offsetX);
   clippedSourceMatrix.width -= abs(offsetX);
        if (offsetX < 0)
        {
             clippedSourceMatrix.data = clippedSourceMatrix.data + (-offsetX);
        }
        else
        {
             clippedDestMatrix.data = clippedDestMatrix.data + offsetX;
        }
    }
    // ditto for Y
    AddToMatrix(&clippedDestMatrix, &clippedSourceMatrix);
}

>>...

SergeyKostrov — Fri, 14 Jun 2013 01:07:47 GMT

>>... >> Matrix clippedDestMatrix = *destMatrix; >> Matrix clippedSourceMatrix = *sourceMatrix; >>... You're creating local copies for both matricies to do all the rest processing and of course it takes some time ( especially when matricies are 8Kx8K or larger ). Why wouldn't you have additional member offset in your base Matrix struct? ... struct Matrix { int width, height, stride, offset; double *data; }; ...

>>...especially when

SergeyKostrov — Fri, 14 Jun 2013 01:10:28 GMT

>>...especially when matricies are 8Kx8K or larger... This is just for example and I remember that your matricies are smaller ( ~0.5Kx0.5K ).

I'm not. The matrix struct

Henrik_A_ — Fri, 14 Jun 2013 08:20:13 GMT

I'm not. The matrix struct just contains a pointer to the data, when I duplicate the matrix struct I just duplicate the pointer, not the memory containing the data itself. Your offset variable is equivalent to what I am doing when I modify the pointer in the matrix struct, except your method means every matrix manipulation function I write would have to know about the offset, my method means the functions don't know anything about the offset, they just are passed matrix structs with a modified width / base data pointer, and unusual stride.

>>...My main problem is when

SergeyKostrov — Fri, 14 Jun 2013 14:10:03 GMT

>>...My main problem is when I have to offset one of the matrices by an odd number of columns and the other by >>an even number of columns, then the data alignment can'\t be matched and I have to fall back to slower code... Try to add some timing functions, like _rdtsc ( intrinsic ), or GetTickCount in case of a Windows OS, in your codes and compare outputs in order to understand which part is responsible for a performace decrease. Since you have two cases it won't be difficult to detect which part causes that problem.