topic Re: Need help to improve efficiency in summation of numbers in Intel® oneAPI Math Kernel Library

Need help to improve efficiency in summation of numbers

henrikandresen — Wed, 20 Aug 2008 11:51:02 GMT

Dear All

So, a description of the setup. I have calculated a large amount of data, which can be viewed as a matrix 'V' of NxM values. I also have another matrix 'W' of size Mx1. The matrices are allocated using new double[N*M] and new double[M] respectively. Part of the result I'm looking for is:

result = V * W

This gives me a vector of Nx1 values.

If I have e.g. N = 5000 and M = 64, this takes around 20 ms, which I find way too slow. I have implemented it using the following code, which also scales each value with another vector. Sorry in advance if the code has to be in a special box, but I can't use the insert-code-from-clipboard button.

for( nSample = 0; nSample < pLine->nPoints; nSample++ ){
	nBFPointIndex = nLine*pDataParam->pitchBeamform + nSample;
	pDataParam->pBeamform_r[nBFPointIndex] = cblas_ddot( nRcvElements, &(pSample_r[nSample]), pLine->nPoints, &(pRcvWeight[nSample]), pLine->nPoints );
	pDataParam->pBeamform_r[nBFPointIndex] *= pXmtWeight[nSample];
}


Is this simply a horrible implementation, or is it how I organize my data? Or is it a reasonable amount of time for moving this data around?

Thank you for your time

/Henrik Andresen

P.S. I use ver. 9.x of the MKL, so I don't have the vsMul function available, and I would prefer not to change versions at the moment. Also, it is linked against the single-threaded version of the libraries, if this has any influence.

Re: Need help to improve efficiency in summation of numbers

henrikandresen — Wed, 20 Aug 2008 13:16:33 GMT

Hey again

I actually found, following a recommendation, that a simple for-loop implementation is much faster than what I had before.

Is it simply that I'm trying to use a function for something it was not intended for?

/Henrik
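The thread does not show the for-loop that was actually used, so the following is only a sketch of what such a loop might look like. The stride argument in the posted cblas_ddot call (pLine->nPoints) implies that the samples of one receive element are contiguous, so every dot product walks memory in steps of nPoints doubles. Swapping the loop order lets the inner loop run over contiguous memory, which is probably friendlier to the cache and to the vectorizer. The function name beamform_line and the use of std::vector are assumptions made for illustration, not code from the post.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the thread): weighted sum over receive elements.
// Layout assumption, taken from the stride in the posted cblas_ddot call:
//   samples[e * nPoints + s]   = sample s of receive element e
//   rcvWeight[e * nPoints + s] = receive weight for that same entry
std::vector<double> beamform_line(const std::vector<double>& samples,
                                  const std::vector<double>& rcvWeight,
                                  const std::vector<double>& xmtWeight,
                                  std::size_t nElements,
                                  std::size_t nPoints)
{
    std::vector<double> out(nPoints, 0.0);

    // Outer loop over elements, inner loop over samples:
    // the inner loop reads and writes contiguous memory (unit stride).
    for (std::size_t e = 0; e < nElements; ++e) {
        const double* x = samples.data() + e * nPoints;
        const double* w = rcvWeight.data() + e * nPoints;
        for (std::size_t s = 0; s < nPoints; ++s) {
            out[s] += x[s] * w[s];
        }
    }

    // Scale each accumulated sample by its transmit weight.
    for (std::size_t s = 0; s < nPoints; ++s) {
        out[s] *= xmtWeight[s];
    }
    return out;
}
```

With two elements and three samples, calling beamform_line(samples, rcvWeight, xmtWeight, 2, 3) returns one value per sample. Each value is the sum over both elements of sample times receive weight, multiplied by the transmit weight for that sample.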

Re: Need help to improve efficiency in summation of numbers

TimP — Wed, 20 Aug 2008 14:23:23 GMT

A compiler with auto-vectorization and OpenMP/parallel should always be able to at least equal the performance of BLAS ?dot, although it is not a simple question when both vectorization and threading are involved. Among the advantages of writing your own loop is that many of the cases taken care of at run time by BLAS are eliminated at compile time.
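As a rough illustration of the compile-time point, here is a sketch of a dot product where the length and the unit stride are fixed at compile time, so no run-time checks on length or stride are needed. It assumes V is stored row by row (M contiguous values per row), which differs from the layout in the original post. The names dot_fixed and matvec_fixed are made up for this example. Whether the compiler vectorizes the reduction also depends on its floating-point settings, since vectorizing reorders the additions.

```cpp
#include <cstddef>

// Hypothetical illustration: the length M is a template parameter and the
// stride is always 1, so the compiler knows the trip count at compile time.
template <std::size_t M>
double dot_fixed(const double* a, const double* b)
{
    double sum = 0.0;
    for (std::size_t j = 0; j < M; ++j) {
        sum += a[j] * b[j];
    }
    return sum;
}

// result = V * W for an N x M matrix V stored row by row (an assumption)
// and a vector W of length M.
template <std::size_t M>
void matvec_fixed(const double* V, const double* W, double* result, std::size_t N)
{
    for (std::size_t i = 0; i < N; ++i) {
        result[i] = dot_fixed<M>(V + i * M, W);
    }
}
```

For the sizes mentioned in the thread, this would be called as matvec_fixed<64>(V, W, result, 5000).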