Dear All
So, a description of the setup: I have calculated a large amount of data, which can be viewed as a matrix 'V' of NxM values. I also have another matrix 'W' of size Mx1. Both are allocated with new double[N*M] and new double[M], respectively. The partial result I'm looking for is:
result = V * W
This gives me a vector of Nx1 values.
If I have e.g. N = 5000 and M = 64, this takes around 20 ms, which I find way too slow. I have implemented it using the following code, which also scales each value with another vector. Sorry in advance if the code has to be in a special box, but I can't use the insert-code-from-clipboard button.
for( nSample = 0; nSample < pLine->nPoints; nSample++ ){
    nBFPointIndex = nLine*pDataParam->pitchBeamform + nSample;
    /* dot product over the M receive elements; the stride pLine->nPoints
       walks down a column of the sample and weight arrays */
    pDataParam->pBeamform_r[nBFPointIndex] = cblas_ddot( nRcvElements,
        &(pSample_r[nSample]), pLine->nPoints,
        &(pRcvWeight[nSample]), pLine->nPoints );
    /* scale the result with the transmit weight for this sample */
    pDataParam->pBeamform_r[nBFPointIndex] *= pXmtWeight[nSample];
}
Is this simply a horrible implementation, or is it down to how I organize my data? Or is this a reasonable amount of time for moving this much data around?
Thank you for your time
/Henrik Andresen
P.S. I use version 9.x of the MKL, so I don't have the vsMul function available, and would rather not change versions at the moment. Also, it is linked against the single-threaded version of the libraries, if that has any influence.
2 Replies
Hey again
Following a recommendation, I found that a simple for-loop implementation is much faster than my previous version.
Is it simply that I'm trying to use a function for something it was not intended for?
/Henrik
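A hand-written version of the scaled matrix-vector product might look like the sketch below. The names are placeholders, and it assumes the data is laid out so that each sample's M elements are contiguous, the opposite of the strided access in the ddot code above; contiguous access is part of why a plain loop can win here:

```c
#include <stddef.h>

/* y[i] = scale[i] * sum_k V[i*M + k] * W[k], i.e. a scaled
   matrix-vector product as an explicit double loop. With a small,
   fixed inner trip count (M = 64 in the post), the compiler can
   unroll and vectorize this without any of the run-time dispatch
   a general BLAS routine has to perform. */
void matvec_loop(size_t N, size_t M,
                 const double *V, const double *W,
                 const double *scale, double *y)
{
    for (size_t i = 0; i < N; i++) {
        double acc = 0.0;
        for (size_t k = 0; k < M; k++)
            acc += V[i * M + k] * W[k];
        y[i] = acc * scale[i];
    }
}
```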
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
A compiler with auto-vectorization and OpenMP/parallel should always be able to at least match the performance of BLAS ?dot, although it is not a simple question when both vectorization and threading are involved. Among the advantages of writing your own loop is that many of the cases BLAS handles at run time are eliminated at compile time.
