Dear All
So, a description of the setup: I have calculated a large amount of data, which can be viewed as a matrix 'V' of NxM values. I also have another matrix 'W' of size Mx1. Both are allocated with new double[N*M] and new double[M], respectively. The partial result I'm looking for is:
result = V * W
This gives me a vector of Nx1 values.
If I have e.g. N = 5000 and M = 64, this takes around 20 ms, which I find way too slow. I have implemented it using the following code, which also scales each value with another vector. Sorry in advance if the code has to be in a special box, but I can't use the insert-code-from-clipboard button.
for( nSample = 0; nSample < pLine->nPoints; nSample++ ){
    nBFPointIndex = nLine*pDataParam->pitchBeamform + nSample;
    /* dot product over the M receive elements; the stride pLine->nPoints
       walks down a column of the sample and weight arrays */
    pDataParam->pBeamform_r[nBFPointIndex] = cblas_ddot( nRcvElements,
        &(pSample_r[nSample]), pLine->nPoints,
        &(pRcvWeight[nSample]), pLine->nPoints );
    /* scale the result with the transmit weight for this sample */
    pDataParam->pBeamform_r[nBFPointIndex] *= pXmtWeight[nSample];
}
Is this simply a horrible implementation, or is it down to how I organize my data? Or is this a reasonable amount of time for moving this much data around?
Thank you for your time
/Henrik Andresen
P.S. I use version 9.x of the MKL, so I don't have the vsMul function available, and would rather not change versions at the moment. Also, it is linked against the single-threaded version of the libraries, if that has any influence.
2 Replies
Hey again
Following a recommendation, I found that a simple for-loop implementation is much faster than my previous version.
Is it simply that I'm trying to use a function for something it was not intended for?
/Henrik
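A hand-written version of the scaled matrix-vector product might look like the sketch below. The names are placeholders, and it assumes the data is laid out so that each sample's M elements are contiguous, the opposite of the strided access in the ddot code above; contiguous access is part of why a plain loop can win here:

```c
#include <stddef.h>

/* y[i] = scale[i] * sum_k V[i*M + k] * W[k], i.e. a scaled
   matrix-vector product as an explicit double loop. With a small,
   fixed inner trip count (M = 64 in the post), the compiler can
   unroll and vectorize this without any of the run-time dispatch
   a general BLAS routine has to perform. */
void matvec_loop(size_t N, size_t M,
                 const double *V, const double *W,
                 const double *scale, double *y)
{
    for (size_t i = 0; i < N; i++) {
        double acc = 0.0;
        for (size_t k = 0; k < M; k++)
            acc += V[i * M + k] * W[k];
        y[i] = acc * scale[i];
    }
}
```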
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
A compiler with auto-vectorization and OpenMP/parallel should always be able to at least match the performance of BLAS ?dot, although it is not a simple question when both vectorization and threading are involved. Among the advantages of writing your own loop is that many of the cases BLAS handles at run time are eliminated at compile time.
