Hi all,
I am evaluating Intel MKL for use in financial applications (Monte Carlo etc.). I get good speedups for random number generation, but for matrix-vector multiplication I only see around 10%, even though I would expect much more.
My timings are:
For n=2000, ITERATIONS = 1000:
MKL dgemv: 10.656 sec
Naive C++: 11.782 sec
For n=5000, ITERATIONS = 200:
MKL dgemv: 12.828 sec
Naive C++: 13.985 sec
I have included the code below.
Microsoft Visual C++ 8
Intel Core Duo CPU, T2400, 1.83 GHz, 0.99 GB of RAM
Lenovo X60 laptop
Any hints on how to improve the performance?
Best regards,
Niels
#include <iostream>
#include <ctime>
using namespace std;
#include "mkl.h"

void testMatrixMul()
{
    unsigned n = 2000;
    const unsigned ITERATIONS = 1000;
    double* A = new double[n * n];
    double* x = new double[n];
    double* y = new double[n];

    for (unsigned i = 0; i < n; ++i) {
        for (unsigned j = 0; j < n; ++j)
            *(A + i * n + j) = 0.5 * i + 0.7 * j;
        *(x + i) = 0.5 * i;
        *(y + i) = 0.0;
    }

    cout << "Start" << endl;
    double startTime = static_cast<double>(clock());
    // y = alpha * A * x + beta * y
    for (unsigned k = 0; k < ITERATIONS; ++k) {
        cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                    1.0, A, n,
                    x, 1,
                    1.0, y, 1);
    }
    double endTime = static_cast<double>(clock());
    double secondsElapsed = (endTime - startTime) / double(CLOCKS_PER_SEC);
    cout << "End" << endl;
    cout << secondsElapsed << endl;
    cout << endl;

    /////

    for (unsigned i = 0; i < n; ++i) {
        for (unsigned j = 0; j < n; ++j)
            *(A + i * n + j) = 0.5 * i + 0.7 * j;
        *(x + i) = 0.5 * i;
        *(y + i) = 0.0;
    }

    cout << "Start" << endl;
    startTime = static_cast<double>(clock());
    // y = alpha * A * x + beta * y
    for (unsigned k = 0; k < ITERATIONS; ++k) {
        for (unsigned i = 0; i < n; ++i) {
            double sum = 0.0;
            for (unsigned j = 0; j < n; ++j)
                sum += *(A + i * n + j) * *(x + j);
            *(y + i) = 1.0 * sum + 1.0 * *(y + i);
        }
    }
    endTime = static_cast<double>(clock());
    secondsElapsed = (endTime - startTime) / double(CLOCKS_PER_SEC);
    cout << "End" << endl;
    cout << secondsElapsed << endl;
    cout << endl;

    delete[] A;
    delete[] x;
    delete[] y;
}

int main()
{
    testMatrixMul();
}
To investigate further, a profiler such as PTU (see WhatIf forum) would be useful.
Thank you for the quick reply.
I have not enabled thread support, as the code will eventually run on a grid with multiple engines per machine, so we will not benefit from threading there. In any case, I cannot seem to make threading work for gemv, whereas for gemm it works fine.
I tried with n = 14000, ITERATIONS = 1 and get:
MKL: 0.547 seconds
C++: 0.594 seconds
So it does not seem that the problem is that the ITERATIONS loop is optimised away.
Also, other tests I have run using other matrix-vector multiplication functions from MKL and IPP seem to give only about a 10% performance improvement. Is this all I can expect, or should I be able to get better numbers? I have looked in the documentation, but I cannot find any benchmarks saying how much I should expect. (I found a table for the random number generator in this document: http://software.intel.com/en-us/articles/monte-carlo-simulation-using-various-industry-library-solutions)
Can you point me to some document giving benchmarks? Or do you have other suggestions for improving the performance of matrix-vector multiplication?
Best regards,
Niels
Your observation about ?gemv generally agrees with mine: there don't appear to be optimizations in MKL beyond what you get by compiling the public source code with a vectorizing compiler. In fact, MKL is bug-compatible with the public source with respect to the problems that occur with zero-length arrays. If you make your test cases large enough to be dominated by memory bandwidth, even the advantage of vectorization diminishes.
I believe you are correct, that there is little advantage to threading within ?gemv until the vector length is quite large (thousands), but then the speedup may be limited by memory bandwidth. You may have better opportunities for parallelism at a higher level in your application.
The greatest advantages of MKL dgemm over public source are likely to be where OpenMP parallel and cache blocking come into play. It uses unroll-and-jam optimization, which is inhibited by if(.... != 0) branching in the public source, so the public source is more optimized to the semi-sparse cases where loops can be skipped on account of zero operands. The advantage gained by unroll-and-jam is larger on Core architecture than on previous Intel architectures.
Hi Tim
Thank you for the answers.
I read the benchmark article. It is interesting, but it only compares performance of MKL 6 to MKL 6.1, but doesn't say how good it is compared to a standard C++ implementation compiled with a good compiler.
As for threading of GEMV: I found a place in the documentation (MKL User's Guide, Managing Performance and Memory) which indirectly says that GEMV is not threaded: "Intel MKL is threaded in a number of places: direct sparse solver, LAPACK (*GETRF, *POTRF, *GBTRF, *GEQRF, *ORMQR, *STEQR, *BDSQR, *SPTRF, *SPTRS, *HPTRF, *HPTRS, *PPTRF, *PPTRS routines), all Level 3 BLAS, Sparse BLAS matrix-vector and matrix-matrix multiply routines for the compressed sparse row and diagonal formats, VML, and all FFTs (except 1D transformations when DFTI_NUMBER_OF_TRANSFORMS=1 and sizes are not power of two)."
But threading is not that relevant to me anyway as the calculations will be "threaded" using grid computing.
I guess I can conclude that I can expect only about a 10% performance gain for GEMV compared to standard C++ code compiled with Visual C++ 2005.
Best regards,
Niels
http://www3.intel.com/cd/software/products/asmo-na/eng/307757.htm
I haven't figured out how to view it on linux.
The MKL team has prepared also the following pages:
http://www3.intel.com/cd/software/products/asmo-na/eng/266858.htm
http://www3.intel.com/cd/software/products/asmo-na/eng/266861.htm
http://www3.intel.com/cd/software/products/asmo-na/eng/266852.htm
http://www3.intel.com/cd/software/products/asmo-na/eng/266863.htm
http://www3.intel.com/cd/software/products/asmo-na/eng/266864.htm
http://www3.intel.com/cd/software/products/asmo-na/eng/266857.htm
Hi,
I have tested the code on my machine:
6.875 sec for MKL 10
22.312 sec for the VC++ loop
System I am using:
Windows XP
Intel Core 2 Duo E6750 @ 2.66 GHz
RAM 3.56 GB
VC++ 9.0
MKL 10.0.2.019