Hi all,
I am evaluating Intel MKL for use in financial applications (Monte Carlo etc.). I get good speedups for random number generation, but for matrix-vector multiplication I only see around 10%, even though I would expect much more.
My timings are:
For n=2000, ITERATIONS = 1000:
MKL dgemv: 10.656 sec
Naive C++: 11.782 sec
For n=5000, ITERATIONS = 200:
MKL dgemv: 12.828 sec
Naive C++: 13.985 sec
I have included the code below.
Microsoft Visual C++ 8
Intel Core Duo CPU, T2400, 1.83 GHz, 0.99 GB of RAM
Lenovo X60 laptop
Any hints on how to improve the performance?
Best regards,
Niels
#include <iostream>
#include <ctime>
using namespace std;
#include "mkl.h"

void testMatrixMul() {
    const unsigned n = 2000;
    const unsigned ITERATIONS = 1000;
    double* A = new double[n * n];
    double* x = new double[n];
    double* y = new double[n];
    for (unsigned i = 0; i < n; ++i) {
        for (unsigned j = 0; j < n; ++j)
            *(A + i * n + j) = 0.5 * i + 0.7 * j;
        *(x + i) = 0.5 * i;
        *(y + i) = 0.0;
    }
    cout << "Start" << endl;
    double startTime = static_cast<double>(clock());
    // y = alpha * A * x + beta * y
    for (unsigned k = 0; k < ITERATIONS; ++k) {
        cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                    1.0, A, n,
                    x, 1,
                    1.0, y, 1);
    }
    double endTime = static_cast<double>(clock());
    double secondsElapsed = (endTime - startTime) / double(CLOCKS_PER_SEC);
    cout << "End" << endl;
    cout << secondsElapsed << endl;
    cout << endl;

    ///// Naive C++ version
    for (unsigned i = 0; i < n; ++i) {
        for (unsigned j = 0; j < n; ++j)
            *(A + i * n + j) = 0.5 * i + 0.7 * j;
        *(x + i) = 0.5 * i;
        *(y + i) = 0.0;
    }
    cout << "Start" << endl;
    startTime = static_cast<double>(clock());
    // y = alpha * A * x + beta * y
    for (unsigned k = 0; k < ITERATIONS; ++k) {
        for (unsigned i = 0; i < n; ++i) {
            double sum = 0.0;
            for (unsigned j = 0; j < n; ++j)
                sum += *(A + i * n + j) * *(x + j);
            *(y + i) = 1.0 * sum + 1.0 * *(y + i);
        }
    }
    endTime = static_cast<double>(clock());
    secondsElapsed = (endTime - startTime) / double(CLOCKS_PER_SEC);
    cout << "End" << endl;
    cout << secondsElapsed << endl;
    cout << endl;

    delete[] A;
    delete[] x;
    delete[] y;
}

int main() {
    testMatrixMul();
}
To investigate further, a profiler such as PTU (see WhatIf forum) would be useful.
Thank you for the quick reply.
I have not enabled thread support because the code will eventually run on a grid with multiple engines per machine, so we will not benefit from threading there. In any case, I cannot seem to make threading work for gemv, whereas for gemm it works fine.
I tried with n = 14000, ITERATIONS = 1 and get
MKL: 0.547 seconds
C++: 0.594 seconds
So it does not seem that the problem is the ITERATIONS loop being optimised away.
Also, other tests I have run using other matrix-vector multiplication functions from MKL and IPP seem to give only about a 10% performance improvement. Is this all I can expect, or should I be able to get better numbers? I have looked in the documentation, but I cannot find any benchmarks saying how much I should expect. (I found a table for the random number generator in this document: http://software.intel.com/en-us/articles/monte-carlo-simulation-using-various-industry-library-solutions)
Can you point me to a document giving benchmarks? Or do you have other suggestions for improving the performance of matrix-vector multiplication?
Best regards,
Niels
Your observation about ?gemv generally agrees with mine, that there don't appear to be optimizations in MKL beyond what you get by compiling the public source code with a vectorizing compiler. In fact, the MKL is bug compatible with the public source, as to the problems which occur with zero length arrays. If you make your test cases large enough to be dominated by memory bandwidth, even the advantage of vectorization diminishes.
I believe you are correct, that there is little advantage to threading within ?gemv until the vector length is quite large (thousands), but then the speedup may be limited by memory bandwidth. You may have better opportunities for parallelism at a higher level in your application.
The greatest advantages of MKL dgemm over public source are likely to be where OpenMP parallel and cache blocking come into play. It uses unroll-and-jam optimization, which is inhibited by if(.... != 0) branching in the public source, so the public source is more optimized to the semi-sparse cases where loops can be skipped on account of zero operands. The advantage gained by unroll-and-jam is larger on Core architecture than on previous Intel architectures.
Hi Tim
Thank you for the answers.
I read the benchmark article. It is interesting, but it only compares the performance of MKL 6 to MKL 6.1; it doesn't say how good either is compared to a standard C++ implementation compiled with a good compiler.
As for threading of GEMV: I found a place in the documentation (MKL User's Guide, "Managing Performance and Memory") which indirectly says that GEMV is not threaded: "Intel MKL is threaded in a number of places: direct sparse solver, LAPACK (*GETRF, *POTRF, *GBTRF, *GEQRF, *ORMQR, *STEQR, *BDSQR, *SPTRF, *SPTRS, *HPTRF, *HPTRS, *PPTRF, *PPTRS routines), all Level 3 BLAS, Sparse BLAS matrix-vector and matrix-matrix multiply routines for the compressed sparse row and diagonal formats, VML, and all FFTs (except 1D transformations when DFTI_NUMBER_OF_TRANSFORMS=1 and sizes are not power of two)."
But threading is not that relevant to me anyway as the calculations will be "threaded" using grid computing.
I guess I can conclude that I can only expect to see a 10% performance gain on GEMV compared to standard C++ code compiled with Visual C++ 2005.
Best regards,
Niels
http://www3.intel.com/cd/software/products/asmo-na/eng/307757.htm
I haven't figured out how to view it on Linux.
The MKL team has prepared also the following pages:
http://www3.intel.com/cd/software/products/asmo-na/eng/266858.htm
http://www3.intel.com/cd/software/products/asmo-na/eng/266861.htm
http://www3.intel.com/cd/software/products/asmo-na/eng/266852.htm
http://www3.intel.com/cd/software/products/asmo-na/eng/266863.htm
http://www3.intel.com/cd/software/products/asmo-na/eng/266864.htm
http://www3.intel.com/cd/software/products/asmo-na/eng/266857.htm
Hi,
I have tested the code on my machine:
6.875 sec for MKL 10
22.312 sec for the VC++ loop
System I am using:
Windows XP
Intel Core 2 Duo E6750 @ 2.66 GHz
3.56 GB RAM
VC++ 9.0
MKL 10.0.2.019