Hi all,
I am evaluating Intel MKL for use in financial applications (Monte Carlo etc.). I get good speedups for random number generation, but for matrix-vector multiplication I only see around 10%, even though I would expect much more.
My timings are:
For n=2000, ITERATIONS = 1000:
MKL dgemv: 10.656 sec
Naive C++: 11.782 sec
For n=5000, ITERATIONS = 200:
MKL dgemv: 12.828 sec
Naive C++: 13.985 sec
I have included the code below.
Microsoft Visual C++ 8
Intel Core Duo CPU, T2400, 1.83 GHz, 0.99 GB of RAM
Lenovo X60 laptop
Any hints on how to improve the performance?
Best regards,
Niels
#include <iostream>
#include <ctime>
using namespace std;
#include "mkl.h"

void testMatrixMul() {
    const unsigned n = 2000;
    const unsigned ITERATIONS = 1000;
    double* A = new double[n * n];
    double* x = new double[n];
    double* y = new double[n];
    for (unsigned i = 0; i < n; ++i) {
        for (unsigned j = 0; j < n; ++j)
            *(A + i * n + j) = 0.5 * i + 0.7 * j;
        *(x + i) = 0.5 * i;
        *(y + i) = 0.0;
    }
    cout << "Start" << endl;
    double startTime = static_cast<double>(clock());
    // y = alpha * A * x + beta * y
    for (unsigned k = 0; k < ITERATIONS; ++k) {
        cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                    1.0, A, n,
                    x, 1,
                    1.0, y, 1);
    }
    double endTime = static_cast<double>(clock());
    double secondsElapsed = (endTime - startTime) / double(CLOCKS_PER_SEC);
    cout << "End" << endl;
    cout << secondsElapsed << endl;
    cout << endl;

    ///// Naive C++ version
    for (unsigned i = 0; i < n; ++i) {
        for (unsigned j = 0; j < n; ++j)
            *(A + i * n + j) = 0.5 * i + 0.7 * j;
        *(x + i) = 0.5 * i;
        *(y + i) = 0.0;
    }
    cout << "Start" << endl;
    startTime = static_cast<double>(clock());
    // y = alpha * A * x + beta * y
    for (unsigned k = 0; k < ITERATIONS; ++k) {
        for (unsigned i = 0; i < n; ++i) {
            double sum = 0.0;
            for (unsigned j = 0; j < n; ++j)
                sum += *(A + i * n + j) * *(x + j);
            *(y + i) = 1.0 * sum + 1.0 * *(y + i);
        }
    }
    endTime = static_cast<double>(clock());
    secondsElapsed = (endTime - startTime) / double(CLOCKS_PER_SEC);
    cout << "End" << endl;
    cout << secondsElapsed << endl;
    cout << endl;

    delete[] A;
    delete[] x;
    delete[] y;
}

int main() {
    testMatrixMul();
}
To investigate further, a profiler such as PTU (see WhatIf forum) would be useful.
Thank you for the quick reply.
I have not enabled thread support because the code will eventually run on a grid with multiple engines per machine, so we will not benefit from threading there. In any case, I cannot seem to make threading work for gemv, whereas for gemm it works fine.
I tried with n = 14000, ITERATIONS = 1 and get
MKL: 0.547 seconds
C++: 0.594 seconds
So it does not seem that the problem is the ITERATIONS loop being optimised away.
Also, other tests I have run using other matrix-vector multiplication functions from MKL and IPP seem to give only about a 10% performance improvement. Is this all I can expect, or should I be able to get better numbers? I have looked in the documentation, but I cannot find any benchmarks saying how much I should expect. (I found a table for the random number generator in this document: http://software.intel.com/en-us/articles/monte-carlo-simulation-using-various-industry-library-solutions)
Can you point me to a document giving benchmarks? Or do you have other suggestions for improving the performance of matrix-vector multiplication?
Best regards,
Niels
Your observation about ?gemv generally agrees with mine, that there don't appear to be optimizations in MKL beyond what you get by compiling the public source code with a vectorizing compiler. In fact, the MKL is bug compatible with the public source, as to the problems which occur with zero length arrays. If you make your test cases large enough to be dominated by memory bandwidth, even the advantage of vectorization diminishes.
I believe you are correct, that there is little advantage to threading within ?gemv until the vector length is quite large (thousands), but then the speedup may be limited by memory bandwidth. You may have better opportunities for parallelism at a higher level in your application.
The greatest advantages of MKL dgemm over public source are likely to be where OpenMP parallel and cache blocking come into play. It uses unroll-and-jam optimization, which is inhibited by if(.... != 0) branching in the public source, so the public source is more optimized to the semi-sparse cases where loops can be skipped on account of zero operands. The advantage gained by unroll-and-jam is larger on Core architecture than on previous Intel architectures.
Hi Tim
Thank you for the answers.
I read the benchmark article. It is interesting, but it only compares the performance of MKL 6 to MKL 6.1; it doesn't say how good either is compared to a standard C++ implementation compiled with a good compiler.
As for threading of GEMV: I found a place in the documentation (MKL User's Guide, "Managing Performance and Memory") which indirectly says that GEMV is not threaded: "Intel MKL is threaded in a number of places: direct sparse solver, LAPACK (*GETRF, *POTRF, *GBTRF, *GEQRF, *ORMQR, *STEQR, *BDSQR, *SPTRF, *SPTRS, *HPTRF, *HPTRS, *PPTRF, *PPTRS routines), all Level 3 BLAS, Sparse BLAS matrix-vector and matrix-matrix multiply routines for the compressed sparse row and diagonal formats, VML, and all FFTs (except 1D transformations when DFTI_NUMBER_OF_TRANSFORMS=1 and sizes are not power of two)."
But threading is not that relevant to me anyway as the calculations will be "threaded" using grid computing.
I guess I can conclude that I can only expect to see a 10% performance gain on GEMV compared to standard C++ code compiled with Visual C++ 2005.
Best regards,
Niels
http://www3.intel.com/cd/software/products/asmo-na/eng/307757.htm
I haven't figured out how to view it on Linux.
The MKL team has prepared also the following pages:
http://www3.intel.com/cd/software/products/asmo-na/eng/266858.htm
http://www3.intel.com/cd/software/products/asmo-na/eng/266861.htm
http://www3.intel.com/cd/software/products/asmo-na/eng/266852.htm
http://www3.intel.com/cd/software/products/asmo-na/eng/266863.htm
http://www3.intel.com/cd/software/products/asmo-na/eng/266864.htm
http://www3.intel.com/cd/software/products/asmo-na/eng/266857.htm
Hi,
I have tested the code on my machine:
6.875 sec for MKL 10
22.312 sec for the VC++ loop
System I am using:
Windows XP
Intel Core 2 Duo E6750 @ 2.66 GHz
3.56 GB RAM
VC++ 9.0
MKL 10.0.2.019