Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Time taken by the dgemm function to run.

Erasmo_Coletti
Beginner

Hi,

I ran the dgemm matrix-matrix multiplication function from MKL in Visual Studio 2008 in Release mode. I compared its run time with the time taken by a uBLAS matrix-matrix product and with the time taken by code I wrote that does the multiplication element by element.

The C++ code is as below (n is the matrix size, NIter is the number of timing iterations):

vector<vector<double> > M1(n, vector<double>(n)); // n is the size of the matrix
vector<vector<double> > M2(n, vector<double>(n));
vector<vector<double> > M3(n, vector<double>(n));

vector<double> G1(n*n); // vectors used in dgemm
vector<double> G2(n*n);
vector<double> G3(n*n);

boost::numeric::ublas::matrix<double> U1(NRow1, NCol1), U2(NRow2, NCol2), U3(NRow1, NCol2);

int i, j, k, m;
double tp;
clock_t t1Start, t1End, t2Start, t2End, t3Start, t3End;
double tick_per_sec(CLOCKS_PER_SEC);
double t1, t2, t3;

for (i = 0; i < NRow1; ++i)
    for (j = 0; j < NCol1; ++j)
    {
        M1[i][j] = 1; // initialise first matrix
        G1[i*NRow1 + j] = 1;
        U1(i, j) = 1;
    }

for (i = 0; i < NRow2; ++i)
    for (j = 0; j < NCol2; ++j)
    {
        M2[i][j] = 2; // initialise second matrix
        G2[i*NRow2 + j] = 2;
        U2(i, j) = 2;
    }

char transa = 'N';
char transb = 'N';
double alpha = 1.0;
double beta = 0.0;

// MKL dgemm
t1Start = clock();
for (m = 1; m <= NIter; ++m)
    dgemm(&transa, &transb, &NRow1, &NCol2, &NCol1, &alpha, &G1[0], &NRow1, &G2[0], &NCol1, &beta, &G3[0], &NRow1);
t1End = clock();
t1 = (t1End - t1Start) / tick_per_sec;

// manual element-by-element multiplication
t2Start = clock();
for (m = 1; m <= NIter; ++m)
    for (i = 0; i < NRow1; ++i)
        for (j = 0; j < NCol2; ++j)
        {
            tp = 0.0;
            for (k = 0; k < NCol1; ++k)
                tp += M1[i][k] * M2[k][j];
            M3[i][j] = tp;
        }
t2End = clock();
t2 = (t2End - t2Start) / tick_per_sec;

// uBLAS
t3Start = clock();
for (m = 1; m <= NIter; ++m)
    U3 = prod(U1, U2);
t3End = clock();
t3 = (t3End - t3Start) / tick_per_sec;

The time taken per call by each method, for a 50 by 50 matrix, is shown below; it is computed as the total time divided by the number of iterations:


iter     100       200       400       600       800       1000      1200      1400
MKL      0.01766   0.000     0.01957   0.0021    0.033     0.02964   0.0188    0.01557
Manual   0.00219   0.00257   0.00215   0.0021    0.00205   0.002     0.0021    0.0021
uBLAS    0.00031   0.00031   0.000275  0.000287  0.000274  0.00028   0.00026   0.000268


For uBLAS and Manual the times are roughly stable across iteration counts. As I expected, uBLAS is much faster than Manual, about 10 times; MKL, however, is much slower than both.

Does anybody have any idea why MKL is slower?

Thank you.

Erasmo.

3 Replies
TimP
Honored Contributor III
Did you run the same number of threads in each case? clock() attempts to report total CPU time across all threads, so it would be expected to grow with the number of threads; the point of threading is to reduce elapsed (wall-clock) time by using more cores. If you wanted to avoid MKL threading, did you link the MKL sequential library?
You don't take any evident precautions to ensure the compiler treats the redundant looping the same way in each case.
If you are using MSVC, it does appear unlikely that your written-out, mixed-stride dot products will be optimized.
Gennady_F_Intel
Moderator
Erasmo,
these are expected results, because MKL is highly optimized for large inputs. Could you please check this problem when the matrix size is, say, 1000 x 1000?
--Gennady
Erasmo_Coletti
Beginner
Gennady,

After statically linking the MKL library I am able to get reasonable results; before, I was linking it dynamically.

In Visual Studio 2008 in Release mode I now get at least a 30-fold increase in speed when I compare the run time of the MKL function to simple code doing an element-by-element matrix multiplication. I tested matrices of sizes from 50 by 50 to 1000 by 1000: for 50 by 50 I get a 32-fold speedup; for 1000 by 1000 the speedup is 107-fold.

Do you have any comments on the above?

Thank you.

Erasmo.