Performance Evaluation of a Matrix Multiply: 2048 x 2048 \\ Data type 'float' \\ All matrix elements 1.0f

SergeyKostrov · ‎12-16-2011

NOTE: I'm sorry, but I decided to post again because my previous post "deviated" from the subject.

Guys, if you have some time and could provide some performance numbers, obtained with any
version of MKL, I would really appreciate it! If you can't... sorry that my post took a couple of seconds
of your valuable time.

THIS IS WHAT I NEED:

I wonder if somebody who has MKL could do a Performance Evaluation of a Matrix Multiplication function?

Test-Case:

- Both matrices 2048 x 2048
- Data type 'float'
- All Elements Initialized to 1.0f

Please report the Time ( in secs ) to Calculate a Product of two matrices and some details about your CPU,
frequency, memory in GBs, etc.

I'm not interested in the result of the multiplication. I'm interested to know how long it takes to calculate it on
different computers with different CPUs using Intel's MKL.

Thank you in advance.

Best regards,
Sergey

Todd_R_Intel · ‎12-20-2011

Sergey,

We post a number of benchmarks on our website but we don't expect that they will ever cover all customer questions. There are simply too many permutations.

Even your question above, leads to some other question... What OS? Are matrices transposed or not? You say both matrices, so is the third matrix in SGEMM, "C" zeroed with beta equal to 0?

And then, naturally, full documentation and disclaimers are required whenever Intel posts a benchmark number.

So you see, what seems like a simple request can become a slightly bigger one. We do our best here to provide some representative performance numbers that give an indication of the kinds of results you can get with Intel MKL. For the other cases, we provide a free evaluation copy of the fully functional version of Intel MKL so that you can try it on the case that is important to you.

Todd

SergeyKostrov · ‎12-20-2011

>>We post a number of benchmarks on our website but we don't expect...

These benchmarks are in 'Gflops', not in 'Seconds'.

>>Even your question above, leads to some other question... What OS?

Any OS. No special requirements, and whatever is best for you. A computer with the latest or
an older ( 1 - 2 years old ) Intel CPU would be OK.

>>Are matrices transposed or not?

No. All matrix elements are initialized to 1.0. Both matrices are square, 2048 by 2048, which means that
it doesn't matter whether you transpose a matrix or not. The result will be the same.

>>You say both matrices, so is the third matrix in SGEMM, "C" zeroed with beta equal to 0?

Here is C pseudo-code:

...
float fA[2048][2048]; // Matrix A
float fB[2048][2048]; // Matrix B
float fC[2048][2048]; // Matrix C

for( int i = 0; i < 2048; i++ )
{
    for( int j = 0; j < 2048; j++ )
    {
        fA[i][j] = 1.0f;
        fB[i][j] = 1.0f;
        fC[i][j] = 0.0f;
    }
}

t1 = GetTime();
fC = < MKLMatrixMultiply >( fA, fB ); // Any MKL version
t2 = GetTime();

Delta = t2 - t1; // Time to multiply ( in seconds, for example )
...

As you can see I don't need something really special.
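The pseudo-code above could be turned into a compilable sketch along the following lines. This is only an illustration: `naive_sgemm` and `time_matmul` are hypothetical names, the matrices are heap-allocated because three 16 MB arrays are unlikely to fit on the stack, and a plain triple loop stands in for MKL so the code builds without it. The comment shows roughly how the same product would be requested through MKL's CBLAS interface, with beta set to 0 so the initial contents of C do not matter.

```c
#include <stdlib.h>
#include <time.h>

/* Stand-in for the MKL call so the sketch compiles without MKL.
   With MKL available, the equivalent request would be roughly:
     cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                 n, n, n, 1.0f, a, n, b, n, 0.0f, c, n);          */
static void naive_sgemm(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            c[i * n + j] = 0.0f;              /* beta = 0: C is overwritten */
        for (int k = 0; k < n; k++) {
            float aik = a[i * n + k];
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}

/* Fills A and B with 1.0f, multiplies them, and returns the elapsed CPU
   time in seconds, or -1.0 if allocation fails. If c00 is not NULL it
   receives C[0][0], which should equal n for all-ones inputs.          */
double time_matmul(int n, float *c00)
{
    size_t count = (size_t)n * (size_t)n;
    float *a = malloc(count * sizeof *a);
    float *b = malloc(count * sizeof *b);
    float *c = malloc(count * sizeof *c);
    if (!a || !b || !c) { free(a); free(b); free(c); return -1.0; }

    for (size_t i = 0; i < count; i++) {
        a[i] = 1.0f;
        b[i] = 1.0f;
        c[i] = 0.0f;
    }

    clock_t t1 = clock();
    naive_sgemm(n, a, b, c);                  /* the call being timed */
    clock_t t2 = clock();

    if (c00) *c00 = c[0];
    free(a); free(b); free(c);
    return (double)(t2 - t1) / CLOCKS_PER_SEC;
}
```

Calling `time_matmul(2048, NULL)` would reproduce the test case from the first post. Note that `clock()` reports CPU time, so with a multi-threaded MKL a wall-clock timer (MKL ships one, `dsecnd()`) is likely the better choice.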

Best regards,
Sergey

Murat_G_Intel · ‎12-21-2011

Hi Sergey,

We report the performance numbers in flops (flop/sec), which is the number of floating point operations (flop) per second (sec). You can find the time required for a routine if you know flop and flop/sec.

For example, the number of floating point operations to compute SGEMM with M=N=K=2048, beta=0.0, alpha=1.0 is given as:

2*M*N*K= 2*2048*2048*2048 = 17179869184 flop ~= 17.180 Giga-Flop (GFlop)

Now, if SGEMM runs at 200 GFlop/sec (or GFlops), then the time for SGEMM will be:

17.180 / 200 = 0.0859 secs

Double-precision GEMM (DGEMM) is shown on the performance charts, and as a rule of thumb, the single-precision performance is twice the double-precision performance. Therefore, you can multiply the DGEMM GFlops by two to get an estimate of SGEMM GFlops.
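The arithmetic above can be written down as a few small helpers. This is a sketch only; the function names are made up for illustration and are not part of MKL.

```c
#include <stdint.h>

/* Floating point operations for an (M x K) times (K x N) product,
   using the 2*M*N*K count quoted in the post.                     */
uint64_t gemm_flop(uint64_t m, uint64_t n, uint64_t k)
{
    return 2u * m * n * k;
}

/* Estimated run time in seconds at a sustained rate given in GFlop/sec. */
double gemm_seconds(uint64_t flop, double gflops)
{
    return (double)flop / (gflops * 1e9);
}

/* Rule-of-thumb SGEMM rate derived from a charted DGEMM rate. */
double sgemm_gflops_from_dgemm(double dgemm_gflops)
{
    return 2.0 * dgemm_gflops;
}
```

For the test case in this thread, `gemm_flop(2048, 2048, 2048)` gives 17179869184, and `gemm_seconds` of that count at 200 GFlop/sec gives about 0.0859 secs, matching the numbers above.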

Best wishes,

Efe

SergeyKostrov · ‎12-21-2011

Hi Efe,

Even if it is some kind of "calculated performance", not measured, it gives me a better idea about the performance of MKL.

I have a question. What is the number '2' in:

2*M*N*K= 2*2048*2048*2048 = 17179869184 flop ~= 17.180 Giga-Flop (GFlop)
^

Thank you for your time!

Best regards,
Sergey

Gennady_F_Intel · ‎12-21-2011

This is the number of multiplications and additions: each of the M*N*K inner-loop steps performs one multiply and one add.
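To see where the factor of two comes from, here is a small sketch that counts the operations in a textbook triple loop. It assumes row-major storage and an accumulator that starts at zero, so every inner step is one multiply plus one add; `op_count` and `counted_matmul` are illustrative names, and this is not a claim about how MKL implements SGEMM internally.

```c
#include <stddef.h>

typedef struct {
    unsigned long long multiplies;
    unsigned long long adds;
} op_count;

/* Textbook C = A*B for A (m x k), B (k x n), C (m x n), row-major,
   counting one multiply and one add per inner-loop step.          */
op_count counted_matmul(size_t m, size_t n, size_t k,
                        const float *a, const float *b, float *c)
{
    op_count ops = {0, 0};
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t p = 0; p < k; p++) {
                sum += a[i * k + p] * b[p * n + j];
                ops.multiplies++;             /* a * b   */
                ops.adds++;                   /* sum +=  */
            }
            c[i * n + j] = sum;
        }
    }
    return ops;
}
```

The two counters each end at M*N*K, so their sum is 2*M*N*K.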

SergeyKostrov · ‎12-22-2011

>>...Now, if SGEMM runs at 200 GFlop/sec (or GFlops )

Question 1:
Which modern Intel CPUs provide such performance?

Question 2:
I also would like to compare performance gains relative to some older Intel CPUs, for example
Pentium 4 or Atom N270. So, how fast are they in terms of the number of floating point operations per second?

Best regards,
Sergey

TimP · ‎12-22-2011

Quoting Sergey Kostrov

>>...Now, if SGEMM runs at 200 GFlop/sec (or GFlops )

Question 1:
Which modern Intel CPUs provide such performance?

Question 2:
I also would like to compare performance gains relative to some older Intel CPUs, for example
Pentium 4 or Atom N270. So, how fast are they in terms of the number of floating point operations per second?

An AVX CPU, even without FMA, would have a peak rating of 16 single-precision flop per core per clock cycle. So you are talking about, e.g., an 8-core CPU at 2 GHz.
Most of the recent new entries on Top500 are exceeding 200 Gflops DGEMM per node (2 CPUs) and 80% "efficiency" (actual vs. peak rated performance), and that is sustained for over 10000 cores.
This (for P4, Atom), .... has been covered many times over in public internet posts.
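The peak rating and "efficiency" mentioned above follow from simple arithmetic, sketched here with hypothetical helper names. The inputs are the figures from this post (16 single-precision flop per core per cycle, 8 cores, 2 GHz), not measurements.

```c
/* Peak rate in GFlop/sec: flop per cycle per core * cores * clock in GHz. */
double peak_gflops(double flop_per_cycle_per_core, int cores, double ghz)
{
    return flop_per_cycle_per_core * cores * ghz;
}

/* Efficiency: actual (measured) rate divided by the peak rated rate. */
double efficiency(double measured_gflops, double peak)
{
    return measured_gflops / peak;
}
```

With those inputs, `peak_gflops(16.0, 8, 2.0)` is 256 GFlop/sec, so a sustained 200 GFlop/sec SGEMM would correspond to an efficiency of about 78%.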

SergeyKostrov · ‎12-22-2011

>>...2*M*N*K= 2*2048*2048*2048

It looks like the famous T = O*(n^3), with O equal to '2'.

I'm not convinced that a classic (single-thread) algorithm for matrix multiplication is at the core of MKL's
SGEMM or DGEMM functions. I think the Strassen or Strassen-Winograd algorithms have to be used to boost the
speed of calculations.

SergeyKostrov · ‎12-24-2011

Merry Christmas and a Happy New Year!

Thanks to everybody who responded to my posts.

Best regards,
Sergey