José_Luis_G_ · ‎12-23-2013

Hello,

I'm running some benchmarks of DGEMM from MKL and OpenBLAS (the GotoBLAS successor). I'm using a piece of code similar to the following (for some reason I can't put links in the post, but the code comes from this MKL forum):

[cpp]

/* mkl.h is required for dsecnd and DGEMM */
#include <mkl.h>

/* initialization code is skipped for brevity (do a dummy dsecnd() call to improve accuracy of timing) */

double alpha = 1.0, beta = 1.0;
/* first call which does the thread/buffer initialization */
DGEMM("N", "N", &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);
/* start timing after the first GEMM call */
double time_st = dsecnd();
for (i=0; i<LOOP_COUNT; ++i)
{
     DGEMM("N", "N", &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);
}
double time_end = dsecnd();
double time_avg = (time_end - time_st)/LOOP_COUNT;
double gflop = (2.0*m*n*k)*1E-9;
printf("Average time: %e secs\n", time_avg);
printf("GFlop       : %.5f\n", gflop);
printf("GFlop/sec   : %.5f\n", gflop/time_avg);

[/cpp]

I change only the timing function when OpenBLAS is used, and I run the program using square matrices (and several repetitions) of size from 1000 to 5000.

Also, I take as reference the theoretical peak performance for my processor from intel com/support/processors/sb/CS-032819 htm (sorry for the ugly link format). For the Core2 Duo P9600 (P9000 series) at 2.66 GHz, the theoretical peak using 2 cores is 21.328 GFLOP/s. Running my program I obtain relative performances (R/Rmax) of about 95.2% for sizes between 3000 and 5000. This is very good performance, so I congratulate Intel. Using OpenBLAS, the performance is very similar.

Then I also tested the performance using only one thread. The document about theoretical peak gives no figure for one thread, so I use as Rmax the value 21.328/2 = 10.664 GFLOP/s. Running the benchmark program I obtain (for sizes 3000 to 5000) about 10.68 to 10.76 GFLOP/s, i.e. R/Rmax = 100.15% to 100.9% (!!!!). Similar results are obtained for OpenBLAS too.

How can this be possible? How can one reach the theoretical peak performance? Is it correct to calculate the theoretical peak for 1 thread as R2thread/2? How can the strange value R/Rmax > 100% for 1 thread be explained? Has anyone tested DGEMM using a similar processor?

The FLOP count for DGEMM is 2*M*N*K, divided between M*N*K products and M*N*K additions. Does a product take the same time as an addition, or is it slower?

Thanks

José_Luis_G_ · ‎12-23-2013

I had forgotten: I'm using icc 14.0.0.080 and MKL 11.1 under Debian GNU/Linux.

TimP · ‎12-23-2013

You may have been first to investigate in detail performance of this old CPU with the current performance libraries.

I might have said that if you have Turbo Mode enabled, that would be expected to give additional relative boost for single thread. I note that ark dot intel dot com (yes, someone turned on blocking of URL in posts) says this model didn't have turbo mode, and doesn't specify instruction set (I'd guess sse4.1; early Core 2 Duo was ssse3 only). You might attempt experimental verification of single thread clock rate.

I'd refer you to the data sheet to check further, but it appears that both parallel multiply and parallel add can be issued and retired on the same clock cycle, so they should have the same peak throughput, and performance could be additive, although the add takes fewer cycles to retire.

Murat_G_Intel · ‎12-23-2013

In addition to Tim's suggestion, do you also observe > 100% efficiency if you use a different function to measure the execution time? You can use either the system timer or omp_get_wtime. You may need to do more repetitions, since other timers may have lower precision than dsecnd.

A change in system frequency may also impact the timer. You may try repeating the single-thread timing measurements after a system restart before running anything else.

José_Luis_G_ · ‎12-23-2013

Hello, and thank you for your answer,

Yes, some kind of turbo mode was my first suspicion. As you said, ark dot intel dot com says that this processor has neither turbo mode nor hyperthreading. Even so, I looked in the BIOS, but the only option related to CPU performance was the possibility of deactivating multithreading in order to use only one core (the results were the same as disabling multithreading at runtime via the environment variables OMP_NUM_THREADS or MKL_NUM_THREADS). Could the actual speed be slightly higher than Intel says (the BIOS reports 2.66 GHz, as Intel does)?

About the instruction set, again ark dot intel dot com says nothing, but executing cat /proc/cpuinfo on linux one can see that the CPU is SSE4.1 capable.

José_Luis_G_ · ‎12-23-2013

Hello, Murat, and thank you for your answer,

I used the functions dsecnd() from MKL and also clock_gettime(), which has nanosecond resolution, and the results are the same. The tests were repeated several times (5 for dimensions 3000 to 5000, which is enough) and were performed with the system in a non-GUI environment, in order to keep the GUI and related daemons from using CPU.

Ying_H_Intel · ‎12-25-2013

Hello Jose,

I suppose the code is from the article http://software.intel.com/en-us/articles/a-simple-example-to-measure-the-performance-of-an-intel-mkl-function, right?

In my experience, a LOOP_COUNT of 5 seems too small for 1024x1024 (sometimes I even get a negative time value). Could you try increasing LOOP_COUNT to 100 and see if anything changes?

Also try another way of measuring time, such as:

[cpp]
#include <stdio.h>
#include <sys/time.h>

....
struct timeval stime, etime, diff;
double timeinsec;

/* first call, outside the timed region */
dgemm("N","N",&m,&n,&k,&alpha,a,&m,b,&k,&beta,c,&m);

gettimeofday(&stime, NULL);
for (i=0; i<LOOP_COUNT; ++i)
{
    dgemm("N","N",&m,&n,&k,&alpha,a,&m,b,&k,&beta,c,&m);
}
gettimeofday(&etime, NULL);

/* elapsed time converted to seconds and averaged over the loop */
timersub(&etime, &stime, &diff);
timeinsec = (diff.tv_sec*1000.0+diff.tv_usec/1000.0)/1000.0/LOOP_COUNT;

double gflop = (2.0*m*n*k)*1E-9;
printf("Average time:%f secs \n", timeinsec);
printf("GFlop/sec :%.5f \n", gflop/timeinsec);
[/cpp]

Best Regards,

Ying

José_Luis_G_ · ‎12-25-2013

Hello, Ying G, and thank you for your answer,

Note that I'm observing the >100% performance on matrices from 3000x3000 to 5000x5000 (I've not tested larger dimensions). Yes, the code comes from the link you've posted. About the timers: I've used the MKL dsecnd(), whose documentation says nothing about its resolution beyond the fact that it returns double precision. I've also used omp_get_wtime() from OpenMP, which has a resolution of a microsecond or better, and the POSIX clock_gettime(), which has nanosecond resolution, so I think the timing is adequate. I've not used gettimeofday() because its resolution is on the order of a microsecond. I've increased LOOP_COUNT to 10 for N=3000 (using 100 for this size takes too long) and the results for 1 thread are the same.

Ying_H_Intel · ‎12-26-2013

Hello Jose,

I guess the time measurement (error) is the key. As the computing time is like a blink, a small difference matters too. I tried both dsecnd() and gettimeofday(). The times are slightly different: gettimeofday() gives a larger value than dsecnd() when the loop count is smaller.

[yhu5@snb01 dgemm.tar]$ ./xtest_ia32 3000 3000 3000 10
Size of Matrix A(mxk): 3000 x 3000
Size of Matrix B(kxn): 3000 x 3000
Size of Matrix C(mxn): 3000 x 3000
LOOP COUNT : 10
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3}
OMP: Info #156: KMP_AFFINITY: 4 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 4 cores/pkg x 1 threads/core (4 total cores)
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3}
Average time:2.109600e+00 secs from gettimeofday();
Average time:2.063253e+00 secs from dsecnd();
GFlop :54.00900
GFlop/sec :26.17662

[yhu5@snb01 dgemm.tar]$ ./xtest_ia32 3000 3000 3000 100
Size of Matrix A(mxk): 3000 x 3000
Size of Matrix B(kxn): 3000 x 3000
Size of Matrix C(mxn): 3000 x 3000
LOOP COUNT : 100
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3}
OMP: Info #156: KMP_AFFINITY: 4 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 4 cores/pkg x 1 threads/core (4 total cores)
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3}
Average time:2.069050e+00 secs from gettiemeofday
Average time:2.061695e+00 secs
GFlop :54.00900
GFlop/sec :26.19640

Best Regards,

Ying

TimP · ‎12-26-2013

In my past tests, omp_get_wtime() didn't give microsecond resolution, but I used it because it is more portable between Windows and linux than other alternatives and gives consistently better than millisecond resolution. On linux, it's usually a wrapper for gettimeofday().

icc supports __rdtsc() same as Microsoft; for other C compilers it's possible to use asm, paying attention to the differences between 32- and 64-bit modes. These methods surely are capable of timing 1000x1000 matrix multiplication. One way of avoiding run-time overhead when converting between rdtsc count and seconds is to build it in at compile time (taking the opportunity to eliminate division). That requires addressing the question of the actual tick count rate separately prior to building the test application.

Usual ways of measuring the tick rate involve making a sufficiently long loop which calibrates tick count against gettimeofday() for more than a second.

lmbench (bitmover dot com) includes a method for actually measuring average CPU clock rate over a period of a few seconds.

You never told us whether /proc/cpuinfo gives you the 2.66 GHz clock rate you are assuming.

José_Luis_G_ · ‎12-26-2013

Hello,

About the frequency, cat /proc/cpuinfo gives 2.66 GHz, and the same number can be seen in the BIOS.

About the timer, I think clock_gettime() is accurate enough to measure the execution time, at least at the same level as gettimeofday(), omp_get_wtime() or dsecnd().

About the number of iterations to use, with matrix dimensions of about 3000 to 5000, 5 iterations are enough, I think. The differences in performance using 5 or more (100 in Ying's example) are at the level of 0.01-0.1 GFLOP/s, but the "problem" is about reaching the theoretical peak. Some time ago I tested a Core i5 2500 (disabling turbo mode, of course), also with MKL, and the performance using 1 thread was about 93%-95% R/Rpeak, which is also very good performance and, I think, more realistic.

I'm using this code for benchmarking and plot: https://bitbucket.org/jgpallero/pblb

Performance of DGEMM on Core2 Duo P9600 2.66 GHz