no multithreading on small matrices?

tletni · ‎06-19-2012

I observed that multithreading kicks in only for large matrices. I call the function below (compiled as shown) 10000 times in a for loop, once with small matrices and once with large ones. In the first case only one of my cores is used; in the second, both cores work.
With 100x10 and 10x10 matrices, no multithreading is engaged. With 200x10 and 10x10, multithreading is engaged.

Are there any rules of thumb, also for procedures other than gemm, such as dcopy, dsctr, dsyrk, dpotri, dsymm, dgthr, daxpy?

As an aside, what is the difference between "cblas_dcopy()" and "dcopy()"?

Thanks
T

#include <mkl.h> // declares dgemm()
//geom p q n

void mttest(double *a, double *b, int *geom, double *c) {
    double one = 1.0; double zero = 0;
    // c = 1.0*a*b + 0.0*c with m = geom[0], n = geom[3], k = geom[1]
    dgemm("n","n",&geom[0],&geom[3],&geom[1],&one,a,&geom[0],b,&geom[2],&zero,c,&geom[0]);
}

gcc -std=gnu99 -fpic -fmessage-length=0 -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -g -c mttest.c -o mttest.o
gcc -std=gnu99 -shared -L/opt/intel/composer_xe_2011_sp1.9.293/mkl/lib/intel64 -L/opt/intel/composer_xe_2011_sp1.9.293/compiler/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm mttest.o -o mttest.so

TimP · ‎06-19-2012

MKL threaded functions contain the equivalent of an OpenMP if() clause, which avoids the performance degradation that threading would cause on cases that are too small.
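To illustrate what an if() clause does, here is a minimal sketch in plain C with OpenMP. It is not MKL code: the function name axpy_thresholded and the cutoff PAR_THRESHOLD are made up for this example, and MKL's real cutoffs are internal to the library.

```c
#include <assert.h>

/* Hypothetical cutoff: below this many elements the loop stays serial.
   This value is an assumption for illustration, not an MKL setting. */
#define PAR_THRESHOLD 100000

/* y = alpha*x + y. When the if() condition is false, the parallel region
   is executed by the calling thread alone, so small inputs do not pay
   for starting and synchronizing a thread team. */
void axpy_thresholded(long n, double alpha, const double *x, double *y)
{
    assert(n >= 0);
    #pragma omp parallel for if(n >= PAR_THRESHOLD)
    for (long i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```

Built with -fopenmp, a call with n = 4 runs on one thread and a call with n = 1000000 may use all of them. Built without -fopenmp, the pragma is ignored and the function is simply serial.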
The cblas_ wrappers take scalar operands by value where appropriate and convert them to the Fortran default (pass by reference). They are open source code; look for yourself. Most C compilers know how to compile data moves in open C code or , so dcopy() would rarely be used.
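The difference in calling convention can be sketched as follows. Both functions are hypothetical stand-ins written for this example (my_dcopy imitates the Fortran-style dcopy interface, my_cblas_dcopy imitates the C-style wrapper); neither is the actual library source.

```c
#include <assert.h>

/* Fortran-style interface: every argument, including the scalars n,
   incx and incy, is passed by address. */
void my_dcopy(const int *n, const double *x, const int *incx,
              double *y, const int *incy)
{
    /* Sketch only: positive strides, unlike the real routine. */
    assert(*n >= 0 && *incx > 0 && *incy > 0);
    for (int i = 0; i < *n; ++i)
        y[i * *incy] = x[i * *incx];
}

/* C-style wrapper: scalars arrive by value and their addresses are
   forwarded to the Fortran-style routine. */
void my_cblas_dcopy(int n, const double *x, int incx, double *y, int incy)
{
    my_dcopy(&n, x, &incx, y, &incy);
}
```

With the wrapper a caller can write my_cblas_dcopy(3, x, 1, y, 1), whereas the Fortran-style call needs named variables for n, incx and incy so that their addresses can be taken.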

SergeyKostrov · ‎06-20-2012

Quoting tletni

...when I have 100x10 and 10x10 matrices, no multithreading is engaged. with 200x10 and 10x10, multithreading is engaged...

Multithreading would have a negative impact on overall performance if the matrix sizes are too small (less
than 128x128) because of the overhead of creating threads. For example, if two matrices have to
be multiplied using the Strassen and Classic algorithms, real performance improvements will appear only if the sizes are greater
than 128x128. The Strassen algorithm does the calculations faster even when one thread is used. I could provide some
real data if needed.

SergeyKostrov · ‎06-21-2012

Quoting Sergey Kostrov

Quoting tletni
...when I have 100x10 and 10x10 matrices, no multithreading is engaged. with 200x10 and 10x10, multithreading is engaged...

...Strassen algorithm does calculations faster even when one thread is used. I could provide some
real data if needed.

Here are performance results ( Operation - Matrix multiplication ).

Size of both matrices: 128x128

Matrix Size : 128 x 128
Matrix Size Threshold: N/A
Matrix Partitions : N/A
ResultSets Reflection: N/A
Calculating...
Classic A - Pass 1 - Completed: 0.03100 secs
Classic A - Pass 2 - Completed: 0.03100 secs
Classic A - Pass 3 - Completed: 0.01600 secs
Classic A - Pass 4 - Completed: 0.03100 secs
Classic A - Pass 5 - Completed: 0.01600 secs

Strassen HBI
Matrix Size : 128 x 128
Matrix Size Threshold: 64 x 64
Matrix Partitions : 1
ResultSets Reflection: N/A
Calculating...
Strassen HBI - Pass 1 - Completed: 0.01500 secs
Strassen HBI - Pass 2 - Completed: 0.03100 secs
Strassen HBI - Pass 3 - Completed: 0.01600 secs
Strassen HBI - Pass 4 - Completed: 0.01600 secs
Strassen HBI - Pass 5 - Completed: 0.03100 secs

Strassen HBC
Matrix Size : 128 x 128
Matrix Size Threshold: 8 x 8
Matrix Partitions : 2801
ResultSets Reflection: Enabled
Calculating...
Strassen HBC - Pass 1 - Completed: 0.12500 secs
Strassen HBC - Pass 2 - Completed: 0.03100 secs
Strassen HBC - Pass 3 - Completed: 0.03100 secs
Strassen HBC - Pass 4 - Completed: 0.03200 secs
Strassen HBC - Pass 5 - Completed: 0.01500 secs

Size of both matrices: 256x256

Matrix Size : 256 x 256
Matrix Size Threshold: N/A
Matrix Partitions : N/A
ResultSets Reflection: N/A
Calculating...
Classic A - Pass 1 - Completed: 0.59400 secs
Classic A - Pass 2 - Completed: 0.60900 secs
Classic A - Pass 3 - Completed: 0.59400 secs
Classic A - Pass 4 - Completed: 0.59400 secs
Classic A - Pass 5 - Completed: 0.60900 secs

Strassen HBI
Matrix Size : 256 x 256
Matrix Size Threshold: 128 x 128
Matrix Partitions : 1
ResultSets Reflection: N/A
Calculating...
Strassen HBI - Pass 1 - Completed: 0.17200 secs
Strassen HBI - Pass 2 - Completed: 0.17200 secs
Strassen HBI - Pass 3 - Completed: 0.15600 secs
Strassen HBI - Pass 4 - Completed: 0.17200 secs
Strassen HBI - Pass 5 - Completed: 0.17200 secs

Strassen HBC
Matrix Size : 256 x 256
Matrix Size Threshold: 16 x 16
Matrix Partitions : 2801
ResultSets Reflection: Enabled
Calculating...
Strassen HBC - Pass 1 - Completed: 0.37500 secs
Strassen HBC - Pass 2 - Completed: 0.17200 secs
Strassen HBC - Pass 3 - Completed: 0.17200 secs
Strassen HBC - Pass 4 - Completed: 0.17200 secs
Strassen HBC - Pass 5 - Completed: 0.17200 secs

Size of both matrices: 512x512

Matrix Size : 512 x 512
Matrix Size Threshold: N/A
Matrix Partitions : N/A
ResultSets Reflection: N/A
Calculating...
Classic A - Pass 1 - Completed: 10.81200 secs
Classic A - Pass 2 - Completed: 10.84400 secs
Classic A - Pass 3 - Completed: 10.82800 secs
Classic A - Pass 4 - Completed: 10.82800 secs
Classic A - Pass 5 - Completed: 10.82800 secs

Strassen HBI
Matrix Size : 512 x 512
Matrix Size Threshold: 256 x 256
Matrix Partitions : 1
ResultSets Reflection: N/A
Calculating...
Strassen HBI - Pass 1 - Completed: 1.39100 secs
Strassen HBI - Pass 2 - Completed: 1.37500 secs
Strassen HBI - Pass 3 - Completed: 1.35900 secs
Strassen HBI - Pass 4 - Completed: 1.37500 secs
Strassen HBI - Pass 5 - Completed: 1.37500 secs

Strassen HBC
Matrix Size : 512 x 512
Matrix Size Threshold: 32 x 32
Matrix Partitions : 2801
ResultSets Reflection: Enabled
Calculating...
Strassen HBC - Pass 1 - Completed: 1.12500 secs
Strassen HBC - Pass 2 - Completed: 0.65600 secs
Strassen HBC - Pass 3 - Completed: 0.64100 secs
Strassen HBC - Pass 4 - Completed: 0.65600 secs
Strassen HBC - Pass 5 - Completed: 0.65600 secs

Notes:

Strassen HBI - Strassen's Heap Based Incomplete algorithm for matrix multiplication
Strassen HBC - Strassen's Heap Based Complete algorithm for matrix multiplication
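
For readers unfamiliar with the "Matrix Size Threshold" setting, here is a minimal sketch of a Strassen multiplication with a cutoff. It is not the HBI or HBC code measured above. It assumes the threshold is the block size at or below which the recursion falls back to the classic product, and the names classic_mul, block_add and strassen_mul are made up for this example. Matrices are square, row-major, and C must not overlap A or B.

```c
#include <assert.h>
#include <stdlib.h>

/* C = A * B with the classic triple loop; lda, ldb, ldc are row strides. */
void classic_mul(int n, const double *A, int lda, const double *B, int ldb,
                 double *C, int ldc)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            for (int k = 0; k < n; ++k)
                s += A[i * lda + k] * B[k * ldb + j];
            C[i * ldc + j] = s;
        }
}

/* Z = X + sign * Y on n x n blocks; Z is stored contiguously (stride n). */
void block_add(int n, const double *X, int ldx, const double *Y, int ldy,
               double sign, double *Z)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            Z[i * n + j] = X[i * ldx + j] + sign * Y[i * ldy + j];
}

/* Strassen recursion: 7 half-size products instead of 8. Falls back to
   classic_mul when n <= threshold, when n is odd, or when malloc fails. */
void strassen_mul(int n, int threshold, const double *A, int lda,
                  const double *B, int ldb, double *C, int ldc)
{
    assert(n > 0 && threshold > 0);
    int h = n / 2;
    size_t blk = (size_t)h * h;
    double *buf = NULL;
    if (n > threshold && n % 2 == 0)
        buf = malloc(9 * blk * sizeof *buf);   /* S, T and M1..M7 */
    if (buf == NULL) {
        classic_mul(n, A, lda, B, ldb, C, ldc);
        return;
    }
    const double *A11 = A, *A12 = A + h, *A21 = A + (size_t)h * lda, *A22 = A21 + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + (size_t)h * ldb, *B22 = B21 + h;
    double *S = buf, *T = buf + blk, *M[7];
    for (int i = 0; i < 7; ++i)
        M[i] = buf + (2 + i) * blk;

    block_add(h, A11, lda, A22, lda, 1.0, S);          /* M1 = (A11+A22)(B11+B22) */
    block_add(h, B11, ldb, B22, ldb, 1.0, T);
    strassen_mul(h, threshold, S, h, T, h, M[0], h);
    block_add(h, A21, lda, A22, lda, 1.0, S);          /* M2 = (A21+A22) B11 */
    strassen_mul(h, threshold, S, h, B11, ldb, M[1], h);
    block_add(h, B12, ldb, B22, ldb, -1.0, T);         /* M3 = A11 (B12-B22) */
    strassen_mul(h, threshold, A11, lda, T, h, M[2], h);
    block_add(h, B21, ldb, B11, ldb, -1.0, T);         /* M4 = A22 (B21-B11) */
    strassen_mul(h, threshold, A22, lda, T, h, M[3], h);
    block_add(h, A11, lda, A12, lda, 1.0, S);          /* M5 = (A11+A12) B22 */
    strassen_mul(h, threshold, S, h, B22, ldb, M[4], h);
    block_add(h, A21, lda, A11, lda, -1.0, S);         /* M6 = (A21-A11)(B11+B12) */
    block_add(h, B11, ldb, B12, ldb, 1.0, T);
    strassen_mul(h, threshold, S, h, T, h, M[5], h);
    block_add(h, A12, lda, A22, lda, -1.0, S);         /* M7 = (A12-A22)(B21+B22) */
    block_add(h, B21, ldb, B22, ldb, 1.0, T);
    strassen_mul(h, threshold, S, h, T, h, M[6], h);

    for (int i = 0; i < h; ++i)
        for (int j = 0; j < h; ++j) {
            size_t p = (size_t)i * h + j;
            C[i * ldc + j]           = M[0][p] + M[3][p] - M[4][p] + M[6][p]; /* C11 */
            C[i * ldc + j + h]       = M[2][p] + M[4][p];                     /* C12 */
            C[(i + h) * ldc + j]     = M[1][p] + M[3][p];                     /* C21 */
            C[(i + h) * ldc + j + h] = M[0][p] - M[1][p] + M[2][p] + M[5][p]; /* C22 */
        }
    free(buf);
}
```

A call such as strassen_mul(512, 32, A, 512, B, 512, C, 512) would recurse from 512 down to 32x32 blocks. Under the assumption above, a threshold equal to half the matrix size gives a single level of partitioning, and a smaller threshold gives a deeper recursion.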

yuriisig · ‎06-22-2012

In my experience, multithreading fast matrix multiplication algorithms pays off only for matrices of at least 1500 x 1500: http://software.intel.com/ru-ru/forums/showthread.php?t=75835&o=a&s=lr

SergeyKostrov · ‎06-22-2012

Quoting yuriisig

In my experience, multithreading fast matrix multiplication algorithms pays off only for matrices of at least 1500 x 1500: http://software.intel.com/ru-ru/forums/showthread.php?t=75835&o=a&s=lr

Absolutely agree, because modern CPUs are very fast and it looks useless to do anything else when multiplying
small matrices. Thank you for the link; I'll take a look.

The Strassen HBC algorithm which I used for comparison is a single-threaded algorithm designed and tuned for embedded real-time systems.

Best regards,
Sergey