Using TBB decreases speed (performance). Why?

lovebestintel · 01-30-2011

To try multicore programming with TBB, I tested the following code in this environment (TBB 2.1, Intel C++ Compiler 11.x, Visual Studio 2008, Intel Core 2 Quad 8200, Windows XP).

[CODE C]
#include <iostream>
#include <cstdlib>   // srand, rand
#include <ctime>     // time
#include "tbb/parallel_for.h"
#include "tbb/blocked_range2d.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/tick_count.h"
#include "tbb/partitioner.h"

using namespace tbb;

const size_t L = 200;
const size_t M = 200;
const size_t N = 200;

// c (M x N) = a (M x L) * b (L x N), computed on a single thread.
void SerialMatrtixMultiply(float c[M][N], float a[M][L], float b[L][N]){
    for(size_t i = 0; i < M; ++i){
        for(size_t j = 0; j < N; ++j){
            float sum = 0;
            for(size_t k = 0; k < L; ++k)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
}

// Body object for parallel_for: each call computes one 2D block of c.
class MatrixMultiply2D{
    float (*my_a)[L];
    float (*my_b)[N];
    float (*my_c)[N];

public:
    void operator()(const blocked_range2d<size_t>& r) const {
        float (*a)[L] = my_a;
        float (*b)[N] = my_b;
        float (*c)[N] = my_c;

        for(size_t i = r.rows().begin(); i != r.rows().end(); ++i){
            for(size_t j = r.cols().begin(); j != r.cols().end(); ++j){
                float sum = 0;
                for(size_t k = 0; k < L; ++k)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
        }
    }
    MatrixMultiply2D(float c[M][N], float a[M][L], float b[L][N]):my_a(a), my_b(b), my_c(c)
    {}
};

void ParallelMatrixMultiply(float c[M][N], float a[M][L], float b[L][N]){
    parallel_for(blocked_range2d<size_t>(0, M, 0, N), MatrixMultiply2D(c,a,b), auto_partitioner());
}

int main(void){
    task_scheduler_init init;

    float a[M][L];
    float b[L][N];
    float c[M][N];

    // Fill both input matrices with small random values.
    srand(time(NULL));
    for(size_t i = 0; i < M; i++){
        for(size_t j = 0; j < L; j++)
            a[i][j] = rand() % 30;
    }

    for(size_t i = 0; i < L; i++){
        for(size_t j = 0; j < N; j++)
            b[i][j] = rand() % 30;
    }

    // Time the serial version.
    tick_count t0 = tick_count::now();
    SerialMatrtixMultiply(c,a,b);
    tick_count t1 = tick_count::now();
    std::cout << "seq elapsed time : " << (t1 - t0).seconds() << std::endl;

    // Time the TBB version.
    t0 = tick_count::now();
    ParallelMatrixMultiply(c,a,b);
    t1 = tick_count::now();
    std::cout << "parallel elapsed time : " << (t1 - t0).seconds() << std::endl;

    return 0;
}
[/CODE]

The elapsed time :

seq elapsed time : 0.04437542 (s)
parallel elapsed time : 0.0111000 (s)

The parallel case is faster than the sequential case.

But when I removed "task_scheduler_init init;" and "ParallelMatrixMultiply(c,a,b);", so that only SerialMatrtixMultiply(c,a,b) ran, the elapsed time was 0.0000636 (s).

This is far faster than the TBB case.

Why does this strange result appear?

Any answer will be appreciated!

timintel · 02-02-2011

I guess the compiler has been able to eliminate the loops which produce unused results.