TBB behaving erratically

Nav · ‎01-21-2010

I tried a parallel_for program on TBB and it gave me the following results for a grainsize of 20000 and a matrix addition with matrices of 20000 elements:

Time for serial execution: 52 microsecond

Time for parallel execution: 36 microsecond

This was with gcc 4.1 which was preinstalled on my Fedora 8 system. I've got a 64 bit Intel Core2 Duo processor.

But recently, I installed gcc 4.4.2 and the trial version of Intel C++ compiler v11.1 and got these results for the same program:

Compiling with g++:

Time for serial execution: 523 microsecond

Time for parallel execution: 700 microsecond

Compiling with icpc:

Time for serial execution: 427 microsecond

Time for parallel execution: 1345 microsecond

What surprises me more than the increase in processing time, is that now, the parallel execution time is greater than the serial time.

I'll be posting the code soon. Could anyone help explain this anomaly?

Nav · ‎01-21-2010

The code is attached with this post.

Nav · ‎01-21-2010

What is even more surprising is that when I tried matrix multiplication today, the parallel_for seemed to be working fine for this code:

[bash]#include
#include "../include/tbb/task_scheduler_init.h"
#include "../include/tbb/parallel_for.h"
#include "../include/tbb/blocked_range2d.h"
#include "../include/tbb/tick_count.h"
#include
#include
#define MICROSECOND 1000000

using namespace std;
using namespace tbb;

const size_t L = 750;
const size_t M = 725;
const size_t N = 700;

class MatrixMultiplyBody2D
{
float (*my_a);
float (*my_b);
float (*my_c);
public:
void operator()( const blocked_range2d& r ) const
{
float (*a) = my_a;
float (*b) = my_b;
float (*c) = my_c;
for( size_t i=r.rows().begin(); i!=r.rows().end(); ++i )
{
for( size_t j=r.cols().begin(); j!=r.cols().end();++j )
{
float sum = 0;
for( size_t k=0; k*b;
c = sum;
}
}
}
MatrixMultiplyBody2D( float c, float a, float b ) : my_a(a), my_b(b), my_c(c) {}
};//end of class

void ParallelMatrixMultiply_withTBB(float c, float a, float b)
{
parallel_for( blocked_range2d(0, M, 16, 0, N, 32),MatrixMultiplyBody2D(c,a,b) );
}//ParallelMatrixMultiply_withTBB

void ParallelMatrixMultiply_withoutTBB(float c, float a, float b)
{
for( size_t i=0; i*b;
c = sum;
}
}

}//ParallelMatrixMultiply_withoutTBB

int main()
{
float a,b,c,d;
struct timeval start, end;
long seconds,useconds;
//task_scheduler_init tbb_init;

for(int i=0;i=j;
for(int i=0;i=j;

tick_count t0=tick_count::now();
gettimeofday(&start,NULL);
ParallelMatrixMultiply_withTBB(c,a,b);//----------------------------------------------------------TBB
gettimeofday(&end,NULL);
tick_count t1=tick_count::now();
seconds  = end.tv_sec  - start.tv_sec;
useconds = end.tv_usec - start.tv_usec;
printf("TBB: work took %ld second and %ld microsecondn",seconds,useconds);

t0=tick_count::now();
gettimeofday(&start,NULL);
ParallelMatrixMultiply_withoutTBB(d,a,b);//-------------------------------------------------------Serial
gettimeofday(&end,NULL);
t1=tick_count::now();
seconds  = end.tv_sec  - start.tv_sec;
useconds = end.tv_usec - start.tv_usec;
printf("Serially: work took %ld second and %ld microsecondn",seconds,useconds);
}//main[/bash]

When compiled with g++:

TBB: work took 2 second and 153167 microsecond
Serially: work took 3 second and 177209 microsecond

When compiled with icpc:

TBB: work took 1 second and 72317 microsecond
Serially: work took 0 second and 0 microsecond
How could it take zero microsecond? I checked with some code to verify if it was actually doing the calculation, and it was doing the calculation. But how in the world could it take zero microsecond?!!!

RafSchietekat · ‎01-21-2010

On Core Duo/Linux/g++ 4.2.4, and with twice the addition+substitution (TBB resp. Serially):

if(useconds<0) { seconds-=1; useconds+=1000000; }
printf("TBB/Serially: work took %ld.%06ld or %f seconds\n",seconds,useconds, (t1-t0).seconds());

I get (not optimised):

TBB: work took 1.839334 or 1.839345 seconds
Serially: work took 4.167697 or 4.167701 seconds

or (optimised):

TBB: work took 1.007371 or 1.007380 seconds
Serially: work took 1.880946 or 1.880949 seconds

So icpc optimises part of the program away? Cool! :-)

(Added) How did you check that the computation was performed at run time? Did you, e.g., print out the sum of the elements? All the elements? Try to initialise the matrices with random values?

Nav · ‎01-22-2010

"if(useconds<0) { seconds-=1; useconds+=1000000; }" - That was helpful. I had made a small calculation mistake when I tried it out.

Yes, I printed the sum's and verified two different arrays containing the sums of serial and parallel. Even with random values, the time taken is zero millisecond.

How did you do twice the addition+subtraction? When I tried running the function twice, it still gave zero millisecond.

Did you see the parallel_for zip file I've attached? Does that work correctly on your system?

RafSchietekat · ‎01-22-2010

"Yes, I printed the sum's and verified two different arrays containing the sums of serial and parallel. Even with random values, the time taken is zero millisecond."
With the same code change? Not even a microsecond? Spooky!

"How did you do twice the addition+subtraction? When I tried running the function twice, it still gave zero millisecond."
I meant that twice in the code I added (inserted) the first line and substituted (not "subtracted") the second line.

"Did you see the parallel_for zip file I've attached? Does that work correctly on your system?"
Hmm, just some random thoughts to consider (please try and then tell us): you have a lot of unused variables there (use -Wall and heed its advice), you don't reinitialise averageTime before testing parallel time, it might be better to execute the loop as shown below, the code might be simpler with tbb::tick_count and simple printf()'ing as %f, and then of course you should give parallel execution a chance to offset the parallel overhead, and consider that tasks need to do a minimum amount of work before they become (significantly) more useful than sequential execution.

[cpp]void operator() ( const blocked_range& r ) const
{
  int * l_array_a=p_array_a;
  int * l_array_b=p_array_b;
  int * l_array_sum=p_array_sum;
  for ( int i = r.begin(), i_end = r.end(); i!=i_end; i++ )
  { // iterates over the entire chunk
    l_array_sum = l_array_a + l_array_b;
  }                                
}
[/cpp]

RafSchietekat · ‎01-23-2010

My random thoughts above were a bit pedantic, but it's easier to get somebody to test something if the code is more accessible and with fewer surprises.

I had a closer look, with Core Duo/Linux/g++ 4.2.4/tbb22_20091101oss (always the debug library from TBB in these timings, though) and program arguments "20000 20000", with the following results (each time a typical result from a few repeated attempts):

With the original operator() code:

Without optimisation:
0 seconds and 129 micro-seconds Serial time
0 seconds and 225 micro-seconds Parallel time

With optimisation:
0 seconds and 77 micro-seconds Serial time
0 seconds and 71 micro-seconds Parallel time

With modified code:

Without optimisation:
0 seconds and 126 micro-seconds Serial time
0 seconds and 132 micro-seconds Parallel time

With optimisation:
0 seconds and 79 micro-seconds Serial time
0 seconds and 47 micro-seconds Parallel time

I'm surprised at the 77/71 result (yes, it is typical), because it would seem that with grainsize equalling array size everything is serial anyway (right?), so the overhead of potential parallelism is not compensated by the benefits of actual parallelism (I didn't try to modify the serial loop, so the 79/47 result has an obvious alternate explanation). I haven't looked further myself either to test results with actual parallel execution and/or varying amounts of work, though, nor have I tried to similarly optimise the serial loop.

Performance improvements like this seem to beg the question what will happen with lambda expressions, which have the desirable property of getting rid of all the clutter... only for the outcome to be marred by the need for local pointer copies for best performance?

(Added 2010-01-24) Sorry, the optimisation is obviously not relevant for the serial code. And the situation annoyingly doesn't seem to be resolved with the use of tbb::tick_count instead either.

Nav · ‎01-24-2010

"With the same code change? Not even a microsecond? Spooky!"

Yes. Do any of the TBB developers have any explanation to this?

I added another line of for(size_t k=0; k*b;} and still got zero microsecond :) It gave 1 microsecond for one of the runs though :) (glory haleluiah!)

"...you have a lot of unused variables there..."

I'm extremely sorry about the code clutter. I also realised later that averageTime was not re-initialised, but correcting it didn't affect the results much because the value was almost 0.

Thank you very much for taking the time to run the code. I'm very grateful for it.

I've tried the program for varying grainsizes and large array sizes. Parallel_for still takes more time than serial.

Same problem with OpenMP. But OpenMP's performance time is a little better than TBB. (Robert Reed, if you're reading this post, I've got my answer).

I didn't understand what you meant by -Wall. Googling for it didn't give any decent results either.

Secondly, I assume you're using something like -O1 or -O2 when you're optimising?

"give parallel execution a chance to offset the parallel overhead, and consider that tasks need to do a minimum amount of work before they become (significantly) more useful than sequential execution."

That's very true. For doing so, we need to know how much of overheads is initially produced. Is there any documentation or tutorial which can help with this?

The internal working (performance wise) of TBB is largely unavailable in detail. It'd be a welcome gift of knowledge if I could learn such details before coding in TBB. Performance matters a lot to me and squeezing out performance needs knowledge.

Nav · ‎02-16-2010

Erm...replies anyone??

RafSchietekat · ‎02-17-2010

"I didn't understand what you meant by -Wall. Googling for it didn't give any decent results either."
man g++

"Secondly, I assume you're using something like -O1 or -O2 when you're optimising?"
-O

"For doing so, we need to know how much of overheads is initially produced. Is there any documentation or tutorial which can help with this?"
Tune manually (try different grainsizes) or use auto_partitioner? It's more of an art than a science, I guess. There are some guidelines in the documentation about minimum number of processor instructions per task before it is worthwhile to spawn one, but that seems rather opaque if you're not drilling down to that level.

"The internal working (performance wise) of TBB is largely unavailable in detail. It'd be a welcome gift of knowledge if I could learn such details before coding in TBB. Performance matters a lot to me and squeezing out performance needs knowledge."
Do you mean beyond the online documentation and the book?

Nav · ‎02-18-2010

Thank you.

I haven't seen the book as yet, but the book reviews had indicated it to only be some cursory introductory information.

Actually, I've recently found a book which explains a lot about the details of parallel processing. I'm understanding things a lot better with the help of that book. Hope to understand more of the TBB source soon.