Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.
2464 Discussions

parallel algorithm is no faster than serial...

lancekimbrough
Beginner
655 Views

I wrote a program to calculate every factorial of i from i=0 to n, but the parallel portion of the program runs no faster (sometimes slower), than the serial version. This is still the first program I've written using TBB (and the first time using a template library), so I was hoping someone might see something obvious that would cause this behavior. (I'm using Linux and I've compiled using intel c++ and gcc, with the same results).

[cpp]#include 
#include
#include
#include
#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"


using namespace tbb;
using namespace std;


class Factorial {
public:
mpz_class *output;
void operator()( const blocked_range& range ) const {
for( int i=range.begin(); i mpz_fac_ui(output.get_mpz_t(), i);
}
}
};

void ParallelFact(mpz_class *output, size_t n) {
Factorial fact;
fact.output=output;
parallel_for( blocked_range(0, n, 1000 ), fact );
}


int main() {
/* ---------------------- begin parallel test --------------------*/

task_scheduler_init init;

int siz=9000;

mpz_class *result = new mpz_class[siz];

clock_t start_parallel = clock();
ParallelFact(result, siz);
clock_t end_parallel = clock();
double time_elapsed_parallel = double(end_parallel-start_parallel)/CLOCKS_PER_SEC;


/*------------------------- begin serial test -------------------*/

mpz_class *result_serial = new mpz_class[siz];

clock_t start_serial = clock();

for (int i=0; i mpz_fac_ui(result_serial.get_mpz_t(), i);
}

clock_t end_serial = clock();
double time_elapsed_serial = double(end_serial-start_serial)/CLOCKS_PER_SEC;

cout << "time to run in parallel: " << time_elapsed_parallel << "n";
cout << "time to run in serial: " << time_elapsed_serial <<"n";

double speed_increase=time_elapsed_serial/time_elapsed_parallel;
cout << "speed increase: " << speed_increase <<"n";

return 0;
}

[/cpp]

When I run the code using: time ./prog.out

time to run in parallel: 6.16
time to run in serial: 6.09
speed increase: 0.988636

real 0m9.791s
user 0m12.112s
sys 0m0.183s

It looks like the parallel section of the code is taking half as long to execute ("real" time is close to = 6 + 6 / 2), but for some reason the run time for the parallel is still taking 6 seconds. I've tried a huge range of grain sizes, and a large range of numbers to test up to, but I always get similar results. (BTW, I've run the sample codes, like tachyon, and I always get a speedup of 2, so I'm certain it's not the computer). Anyone have any ideas? I appreciate the help.

Best Regards,

Lance

0 Kudos
4 Replies
robert_jay_gould
Beginner
655 Views
Quoting - lance.kimbrough

I wrote a program to calculate every factorial of i from i=0 to n, but the parallel portion of the program runs no faster (sometimes slower), than the serial version. This is still the first program I've written using TBB (and the first time using a template library), so I was hoping someone might see something obvious that would cause this behavior. (I'm using Linux and I've compiled using intel c++ and gcc, with the same results).

[cpp]#include 
#include
#include
#include
#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"


using namespace tbb;
using namespace std;


class Factorial {
public:
mpz_class *output;
void operator()( const blocked_range& range ) const {
for( int i=range.begin(); i mpz_fac_ui(output.get_mpz_t(), i);
}
}
};

void ParallelFact(mpz_class *output, size_t n) {
Factorial fact;
fact.output=output;
parallel_for( blocked_range(0, n, 1000 ), fact );
}


int main() {
/* ---------------------- begin parallel test --------------------*/

task_scheduler_init init;

int siz=9000;

mpz_class *result = new mpz_class[siz];

clock_t start_parallel = clock();
ParallelFact(result, siz);
clock_t end_parallel = clock();
double time_elapsed_parallel = double(end_parallel-start_parallel)/CLOCKS_PER_SEC;


/*------------------------- begin serial test -------------------*/

mpz_class *result_serial = new mpz_class[siz];

clock_t start_serial = clock();

for (int i=0; i mpz_fac_ui(result_serial.get_mpz_t(), i);
}

clock_t end_serial = clock();
double time_elapsed_serial = double(end_serial-start_serial)/CLOCKS_PER_SEC;

cout << "time to run in parallel: " << time_elapsed_parallel << "n";
cout << "time to run in serial: " << time_elapsed_serial <<"n";

double speed_increase=time_elapsed_serial/time_elapsed_parallel;
cout << "speed increase: " << speed_increase <<"n";

return 0;
}

[/cpp]

When I run the code using: time ./prog.out

time to run in parallel: 6.16
time to run in serial: 6.09
speed increase: 0.988636

real 0m9.791s
user 0m12.112s
sys 0m0.183s

It looks like the parallel section of the code is taking half as long to execute ("real" time is close to = 6 + 6 / 2), but for some reason the run time for the parallel is still taking 6 seconds. I've tried a huge range of grain sizes, and a large range of numbers to test up to, but I always get similar results. (BTW, I've run the sample codes, like tachyon, and I always get a speedup of 2, so I'm certain it's not the computer). Anyone have any ideas? I appreciate the help.

Best Regards,

Lance

This is the inherent problem with parallel programming, that is a good lesson to learn early on. Parallel processing has a large overhead, especially for little things, but when things get big parallel is the only scalable way to go.

As an example imagine you need to write a program tocalculatefactorials :) Getting a manager,accountantand a group of 40 programmers is ridiculous. Its better you do this during your lunch break and be over with it.

Now imagine you arewritinga program that needs to preprocess, process and post process thousands of frames of 3-D data to make a movie about Garbage Collecting Robots or something unlikely like that. Then the overhead of setting up a whole team to work the problem is worth it and you'll hopefully be able to get the product out the door within a few years. Now if you did this on your own, well you'd probably die of old age before you have a decent product.

So yes your factorial algorithm won't beat serial processing, I even think there is even an example ofcalculatingFibonaccinumbers in tbb's example and it mentions in the code that this is in reality a bad approach to solving the problem, as a serial implementation is faster, but they included it for educational purposes only.

0 Kudos
lancekimbrough
Beginner
655 Views

Well, I kept having a funny feeling about the "real" time being so much less than the user time, so I got on my 8 core server and tried out the same program. Sure enough, just as I started it, all 8 cores went to 100%, and then nearly instantly went back to 0%. However the time reported from clock() was about 8 times longer than it actually took (according to my watch). So, I searched around a bit and learned about a neat function called "get time of day." I incorporated it into the code, which is now....

[cpp]#include 
#include
#include
#include
#include
#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"


using namespace tbb;
using namespace std;


class factorial {
public:
void operator()( const blocked_range& range ) const {
mpz_class fact_result;

for( int i=range.begin(); i mpz_fac_ui(fact_result.get_mpz_t(), i);
}
}
};


void ParallelFactorial(int n) {
factorial fact;
parallel_for(blocked_range(0, n, 1000 ), fact );
}


int main() {
task_scheduler_init init;

// Max number to apply factorial
int b=30000;

timeval time_tod;
gettimeofday(&time_tod, NULL); // get start time of day
double t1_p=time_tod.tv_sec+(time_tod.tv_usec/1000000.0);
clock_t start = clock(); // get clock start time

ParallelFactorial(b);

gettimeofday(&time_tod, NULL);
double t2_p=time_tod.tv_sec+(time_tod.tv_usec/1000000.0);
clock_t end = clock();

double time_elapsed = double(end-start)/CLOCKS_PER_SEC; // elapsed time clock
double time_elapsed_tod = t2_p-t1_p; // elapsed time tod
cout << "time to run in parallel from clock: " << time_elapsed <<"n";
cout << "time to run in parallel from time of day: " << time_elapsed_tod <<"nn";


/*-----------------------------begin serial test-------------------------------------*/

gettimeofday(&time_tod, NULL);
double t1_s=time_tod.tv_sec+(time_tod.tv_usec/1000000.0);
clock_t start_serial = clock();

mpz_class fact_result_serial;
for (int i=0; i mpz_fac_ui(fact_result_serial.get_mpz_t(), i);
}

gettimeofday(&time_tod, NULL);
double t2_s=time_tod.tv_sec+(time_tod.tv_usec/1000000.0);
clock_t end_serial = clock();

double time_elapsed_serial = double(end_serial-start_serial)/CLOCKS_PER_SEC;
double time_elapsed_tod_serial = t2_s-t1_s;


cout << "time to run in serial: " << time_elapsed_serial <<"n";
cout << "time to run in serial from time of day: " << time_elapsed_tod_serial << "n";

double time_speedup=time_elapsed_serial/time_elapsed;
double time_speedup_tod=time_elapsed_tod_serial/time_elapsed_tod;

cout << "time speedup clock: " << time_speedup <<"n";
cout << "time speedup time of day: " << time_speedup_tod << "n";

return 0;
}[/cpp]

which now returns (on 8 cores)...

time to run in parallel from clock: 140.16
time to run in parallel from time of day: 21.2837

time to run in serial: 136.84
time to run in serial from time of day: 136.854
time speedup clock: 0.976313
time speedup time of day: 6.42998

real 2m38.142s
user 4m36.000s
sys 0m0.928s

So, it looks like it's actually speeding up quite a bit! Note the gettimeofday returns a speedup of 6.4, while the clock speedup returns 0.9763. I'm guessing it has something to do with actual clock cycles. I suppose the gettimeofday is reporting closer to the actual "human time elapsed" that it took to compute, while the clock() is reporting from a "work done" point of view. I'll keeping messing around with the grainsize to see if I can get it any better. Thanks for the input!

Best Regards,

Lance

0 Kudos
lancekimbrough
Beginner
655 Views

This algorithm seems to like very low grainsize values. When I use a grainsize of 10, and b = 20,000, I get a very nice speedup of almost 7.89...

time to run in parallel from clock: 45.15
time to run in parallel from time of day: 5.6811

time to run in serial: 44.83
time to run in serial from time of day: 44.8365

time speedup clock: 0.992913
time speedup time of day: 7.89223

real 0m50.522s
user 1m29.792s
sys 0m0.092s

Very cool! I'm really starting to like TBB! :)

0 Kudos
robert-reed
Valued Contributor II
655 Views
Quoting - lance.kimbrough

This algorithm seems to like very low grainsize values. When I use a grainsize of 10, and b = 20,000, I get a very nice speedup of almost 7.89...

time to run in parallel from clock: 45.15
time to run in parallel from time of day: 5.6811

time to run in serial: 44.83
time to run in serial from time of day: 44.8365

time speedup clock: 0.992913
time speedup time of day: 7.89223

real 0m50.522s
user 1m29.792s
sys 0m0.092s

Very cool! I'm really starting to like TBB! :)

Here's a couple other things to think about. Why is such a low grainsize effective?

[cpp]for(inti=range.begin();i

For each value of i, this code computes the complete factorial value, so as the i-value gets higher, the number of multiplies required also increases. That is, this is an unbalanced workload--it takes a lot more work on the upper end of the range than it does on the lower end. Small grain sizes improve work packing among the threads, thus achieving better loadbalance.

One more: it seems you invested a lot of time in coming up with code to time the computation. Did you ever give tbb::tick_count class a try? I don't know how the results would compare with what you tried, but the implementation could be as simple as:

[cpp] tick_count t0 = tick_count::now();
ParallelFactorial(b);
tick_count t1 = tick_count::now();
cout << "time to run in parallel: " << (t1-t0).seconds() << endl;
[/cpp]
0 Kudos
Reply