Intel® oneAPI Threading Building Blocks

parallel_reduce is much slower than the reduction in OpenMP

afd_lml
Beginner
902 Views
Hi, everyone.

I want to calculate PI using multi-core parallel algorithms. The following is my code: the first part is written with TBB's parallel_reduce, and the second part with OpenMP's reduction. Although both give the correct answer, 3.141593, I found that the reduction is much faster than parallel_reduce. For example, on my Intel Core i7-930 PC (with 4 cores), TBB's parallel_reduce needs 1.9 seconds, while OpenMP's reduction requires 0.98 seconds. I cannot understand this problem. Would anyone like to give some advice?

[cpp]#include <iostream>
#include <ctime>
#include "tbb/tbb.h"
using namespace std;
using namespace tbb;

const int num_steps = 1000000000;
const double step = 1.0 / num_steps;
double pi = 0.0;

class CMyPi
{
public:
    double sum;

    CMyPi() : sum(0.0) {}

    void operator() (const blocked_range<int>& r)
    {
        for (int i = r.begin(); i != r.end(); ++i)
        {
            double x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }
    }

    CMyPi(CMyPi& x, split) : sum(0.0) {}
    void join(const CMyPi& y) { sum += y.sum; }
};

int main()
{
    clock_t start, stop;

    CMyPi myPi;
    start = clock();
    parallel_reduce(blocked_range<int>(0, num_steps), myPi);
    pi = step * myPi.sum;
    stop = clock();
    //cout << "The value of PI is " << pi << endl;
    cout << "The time to calculate PI was " << (double)(stop - start) / CLOCKS_PER_SEC << " seconds\n\n";

    start = clock();
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < num_steps; i++)
    {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
    stop = clock();

    //cout << "The value of PI is " << pi << endl;
    cout << "The time to calculate PI was " << (double)(stop - start) / CLOCKS_PER_SEC << " seconds\n";

    return 0;
}
[/cpp]

1 Solution
6 Replies
renorm
Beginner
902 Views
On my dual-core platform, OpenMP was slightly slower than TBB. Try using a large grain size with a simple_partitioner.
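
For example, a minimal sketch reusing the CMyPi body from the original post (the grain size of 100000 is only an illustration; tune it for your machine):

[cpp]// An explicit grain size plus simple_partitioner makes TBB split the
// work into fixed-size chunks instead of letting it auto-partition.
CMyPi myPi;
parallel_reduce(blocked_range<int>(0, num_steps, 100000), myPi, simple_partitioner());
pi = step * myPi.sum;
[/cpp]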
Alexey-Kukanov
Employee
902 Views
The problem is most likely due to vectorization not being done by the compiler. Read http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/ for details, and change your code to help the compiler. You may also try out lambda functions and the corresponding version of parallel_reduce.
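
Concretely, that means accumulating into a local variable inside operator() and storing it back to the member once per chunk. A minimal sketch of the change to the CMyPi body (the name local_sum is illustrative):

[cpp]void operator() (const blocked_range<int>& r)
{
    double local_sum = sum; // local accumulator: the compiler can keep it
                            // in a register and vectorize the loop
    for (int i = r.begin(); i != r.end(); ++i)
    {
        double x = (i + 0.5) * step;
        local_sum += 4.0 / (1.0 + x * x);
    }
    sum = local_sum;        // write back to the member once per chunk
}
[/cpp]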
afd_lml
Beginner
902 Views
Thank you for your reply. Your post http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/ is very useful. The problem in my code was caused by not introducing a local variable to help the compiler.

However, I really do not know how to use lambda functions with parallel_reduce. Would you or anyone else like to help me on this issue?
Alexey-Kukanov
Employee
903 Views

It should be something like this (I assume you have some familiarity with C++0x lambda functions):

[cpp]start = clock();
pi = parallel_reduce(
         blocked_range<size_t>(0, num_steps),
         double(0), // identity element for summation
         [&]( const blocked_range<size_t>& r, double current_sum ) -> double {
             for (size_t i = r.begin(); i != r.end(); ++i) {
                 double x = (i + 0.5) * step;
                 current_sum += 4.0 / (1.0 + x * x);
             }
             return current_sum; // body returns the updated value of the accumulator
         },
         []( double s1, double s2 ) {
             return s1 + s2;     // "joins" two accumulated values
         }
);
pi *= step;
stop = clock();
[/cpp]

Note a few things:
- This form of parallel_reduce returns a value.
- The second argument provides parallel_reduce with an identity element used to initialize new accumulators. It also defines the type of the accumulators and of the returned value, so it is important to use the proper type here. I remember typing a similar loop during a public demo and making the mistake of using a plain 0 for the identity, which of course is an integer while I needed a double (see the sketch after this list).
- The third and fourth arguments of parallel_reduce are lambda functions here, but "regular" functors can be used as well.
- The main body functor still takes a blocked_range, but it also takes the current value of an accumulator as its second argument. Note that it should also return a value; this value will be assigned to the accumulator, overriding its old value. Thus it is important to add to the given value of the accumulator and return the result.
- Conveniently, the accumulator argument is a local variable friendly to compiler optimizations, so you don't need any special variable to "help" the compiler.
- The fourth argument of parallel_reduce is the functor to combine (reduce) two accumulators; it takes their values and returns the result of the reduction. It serves the same purpose as the join() method in the original form of parallel_reduce, but only does the calculation; "joining" one result into another has become an implementation detail.
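
To illustrate the identity-type pitfall from the second note, here is a hypothetical sketch of that mistake (it compiles, but produces a wrong result; shown only for contrast):

[cpp]// Hypothetical mistake: a plain 0 makes the deduced accumulator type
// int, so the double returned by the body is truncated to an integer
// every time it is stored back.
pi = parallel_reduce(
         blocked_range<size_t>(0, num_steps),
         0, // wrong: should be double(0)
         [&]( const blocked_range<size_t>& r, double current_sum ) -> double {
             for (size_t i = r.begin(); i != r.end(); ++i) {
                 double x = (i + 0.5) * step;
                 current_sum += 4.0 / (1.0 + x * x);
             }
             return current_sum;
         },
         []( double s1, double s2 ) { return s1 + s2; }
);
[/cpp]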

afd_lml
Beginner
902 Views
Thank you very much for your help !

I know TBB has released many lambda-style functions (classes), but it seems there are few introductions on how to use these new lambda-style functions. I suggest the TBB team provide more information or instructions.
ARCH_R_Intel
Employee
902 Views
One other handy hint: for scalar reductions, often the fourth argument can be a functor from the standard header <functional>. For example, the fourth argument in Alexey's example can be std::plus<double>().
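
For reference, a minimal sketch of Alexey's call with that substitution (same surrounding code as his example):

[cpp]#include <functional> // for std::plus

pi = parallel_reduce(
         blocked_range<size_t>(0, num_steps),
         double(0),
         [&]( const blocked_range<size_t>& r, double current_sum ) -> double {
             for (size_t i = r.begin(); i != r.end(); ++i) {
                 double x = (i + 0.5) * step;
                 current_sum += 4.0 / (1.0 + x * x);
             }
             return current_sum;
         },
         std::plus<double>() // standard functor instead of a hand-written lambda
);
pi *= step;
[/cpp]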