
- Intel Community
- Software
- Software Development SDKs and Libraries
- Intel® oneAPI Threading Building Blocks
- parallel_reduce is much slower than the reduction in OpenMP


afd_lml

Beginner


07-04-2010
12:59 AM


parallel_reduce is much slower than the reduction in OpenMP

I want to calculate PI using multi-core parallel algorithms. The following is my code: the first part is written with TBB's parallel_reduce, and the second part with OpenMP's reduction. Although both give the correct answer (about 3.14159), the TBB version is much slower than the OpenMP one. Why?

#include <iostream>
#include <ctime>
#include "tbb/tbb.h"

using namespace std;
using namespace tbb;

const int num_steps = 1000000000;
const double step = 1.0/num_steps;
double pi = 0.0;

class CMyPi
{
public:
    double sum;
    CMyPi() : sum(0.0) {}
    void operator() (const blocked_range<int>& r)
    {
        for (int i = r.begin(); i != r.end(); ++i)
        {
            double x = (i+0.5)*step;
            sum += 4.0/(1.0 + x*x);
        }
    }
    CMyPi(CMyPi& x, split) : sum(0.0) {}
    void join(const CMyPi& y) { sum += y.sum; }
};

int main()
{
    clock_t start, stop;
    CMyPi myPi;

    start = clock();
    parallel_reduce(blocked_range<int>(0, num_steps), myPi);
    pi = step * myPi.sum;
    stop = clock();
    //cout << "The value of PI is " << pi << endl;
    cout << "The time to calculate PI was " << (double)(stop-start)/CLOCKS_PER_SEC << " seconds\n\n";

    start = clock();
    double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < num_steps; i++) {
        double x = (i+0.5)*step;
        sum += 4.0/(1.0 + x*x);
    }
    pi = step*sum;
    stop = clock();
    //cout << "The value of PI is " << pi << endl;
    cout << "The time to calculate PI was " << (double)(stop-start)/CLOCKS_PER_SEC << " seconds\n";

    return 0;
}
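For context, both loops compute the same thing: a midpoint-rule approximation of the integral that equals π,

\[
\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; \sum_{i=0}^{N-1} \frac{4}{1+x_i^2}\,\Delta x,
\qquad x_i = (i+0.5)\,\Delta x,\quad \Delta x = \frac{1}{N},
\]

with N = num_steps, so any timing difference comes from the parallel machinery, not the arithmetic.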

1 Solution

Alexey_K_Intel3

Employee


07-04-2010
01:28 PM


It should be something like this (I assume you have some familiarity with C++0x lambda functions):

[cpp]
start = clock();
pi = parallel_reduce(
    blocked_range<size_t>(0, num_steps),
    double(0),  // identity element for summation
    [&]( const blocked_range<size_t> & r, double current_sum ) -> double {
        for (size_t i=r.begin(); i!=r.end(); ++i) {
            double x = (i+0.5)*step;
            current_sum += 4.0/(1.0 + x*x);
        }
        return current_sum; // body returns updated value of the accumulator
    },
    []( double s1, double s2 ) {
        return s1+s2; // "joins" two accumulated values
    }
);
pi *= step;
stop = clock();
[/cpp]

Note a few things:

- This form of parallel_reduce returns a value.

- The second argument provides parallel_reduce with an identity element to initialize new accumulators. It also defines the type of the accumulators and of the returned value, so it is important to use the proper type here. I remember typing a similar loop during a public demo and making the mistake of using a "plain" 0 for the identity, which of course is an integer while I needed a double.

- The third and fourth arguments of parallel_reduce are lambda functions, but "regular" functors can also be used there.

- The main body functor still takes a blocked_range, but it also takes the current value of an accumulator as its second argument. Note that it should also return a value; this value will be **assigned** to the accumulator, overriding its old value. Thus it is important to add to the given value of the accumulator and return the result.

- Conveniently, the accumulator argument is a local variable friendly to compiler optimizations, so you don't need any special variable to "help" the compiler.

- The fourth argument of parallel_reduce is the functor that combines (reduces) two accumulators; it takes their values and returns the result of the reduction. It serves the same purpose as the join() method in the original form of parallel_reduce, but only does the calculation; "joining" one result into another has become an implementation detail.


6 Replies

renorm

Beginner


07-04-2010
02:57 AM


Alexey_K_Intel3

Employee


07-04-2010
04:07 AM


afd_lml

Beginner


07-04-2010
06:08 AM


However, I really do not know how to use lambda functions in parallel_reduce. Could you or anyone else help me with this issue?


afd_lml

Beginner


07-04-2010
05:37 PM


I know TBB has released many lambda-style functions (and classes), but it seems there are few introductions on how to use them. I suggest the TBB team provide more information or instructions.

ARCH_R_Intel

Employee


07-06-2010
09:23 AM


