Alain_M_

Beginner

12-27-2016
01:43 AM

Large overhead in TBB parallel_reduce algorithm

Hello,

When I use parallel_reduce on this simple example, I observe a huge overhead in VTune Amplifier:

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <tbb/blocked_range.h>
    #include <tbb/parallel_reduce.h>

    double ParallelReduceCompute(uint32_t nb_steps)
    {
        double step = 1. / nb_steps;
        double pi = tbb::parallel_reduce(
            tbb::blocked_range<uint32_t>(0, nb_steps, 1000),
            double(0), // identity element for summation
            // Transformation : f(x) = 4 / (1 + x²)
            [=](const tbb::blocked_range<uint32_t>& r, double current_sum) -> double {
                for (size_t i = r.begin(); i != r.end(); ++i) {
                    double x = (i + 0.5) * step;
                    current_sum += 4.0 / (1.0 + x * x);
                }
                return current_sum;
            },
            // Reduction : Sum(f(x)dx)
            std::plus<double>()
        );
        pi *= step;
        return pi;
    }

I changed the grainsize and/or the partitioner, but I don't understand why I get such a huge overhead.

I have an 8-CPU machine, and I get only a 4x speedup.

Thanks in advance for your help!
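To reason about how much splitting a grainsize of 1000 implies, here is a minimal serial sketch of the split/accumulate/join structure of a reduction. It is not TBB's implementation or scheduler: `ReduceResult`, `ModelReduce`, and `ModelCompute` are hypothetical names, and the model assumes a range is halved whenever it holds more than `grainsize` iterations, which roughly corresponds to a simple partitioner. The default `auto_partitioner` may stop splitting earlier and produce fewer, larger chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type for this sketch (not part of TBB).
struct ReduceResult {
    double sum;          // combined partial sums
    std::size_t leaves;  // sub-ranges processed without further splitting
};

// Serial model of a recursive reduction over [begin, end).
// Assumption: a range is split in half while it holds more than
// `grainsize` iterations.
ReduceResult ModelReduce(std::uint32_t begin, std::uint32_t end,
                         std::uint32_t grainsize, double step) {
    assert(grainsize > 0 && begin <= end);
    if (end - begin <= grainsize) {
        // Leaf: same loop body as the lambda in the question.
        double local = 0.0;
        for (std::uint32_t i = begin; i != end; ++i) {
            const double x = (i + 0.5) * step;
            local += 4.0 / (1.0 + x * x);
        }
        return {local, 1};
    }
    const std::uint32_t middle = begin + (end - begin) / 2;
    const ReduceResult left = ModelReduce(begin, middle, grainsize, step);
    const ReduceResult right = ModelReduce(middle, end, grainsize, step);
    // Join step: plays the role of std::plus<double>() in the original call.
    return {left.sum + right.sum, left.leaves + right.leaves};
}

// Midpoint-rule estimate of pi; optionally reports the number of leaves.
double ModelCompute(std::uint32_t nb_steps, std::uint32_t grainsize,
                    std::size_t* leaves_out = nullptr) {
    assert(nb_steps > 0);
    const double step = 1.0 / nb_steps;
    const ReduceResult r = ModelReduce(0, nb_steps, grainsize, step);
    if (leaves_out) *leaves_out = r.leaves;
    return r.sum * step;
}
```

Calling `ModelCompute(nb_steps, 1000, &leaves)` returns the same midpoint-rule estimate as the serial loop and reports how many leaf sub-ranges the model produced, which gives a rough upper bound on the number of chunks a run with that grainsize would have to schedule.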

4 Replies

Alexei_K_Intel

Employee

12-27-2016
02:42 AM

Hi Alain,

Could you please attach the screenshot from VTune Amplifier showing the huge overhead? What is the model of your CPU?

Regards, Alex

Alain_M_

Beginner

12-27-2016
02:53 AM

Alexei_K_Intel

Employee

12-27-2016
03:17 AM

Hi Alain,

Thank you for the information. VTune Amplifier shows really strange numbers. I will contact the VTune Amplifier team to investigate the issue.

As for your CPU, it has 4 cores with 2 hyper-threads per core (8 threads total). The algorithm is compute-bound, and I suppose even one thread can fully utilize a core's FPU, so there is no opportunity for the second hyper-thread to extract additional performance. Therefore, a 4x speedup seems a very good result.
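The arithmetic behind that judgement can be sketched as parallel efficiency, i.e. measured speedup divided by the number of workers. The helper names below are hypothetical and only illustrate the reasoning: relative to the 4 physical cores, a 4x speedup is 100% efficiency, while relative to the 8 logical threads it is 50%.

```cpp
#include <cassert>
#include <thread>

// Parallel efficiency: measured speedup divided by the number of workers.
double ParallelEfficiency(double speedup, unsigned workers) {
    assert(workers > 0);
    return speedup / workers;
}

// Number of hardware threads reported by the standard library. On a
// hyper-threaded CPU this usually counts logical threads, not physical
// cores. The value is only a hint and may be 0 if it cannot be determined.
unsigned LogicalThreads() {
    return std::thread::hardware_concurrency();
}
```

For the numbers in this thread, `ParallelEfficiency(4.0, 4)` gives 1.0 and `ParallelEfficiency(4.0, 8)` gives 0.5, so which baseline is used decides whether the result looks ideal or poor.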

Regards, Alex

Alain_M_

Beginner

12-27-2016
04:29 AM

Hi Alex,

Thanks for your answer!

Two more pieces of information: I got an 8x speedup when I switched to the Intel compiler, and I just checked that this overhead problem only occurs with an x86 target, not with x64. With an x64 target, VTune gives me perfect results with no overhead at all.

Regards, Alain.
