Hello,
When I use parallel_reduce on this simple example, I observe a huge overhead in VTune Amplifier:
    #include <cstdint>
    #include <functional>
    #include <tbb/blocked_range.h>
    #include <tbb/parallel_reduce.h>

    double ParallelReduceCompute(uint32_t nb_steps)
    {
        double step = 1. / nb_steps;
        double pi = tbb::parallel_reduce(
            tbb::blocked_range<uint32_t>(0, nb_steps, 1000),
            double(0), // identity element for summation
            // Transformation : f(x) = 4 / (1 + x²)
            // The range is taken by const reference, as parallel_reduce expects.
            [=](const tbb::blocked_range<uint32_t>& r, double current_sum) -> double
            {
                for (uint32_t i = r.begin(); i != r.end(); ++i)
                {
                    double x = (i + 0.5) * step;
                    current_sum += 4.0 / (1.0 + x * x);
                }
                return current_sum;
            },
            // Reduction : Sum(f(x)dx)
            std::plus<double>()
        );
        pi *= step;
        return pi;
    }
I changed the grainsize and/or the partitioner, but I don't understand why I get such a huge overhead.
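To reason about how the grainsize relates to scheduling overhead, it can help to estimate how many sub-ranges the work is cut into. The sketch below is not TBB code: it only mimics the way a `blocked_range` is halved while its size exceeds the grainsize, which is what happens when every divisible range is split (the `simple_partitioner` case). The function name `CountLeafChunks` is hypothetical, and the default `auto_partitioner` usually creates fewer chunks than this upper bound.

```cpp
#include <cstdint>

// Hypothetical helper, not part of TBB: counts the leaf sub-ranges obtained
// when a range of `size` iterations is halved recursively until a piece is
// no longer larger than `grainsize` (assumption: every divisible range is
// split, as with simple_partitioner).
uint64_t CountLeafChunks(uint64_t size, uint64_t grainsize) {
    if (grainsize == 0) grainsize = 1;     // a grainsize must be positive
    if (size <= grainsize) return 1;       // not divisible any more
    const uint64_t left = size / 2;        // split at the middle
    return CountLeafChunks(left, grainsize) +
           CountLeafChunks(size - left, grainsize);
}
```

For example, 8000 iterations with a grainsize of 1000 give 8 chunks of 1000 iterations each. If each chunk does only a few microseconds of work, the per-chunk scheduling cost becomes visible in a profile, so a larger grainsize is worth trying.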
I have an 8-CPU machine, and I get only a 4x speedup.
Thanks in advance for your help !
Hi Alain,
Could you attach the picture from VTune Amplifier showing huge overhead, please? What is the model of your CPU?
Regards, Alex
Hi Alex,
The CPU is a Core i7, code-named Haswell (i7-4800MQ @ 2.70 GHz).
Regards.
Alain.
Hi Alain,
Thank you for the information. VTune Amplifier shows really strange numbers. I will contact the VTune Amplifier team to investigate the issue.
As for your CPU, it has 4 cores with 2 hyper-threads per core (8 hardware threads in total). The algorithm is compute-bound, and I suppose a single thread can already keep a core's FPU fully busy, so the second hyper-thread on that core has no spare resources from which to extract additional performance. Therefore, a 4x speedup seems a very good result.
Regards, Alex
Hi Alex,
Thanks for your answer!
Two more pieces of information: I get an 8x speedup when I switch to the Intel compiler, and I just checked that this overhead problem only occurs with an x86 target, not with x64. With an x64 target, VTune gives me perfect results with no overhead at all.
Regards, Alain.
