Alain_M_ · ‎12-27-2016

Hello,

When I use parallel_reduce on this simple example, VTune Amplifier reports a huge overhead:

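#include <cstdint>
#include <functional>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

// Midpoint-rule estimate of pi: integral of 4 / (1 + x²) over [0, 1]
double ParallelReduceCompute(uint32_t nb_steps)
{
	double step = 1. / nb_steps;

	double pi =
		tbb::parallel_reduce(
			tbb::blocked_range<uint32_t>(0, nb_steps, 1000), // grainsize = 1000
			double(0), // identity element for summation

			// Transformation : f(x) = 4 / (1 + x²)
			// The range is taken by const reference, as the parallel_reduce body requires
			[=](const tbb::blocked_range<uint32_t>& r, double current_sum) -> double
			{
				// Index type matches the range's value type (uint32_t)
				for (uint32_t i = r.begin(); i != r.end(); ++i)
				{
					double x = (i + 0.5)*step;
					current_sum += 4.0 / (1.0 + x*x);
				}

				return current_sum;
			},

			// Reduction : Sum(f(x)dx)
			std::plus<double>()
		);

	pi *= step;

	return pi;
}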

I changed the grainsize and/or the partitioner, but I don't understand why I get such a huge overhead.
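
To get a feel for what the grainsize does, the sketch below models how a blocked_range is split when every divisible range is halved, which is the documented behavior with tbb::simple_partitioner. It is plain C++ and does not call TBB; the function name CountLeafChunks is hypothetical and used only for illustration. With the default auto_partitioner, TBB may stop splitting earlier, so the real number of chunks can be smaller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of TBB: counts the leaf chunks produced when
// [begin, end) is halved recursively until a piece is no larger than grainsize.
// This mirrors blocked_range's rule that a range is divisible only while
// size() > grainsize, and its split point begin + (end - begin) / 2.
std::size_t CountLeafChunks(std::uint32_t begin, std::uint32_t end, std::size_t grainsize)
{
	assert(grainsize > 0 && begin <= end);

	const std::size_t size = end - begin;
	if (size <= grainsize)
		return 1; // not divisible any more: this piece is handed to the body

	const std::uint32_t middle = begin + (end - begin) / 2;
	return CountLeafChunks(begin, middle, grainsize)
	     + CountLeafChunks(middle, end, grainsize);
}
```

Under this model, CountLeafChunks(0, 1000000, 1000) returns 1024: ten levels of halving bring 1,000,000 iterations down to pieces of 976 or 977 iterations each, so every chunk falls between grainsize/2 and grainsize.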

I have an 8-CPU machine, and I get only a 4x speedup.

Thanks in advance for your help !

Alexei_K_Intel · ‎12-27-2016

Hi Alain,

Could you attach the picture from VTune Amplifier showing huge overhead, please? What is the model of your CPU?

Regards, Alex

Alain_M_ · ‎12-27-2016

Hi Alex,

The CPU is a Core i7, code-named Haswell (i7-4800MQ @ 2.70 GHz).

Regards.

Alain.

Alexei_K_Intel · ‎12-27-2016

Hi Alain,

Thank you for the information. VTune Amplifier shows really strange numbers. I will contact the VTune Amplifier team to investigate the issue.

As for your CPU, it has 4 cores with 2 hyper-threads per core (8 hardware threads in total). The algorithm is compute-bound, and I suppose that even one thread can fully utilize a core's FPU, so the second hyper-thread on that core has no opportunity to extract additional performance. Therefore, a 4x speedup seems a very good result.
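
One way to check this supposition is to time the same integration with 1, 2, ... up to the number of hardware threads and see where the speedup levels off. The sketch below is only an illustration: it uses plain std::thread instead of TBB, and the names PiWithThreads and PrintScaling are hypothetical. Timings will vary with the compiler, the optimization flags, and the machine.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// Midpoint-rule estimate of pi, with the iterations split into one contiguous
// slice per std::thread. Each thread writes only its own slot in `partial`.
double PiWithThreads(std::uint32_t nb_steps, unsigned nb_threads)
{
	assert(nb_steps > 0 && nb_threads > 0);

	const double step = 1.0 / nb_steps;
	std::vector<double> partial(nb_threads, 0.0);
	std::vector<std::thread> workers;
	workers.reserve(nb_threads);

	for (unsigned t = 0; t < nb_threads; ++t)
	{
		workers.emplace_back([&partial, t, nb_threads, nb_steps, step]
		{
			// 64-bit arithmetic avoids overflow when computing slice bounds
			const std::uint64_t begin = std::uint64_t(nb_steps) * t / nb_threads;
			const std::uint64_t end = std::uint64_t(nb_steps) * (t + 1) / nb_threads;

			double sum = 0.0;
			for (std::uint64_t i = begin; i != end; ++i)
			{
				const double x = (i + 0.5) * step;
				sum += 4.0 / (1.0 + x * x);
			}
			partial[t] = sum;
		});
	}

	double total = 0.0;
	for (unsigned t = 0; t < nb_threads; ++t)
	{
		workers[t].join();
		total += partial[t];
	}
	return total * step;
}

// Prints the speedup relative to the single-thread run for each thread count.
void PrintScaling(std::uint32_t nb_steps)
{
	unsigned max_threads = std::thread::hardware_concurrency();
	if (max_threads == 0)
		max_threads = 1; // the value is only a hint and may be 0

	double serial_seconds = 0.0;
	for (unsigned n = 1; n <= max_threads; ++n)
	{
		const auto start = std::chrono::steady_clock::now();
		const double pi = PiWithThreads(nb_steps, n);
		const std::chrono::duration<double> elapsed =
			std::chrono::steady_clock::now() - start;

		if (n == 1)
			serial_seconds = elapsed.count();
		std::printf("%u thread(s): pi = %.12f, speedup = %.2fx\n",
		            n, pi, serial_seconds / elapsed.count());
	}
}
```

Calling PrintScaling(100000000) from main prints one line per thread count. If the speedup stops growing at about the number of physical cores, that would be consistent with the explanation above; if it keeps growing up to the number of hardware threads, a single thread is probably not saturating the core.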

Regards, Alex

Alain_M_ · ‎12-27-2016

Hi Alex,

Thanks for your answer !

Two more pieces of information: I get an 8x speedup when I switch to the Intel compiler, and I just checked that this overhead problem only occurs with an x86 target, not with x64. With an x64 target, VTune gives me perfect results with no overhead at all.

Regards, Alain.

Large overhead in TBB parallel_reduce algorithm