Intel® oneAPI Threading Building Blocks

Large overhead in TBB parallel_reduce algorithm

Alain_M_
Beginner

Hello,

When I use parallel_reduce on this simple example, I observe a huge overhead in VTune Amplifier:

#include <cstdint>
#include <functional>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

// Midpoint-rule estimate of pi: integral of 4 / (1 + x²) over [0, 1]
double ParallelReduceCompute(uint32_t nb_steps)
{
	double step = 1. / nb_steps;

	double pi =
		tbb::parallel_reduce(
			tbb::blocked_range<uint32_t>(0, nb_steps, 1000), // grainsize = 1000
			double(0), // identity element for summation

			// Transformation : f(x) = 4 / (1 + x²)
			[=](const tbb::blocked_range<uint32_t>& r, double current_sum) -> double
			{
				// Index type matches the range's value type
				for (uint32_t i = r.begin(); i != r.end(); ++i)
				{
					double x = (i + 0.5)*step;
					current_sum += 4.0 / (1.0 + x*x);
				}

				return current_sum;
			},

			// Reduction : Sum(f(x)dx)
			std::plus<double>()
		);

	pi *= step;

	return pi;
}

I tried changing the grainsize and/or the partitioner, but I don't understand why I get such a huge overhead.

I have an 8-CPU machine, and I get only a 4x speedup.

Thanks in advance for your help!

4 Replies
Alexei_K_Intel
Employee

Hi Alain,

Could you please attach the screenshot from VTune Amplifier showing the huge overhead? What is the model of your CPU?

Regards, Alex

Alain_M_
Beginner

Hi Alex,

The CPU is a Core i7, code-named Haswell (i7-4800MQ @ 2.70 GHz).

Regards.

Alain.

Alexei_K_Intel
Employee

Hi Alain,

Thank you for the information. VTune Amplifier shows really strange numbers. I will contact the VTune Amplifier team to investigate the issue.

As for your CPU, it has 4 cores with 2 hyper-threads per core (8 hardware threads in total). The algorithm is compute-bound, and I suppose a single thread can already fully utilize a core's FPU, leaving no opportunity for the second hyper-thread to extract additional performance. Therefore, a 4x speedup seems to be a very good result.

Regards, Alex

Alain_M_
Beginner

Hi Alex,

Thanks for your answer!

Two more pieces of information: I get an 8x speedup when I switch to the Intel compiler, and I just checked that this overhead problem only occurs with an x86 target, not with x64. With an x64 target, VTune gives me perfect results with no overhead at all.

Regards, Alain.
