Intel TBB flow graph speedup

mmasdfasdf · ‎12-30-2017

Here is my attempt to benchmark the performance of the Intel TBB flow graph. The setup is as follows:

- One broadcast node sending continue_msg to N successor nodes (broadcast_node<continue_msg>)

- Each successor node performs a computation that takes t seconds.

- The total computation time when performed serially is Tserial = N * t

- The ideal computation time if all cores are used is Tpar(ideal) = N * t / C, where C is the number of cores.

- The speedup is defined as Tserial / Tpar(actual)

- I tested the code with GCC 5 on a 16-core PC.
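The definitions in this setup can be written out as a minimal C++ sketch. It takes speedup as serial time divided by parallel time, the convention under which a value above 1 means the parallel run is faster. The function names and the efficiency helper (speedup divided by core count) are illustrative assumptions and are not part of the benchmark code.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// Serial time for n_tasks tasks that each take t_sec seconds: Tserial = N * t.
double serial_time(double n_tasks, double t_sec) {
    return n_tasks * t_sec;
}

// Ideal parallel time on `cores` cores: Tpar(ideal) = N * t / C.
double ideal_parallel_time(double n_tasks, double t_sec, double cores) {
    return serial_time(n_tasks, t_sec) / cores;
}

// Speedup = Tserial / Tpar(actual); values above 1 mean the parallel run is faster.
double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}

// Parallel efficiency (hypothetical helper): fraction of the ideal speedup achieved.
double efficiency(double measured_speedup, double cores) {
    return measured_speedup / cores;
}

// Prints the ideal speedup for an assumed task count; N = 1000 is a placeholder.
void print_ideal_case() {
    const double n = 1000.0, t = 100e-6, cores = 16.0;
    const double s = speedup(serial_time(n, t), ideal_parallel_time(n, t, cores));
    std::printf("ideal speedup on %.0f cores: %.1f\n", cores, s);
}
```

Under these definitions the ideal speedup equals the core count, so a measured speedup of 14 on 16 cores corresponds to an efficiency of 14 / 16 = 0.875.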

Here are the results, showing the speedup as a function of the processing time of each individual task (i.e. the node body):

t = 100 microseconds, speedup = 14

t = 10 microseconds, speedup = 7

t = 1 microsecond, speedup = 1
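One rough way to read these numbers is to back out an implied per-task overhead. The sketch below assumes a simple model that is not part of the benchmark: each task costs t + o, where o is a fixed scheduling overhead, and the work is spread perfectly over C cores, so Tpar = N * (t + o) / C and therefore o = t * (C / speedup - 1). The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// ASSUMED model (not from the benchmark): Tpar = N * (t + o) / C.
// With speedup = Tserial / Tpar = C * t / (t + o), solving for o gives
// o = t * (C / speedup - 1). Result is in the same unit as t_us.
double implied_overhead_us(double t_us, double measured_speedup, double cores) {
    return t_us * (cores / measured_speedup - 1.0);
}

// Applies the model to the three measurements reported above (16 cores).
void print_implied_overheads() {
    const double cores = 16.0;
    const double t_us[] = {100.0, 10.0, 1.0};
    const double measured[] = {14.0, 7.0, 1.0};
    for (int i = 0; i < 3; ++i) {
        std::printf("t = %5.0f us -> implied overhead ~ %.1f us per task\n",
                    t_us[i], implied_overhead_us(t_us[i], measured[i], cores));
    }
}
```

Under this assumed model, all three measurements imply an overhead of roughly 13 to 15 microseconds per task, which would be consistent with a fixed per-task cost dominating once t drops to about 1 microsecond. The model ignores load imbalance and contention, so this is only an order-of-magnitude estimate.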

As can be seen, for lightweight tasks (whose computation takes less than 1 microsecond), the parallel code is actually slower than the serial code. Here are my questions:

Are these results in line with Intel TBB benchmarks?
Is there a better paradigm than the flow graph for the case where there are thousands of tasks, each taking less than 1 microsecond?