Intel TBB flow graph speedup


Here is my attempt to benchmark the performance of the Intel TBB flow graph. The setup is as follows:

- One broadcast node sending continue_msg to N successor nodes (broadcast_node<continue_msg>)

- Each successor node performs a computation that takes t seconds.

- The total computation time when performed serially is Tserial = N * t

- The ideal computation time if all cores are used is Tpar(ideal) = N * t / C, where C is the number of cores.

- The speedup is defined as Tserial / Tpar(actual)

- I tested the code with GCC 5 on a 16-core PC.

Here are the results showing the speedup as a function of the processing time of each individual task (i.e. the node body):

t = 100 microseconds, speedup = 14

t = 10 microseconds, speedup = 7

t = 1 microsecond, speedup = 1
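
To relate these figures to the ideal case defined above, the small sketch below spells out the formulas. The function names are hypothetical, and `efficiency` (speedup divided by the number of cores C) is an extra quantity I am assuming is a reasonable way to compare against the ideal speedup of C. It is not something TBB reports.

```cpp
// Tpar(ideal) = N * t / C
constexpr double ideal_parallel_time(double n, double t, double cores) {
    return n * t / cores;
}

// Speedup = Tserial / Tpar(actual)
constexpr double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}

// Fraction of the ideal speedup (C) that was actually achieved.
constexpr double efficiency(double measured_speedup, double cores) {
    return measured_speedup / cores;
}
```

With C = 16, the measured speedups of 14, 7 and 1 correspond to efficiencies of 0.875, 0.4375 and 0.0625 respectively.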

As can be seen, for lightweight tasks (whose computation takes less than 1 microsecond), the parallel code is actually slower than the serial code. Here are my questions:

  1. Are these results in line with Intel TBB benchmarks?

  2. Is there a better paradigm than flow graph for the case when there are thousands of tasks, each taking less than 1 microsecond?

