Here is my attempt to benchmark the performance of the Intel TBB flow graph. Here is the setup:
- One broadcast node sending continue_msg to N successor nodes (broadcast_node<continue_msg>)
- Each successor node performs a computation that takes t seconds.
- The total computation time when performed serially is Tserial = N * t.
- The ideal computation time if all cores are used is Tpar(ideal) = N * t / C, where C is the number of cores.
- The speed-up is defined as Tserial / Tpar(actual).
- I tested the code with GCC 5 on a 16-core PC.
Here are the results showing the speed-up as a function of the processing time of an individual task (i.e., the body):
t = 100 microseconds, speed-up = 14
t = 10 microseconds, speed-up = 7
t = 1 microsecond, speed-up = 1
As can be seen, for lightweight tasks (whose computation takes less than 1 microsecond), the parallel code is actually slower than the serial code. Here are my questions:
Are these results in line with Intel TBB benchmarks?
Is there a better paradigm than flow graph for the case when there are thousands of tasks, each taking less than 1 microsecond?