#1 "If you use tbb::parallel_for( ... ), with a reasonably big
grainsize (100000?), you should be able to see a significant speed-up
compared to the STL algorithm."
Is that an empirical result? The number usually tossed around is about 10000 instructions, with grainsize a fraction of that.
The manual says that the best performance is achieved with a grainsize between 10k and 100k. In my experience, which end of that range works best depends on how complex the "kernel" is (the computation per sample). Here the kernel is a simple addition, which is trivial work for any modern processor, so I would definitely opt for the upper end of the range. However, I usually just use the auto_partitioner anyway.