Most likely, the compiler is able to vectorize the inner (serial) loop in the first case, but cannot vectorize its equivalent inside the TBB internals in the second case. Something for us TBB developers to look at.
Yes, probably. It turns out that the performance difference only shows up when compiling with -O3. With -O2, the performance is the same for both loop versions (in fact, the performance of the "compact" version does not improve at all when going from -O2 to -O3). Good to know that you are looking at it.
You could confirm Alexey's very plausible analysis, and get intermediate relief, by going into include/tbb/parallel_for.h and hoisting r.end() out of the loop in parallel_for_body::operator(), i.e., assign its value to a variable i_end and evaluate i
(Added) Hmm, better also make a local copy of my_step...
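To illustrate the hoisting suggested above, here is a minimal, self-contained sketch. It is not the code from include/tbb/parallel_for.h: `Range`, `StridedBody`, `run_plain`, `run_hoisted` and `scale` are hypothetical names made up for this example, and only `i_end`, `my_step` and `r.end()` come from the posts above. The idea is that copying the loop bound and the step into locals makes it obvious to the compiler that they do not change inside the loop, which may help it vectorize. Whether it actually does depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a range object; NOT the TBB blocked_range class.
struct Range {
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Hypothetical body object, loosely modeled on the description in this thread.
template <typename Func>
struct StridedBody {
    Func my_func;
    long my_begin;
    long my_step;

    // Plain shape: r.end() and my_step are re-read on every iteration.
    void run_plain(const Range& r) const {
        for (std::size_t i = r.begin(); i < r.end(); ++i)
            my_func(my_begin + static_cast<long>(i) * my_step);
    }

    // Hoisted shape: the loop bound and the step are copied into locals first,
    // and the user index k is advanced by addition instead of multiplication.
    void run_hoisted(const Range& r) const {
        const std::size_t i_end = r.end();  // hoisted loop bound
        const long step = my_step;          // local copy of the step
        long k = my_begin + static_cast<long>(r.begin()) * step;
        for (std::size_t i = r.begin(); i < i_end; ++i, k += step)
            my_func(k);
    }
};

// Small driver: writes 2*k into out[k] for k = 0, 2, ..., 14.
inline std::vector<double> scale(bool hoisted) {
    std::vector<double> out(16, 0.0);
    auto f = [&out](long k) {
        assert(k >= 0 && static_cast<std::size_t>(k) < out.size());
        out[static_cast<std::size_t>(k)] = 2.0 * static_cast<double>(k);
    };
    StridedBody<decltype(f)> body{f, 0, 2};
    Range r{0, 8};
    if (hoisted) body.run_hoisted(r); else body.run_plain(r);
    return out;
}
```

Both variants visit the same indices, so `scale(true)` and `scale(false)` return identical vectors. Any difference would only show up in the generated code and the timing, for example when comparing -O2 and -O3 builds.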