I had another look, and hopefully my understanding is a bit better now.
First, let's agree to call (only) a tbb::task a task (well, also a sequentially executed part of the total work not involving TBB), and not use the word "job" anymore (I believe you were using "job" for "task", which is confusing). To keep it simple, task overhead should be measured for a large enough total amount of work, to amortise startup (where workers take a while to get going) and imperfect parallel slack (where workers don't all finish at exactly the same time). Here it is necessary to distinguish between individual task overhead and aggregate task overhead (which includes that of all the task's descendants).
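To make this concrete, here is a minimal sketch of the kind of measurement I mean, assuming a placeholder per-task payload `work()` and using tbb::parallel_for() as just one possible way to generate the tasks; the overhead formula is deliberately rough (it pretends all threads are busy for the whole parallel run):

```cpp
#include <chrono>
#include <cstdio>
#include <thread>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

void work() { /* placeholder per-item payload */ }

int main() {
    const int N = 100000;  // large enough to amortise startup and slack
    const unsigned P = std::thread::hardware_concurrency();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) work();  // serial reference run
    auto t1 = std::chrono::steady_clock::now();
    tbb::parallel_for(tbb::blocked_range<int>(0, N),
                      [](const tbb::blocked_range<int>& r) {
                          for (int i = r.begin(); i != r.end(); ++i) work();
                      });
    auto t2 = std::chrono::steady_clock::now();

    const double serial   = std::chrono::duration<double>(t1 - t0).count();
    const double parallel = std::chrono::duration<double>(t2 - t1).count();
    // Rough aggregate task overhead, amortised per work item; note that
    // tbb::parallel_for decides how items are grouped into actual tasks.
    std::printf("aggregate overhead per item ~ %g s\n",
                (P * parallel - serial) / N);
}
```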
Tasks can be spawned more efficiently by using task_list, but this is not scalable (one thread does all the dividing, and only one task is stolen at a time); while this approach may be associated with smaller tasks, the correlation is far from obvious. The recommendation is for recursive parallelism, where the division of work is itself parallelised and therefore scalable (for both dividing and executing), even if this means that the first divisions have a high individual task overhead related to stealing, because that overhead is expected to be amortised down to a very small part of the aggregate task overhead.
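For reference, here is roughly what I mean by the two spawning schemes, as a sketch only: the task_list variant uses the classic TBB task API (since deprecated in oneTBB), and `work()` is a placeholder payload.

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/task.h>

void work(int i);  // placeholder per-task payload, defined elsewhere

// Flat spawning via task_list: one thread creates every task,
// and thieves can only take one task at a time.
struct LeafTask : tbb::task {
    int i;
    explicit LeafTask(int i_) : i(i_) {}
    tbb::task* execute() override { work(i); return nullptr; }
};

void spawn_flat(int n) {
    tbb::task_list list;
    for (int i = 0; i < n; ++i)
        list.push_back(*new (tbb::task::allocate_root()) LeafTask(i));
    tbb::task::spawn_root_and_wait(list);  // the division is sequential
}

// Recursive parallelism: the division of work is itself parallelised,
// so a thief that steals a subrange keeps dividing it locally, and the
// stealing cost is amortised over many descendants.
void spawn_recursive(int n) {
    tbb::parallel_for(tbb::blocked_range<int>(0, n),
                      [](const tbb::blocked_range<int>& r) {
                          for (int i = r.begin(); i != r.end(); ++i) work(i);
                      });
}
```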
Now, would it be possible to repeat the experiment with more than just one task per available hardware thread? Instead of 2 tasks, you should have maybe 100 tasks or more, for meaningful parallel slack.
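Something like this, just as a sketch with tbb::task_group (the count and `work()` are placeholders):

```cpp
#include <tbb/task_group.h>

void work(int i);  // the timed per-task body from your experiment

// Spawn many more tasks than there are hardware threads, so that there
// is meaningful parallel slack and uneven finishing times get amortised.
void run_many(int n_tasks) {  // e.g. n_tasks = 100 or more
    tbb::task_group g;
    for (int i = 0; i < n_tasks; ++i)
        g.run([i] { work(i); });
    g.wait();
}
```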
My prediction, conditional on having understood things correctly, is the following. So far you have really observed evidence that it takes a while for another worker to steal a spawned task, and a small task may be executed locally before it is stolen, with a likelihood related to how quickly the first task is executed, i.e., to how small the tasks are. Since stealing, idling owing to a dearth of parallel slack, and waiting all incur a significant penalty (and with only one level of tasks, individual and aggregate task overhead are the same), this overhead will gradually increase on average with increasing task size (a decreasing percentage of locally executed second tasks), until it reaches a plateau (the second task is nearly always stolen). You can probably diagnose whether the second task was in fact stolen, and incorporate this number into the statistics (see the sketch below). If you increase the number of tasks, the percentage of tasks that were not stolen will decrease compared to a lower number of tasks of equal size, and the plateau will be reached earlier relative to task size. If you use recursive parallelism to create the tasks, e.g., with tbb::parallel_for(), the plateau will be much lower (better scalability), but I'll abstain from predicting how quickly it is reached in this case.
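As for diagnosing theft, a crude but simple way, assuming each task can afford to record the id of the thread that runs it, is to compare against the spawning thread's id:

```cpp
#include <atomic>
#include <thread>
#include <tbb/task_group.h>

void work(int i);  // placeholder per-task payload, defined elsewhere

// Counts how many of n_tasks ran on a thread other than the spawner,
// i.e., were (very likely) stolen instead of being executed locally.
int count_stolen(int n_tasks) {
    std::atomic<int> stolen{0};
    const std::thread::id spawner = std::this_thread::get_id();
    tbb::task_group g;
    for (int i = 0; i < n_tasks; ++i)
        g.run([i, spawner, &stolen] {
            if (std::this_thread::get_id() != spawner)
                ++stolen;
            work(i);
        });
    g.wait();  // the spawner may itself execute queued tasks here
    return stolen.load();
}
```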
Hopefully this is more substantial than my knee-jerk speculation in #1. Ideally you would modify your experiment to prove or disprove the hypothesis, but of course I would also be happy if you would simply believe me. :-)