The scheduler currently has an "all or nothing" mechanism for gating its worker threads. If any work is spawned, then all the workers are turned loose and start looking for work. Those that cannot find work periodically yield. So if you are turning loose just a few tasks, all the worker threads will appear busy.
The scheduler is designed this way because in the Cilk-like paradigm, a program is expected to generate many more tasks than there are threads, and therefore the machine should saturate anyway. Having that abundance of available parallelism, and then exploiting just enough to keep the threads busy, is necessary for scalability.
We could of course count the number of available tasks, and run only that number of threads. But that introduces a central counter that becomes a point of contention that can really hurt performance.
Internally, we've been kicking around a better algorithm for gating the threads that might deliver the sort of behavior that you expect. We haven't released it because there didn't seem to be enough motivating use cases to justify it. But perhaps you have one. If you want to submit your example through the contribution process, we can investigate whether our better algorithm does in fact solve the problem and is worth releasing.
I should have mentioned how to do scalability studies with TBB. For scalability studies, set the number_of_threads argument for task_scheduler_init. Ideally, the rest of the program should ignore the number of threads and just create as much potential parallel work as possible, while keeping grain sizes large enough that parallel overheads do not swamp useful work. The parallel depth-first work-stealing strategy in the TBB task scheduler usually does a good job of turning that potential parallelism into the right amount of actual parallelism.
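For concreteness, here is a minimal sketch of such a study, using the old-style task_scheduler_init constructor; the loop body, array size, and grain size are placeholders for whatever your program actually does:

#include <cstdlib>
#include <cstddef>
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// Placeholder body: stands in for the real work per element.
struct ApplyWork {
    float* a;
    void operator()( const tbb::blocked_range<std::size_t>& r ) const {
        for( std::size_t i=r.begin(); i!=r.end(); ++i )
            a[i] *= 2.0f;
    }
};

int main( int argc, char* argv[] ) {
    // Take the thread count from the command line so a script can sweep it.
    int p = argc>1 ? std::atoi(argv[1])
                   : tbb::task_scheduler_init::default_num_threads();
    tbb::task_scheduler_init init( p );

    const std::size_t N = 1000000;
    float* a = new float[N];
    ApplyWork body;
    body.a = a;
    // Expose all N iterations as potential parallelism; the grain size of 1000
    // keeps per-task overhead small relative to the useful work.
    tbb::parallel_for( tbb::blocked_range<std::size_t>(0, N, 1000), body );
    delete[] a;
    return 0;
}

A driver script can then run the same binary for p = 1, 2, 4, ... and plot speedup against p.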
Thanks for the detailed description of your situation. Your point about latency versus parallelism is an interesting one. The better algorithm that we have been kicking around would fix the polling issue. It would not fix the number of threads issue.
Your problem is the dual of another problem we've been looking at: blocking tasks. If a task starts a blocking operation (like I/O), it would make sense for TBB to fire up another worker for the duration of the block. You've presented the opposite situation: when one of your threads gets a request (and is thus running), you need a way to decrease the number of TBB workers. You've got me thinking about how to combine the solutions for both.
I'm not sure what is causing the quantization. I saw a similar effect here for a benchmark once, and thought it was a peculiarity of the operating system. But now you've got me wondering if there's a subtlety we've overlooked.
I resurrected the thread gating algorithm we had been kicking around. After I repaired a flaw in it, it did fix the polling issue. (Your example revealed the flaw. Thanks!) I ran a task_scheduler_init for P threads and ran a parallel_for for M tasks, and the "top" command reports only M*100% utilization of the machine when M < P.
The bad news is that the change degrades performance of an affinity_partitioner variation of our seismic wave benchmark by 2x-4x on Linux. I have not tried other operating systems yet. What seems to happen is that the new algorithm is aggressive about blocking software threads (via a futex) when they are not busy. When a software thread wakes up again, the affinity to its hardware thread seems to be lost. The problem seems to be largely limited to affinity-sensitive benchmarks. I have not tried explicit affinity masks for the threads yet; but in the past those have degraded performance for most benchmarks.
Regaining the performance would seem to boil down to the common programming trick of polling for a short while before sleeping. The tricky part is figuring out how to automatically tune the polling duration. Maybe it should even base the duration on whether affinity has been specified for recent tasks.
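In case the trick is unfamiliar, here is a rough, self-contained sketch of the idea, written with standard C++ primitives rather than the futex-based code actually inside TBB; spin_time stands in for the tunable polling duration:

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Sketch only: illustrates spin-then-block, not the real scheduler internals.
class WorkGate {
    std::atomic<bool> work_available;
    std::mutex mtx;
    std::condition_variable cv;
public:
    WorkGate() : work_available(false) {}

    // Worker side: poll for a short while, then fall back to blocking.
    void wait_for_work( std::chrono::microseconds spin_time ) {
        std::chrono::steady_clock::time_point deadline =
            std::chrono::steady_clock::now() + spin_time;
        // Polling phase: the thread stays on its core, so affinity is preserved.
        while( std::chrono::steady_clock::now() < deadline )
            if( work_available.load(std::memory_order_acquire) )
                return;
        // Blocking phase: give up the core until work arrives.
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait( lock, [this]{ return work_available.load(std::memory_order_acquire); } );
    }

    // Producer side: publish work and wake any blocked workers.
    void announce_work() {
        {
            std::lock_guard<std::mutex> lock(mtx);
            work_available.store(true, std::memory_order_release);
        }
        cv.notify_all();
    }
};

Tuning spin_time is the hard part: too short and affinity-sensitive workloads lose their cores, too long and idle workers burn cycles.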
Anyway, we can likely get it into a developer release by the end of January.
I'm back from vacation and reworked the thread gating algorithm today so that it does not foul the affinity partitioner. Here's a summary of the new logic:
A worker tries to steal for a while, and if unsuccessful, it starts taking a global snapshot of all the task pools. If all pools are found to be empty, the thread blocks until another task is added to one of the task pools.
To deal with core-to-thread affinity issues, the "tries to steal for a while" period is extended a little bit beyond the minimum necessary, so that if a program alternates rapidly between serial and parallel sections, the workers do not block, and thus do not lose their core-to-thread affinity (as I was seeing when the blocking was being done via a Linux futex).
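To make the logic concrete, here is a deliberately simplified model of the snapshot-and-block step; the names, the epoch counter, and the use of a condition variable are illustrative stand-ins, not the real TBB internals:

#include <atomic>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

// Simplified model only; the real task pools and blocking mechanism differ.
struct TaskPool {
    std::mutex m;
    std::deque<int> tasks;   // tasks reduced to ints for brevity
    bool empty() { std::lock_guard<std::mutex> g(m); return tasks.empty(); }
};

struct Gate {
    std::mutex m;
    std::condition_variable cv;
    std::atomic<long> epoch;   // bumped whenever a task is spawned
    Gate() : epoch(0) {}

    // Called by whoever adds a task to a pool.
    void notify_spawn() {
        { std::lock_guard<std::mutex> g(m); ++epoch; }
        cv.notify_all();
    }

    // Called by a worker that has observed all pools empty.
    void block_until_spawn( long observed ) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait( lk, [&]{ return epoch.load() != observed; } );
    }
};

// Worker side: runs only after the "steal for a while" phase came up empty.
void maybe_block( std::vector<TaskPool>& pools, Gate& gate ) {
    long observed = gate.epoch.load();   // remember the epoch before the snapshot
    bool all_empty = true;
    for( TaskPool& p : pools )
        if( !p.empty() ) { all_empty = false; break; }
    if( all_empty )
        gate.block_until_spawn( observed );   // any spawn since the snapshot wakes us
    // either way, the worker goes back to stealing
}

Reading the epoch before taking the snapshot is what keeps a task spawned between the snapshot and the block from being missed.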
The change should show up in the next development release, likely by mid-January.