I have a simple parallel_for loop that runs over blocked_range<size_t>(0, TASKS, 1), or in other words, creates exactly TASKS tasks. To measure scalability, I tried varying the number of tasks from 2 to 8 on an 8-core machine (2x4). I always initialize without specifying the number of threads, so I assume 8 are created.
While running this experiment, I noticed using top(1) that pretty much all the CPUs are at or near 100%. This is true regardless of the value of TASKS, although I verified (with printfs) that the correct number of tasks is indeed created.
Anybody else observed this behavior and/or can explain what the other cores are so busy doing?
The scheduler currently has an "all or nothing" mechanism for gating its worker threads. If any work is spawned, then all the workers are turned loose and start looking for work. Those that cannot find work periodically yield. So if you are turning loose just a few tasks, all the worker threads will appear busy.
The scheduler is designed this way because in the Cilk-like paradigm, a program is expected to generate many more tasks than there are threads, and therefore the machine should saturate anyway. Having that abundance of available parallelism, and then exploiting just enough of it to keep the threads busy, is necessary for scalability.
We could of course count the number of available tasks, and run only that number of threads. But that introduces a central counter that becomes a point of contention that can really hurt performance.
Internally, we've been kicking around a better algorithm for gating the threads that might deliver the sort of behavior that you expect. We haven't released it because there didn't seem to be enough motivating use cases to justify it. But perhaps you have one. If you want to submit your example through the contribution process, we can investigate whether our better algorithm does in fact solve the problem and is worth releasing.
I should have mentioned how to do scalability studies with TBB. For scalability studies, set the number_of_threads argument for task_scheduler_init. Ideally, the rest of the program should ignore the number of threads and just create as much potential parallel work as possible, while keeping grain sizes large enough that parallel overheads do not swamp useful work. The parallel depth-first work-stealing strategy in the TBB task scheduler usually does a good job of turning that potential parallelism into the right amount of actual parallelism.
Unfortunately, I cannot share the code, but I can however describe the problem in precise enough terms. I believe the issue may be general enough to warrant the interest of other developers.
I have a server that responds to requests, which can take varying amounts of time to compute. My goals are to reduce latency (time per request) on the one hand, while increasing throughput (requests per unit time) on the other. By parallelizing each request to the maximum, I can reduce latency easily enough. However, that is not always smart: parallelization has overhead, and the speedup is sublinear in a way that is inherent to the request. Parallelizing more will lower latency (to a point), but at the cost of lower efficiency.
So instead, I have a heuristic that limits the parallelization to a value that produces latency that is "low enough", while keeping the other cores free to handle other requests, thus improving throughput. In other words, I have multiple threads processing requests, each parallelized across the smallest number of tasks/threads that will bring the latency below a latency threshold.
In this model, I cannot take advantage of the fine-grained parallelism that TBB excels at, since each additional task incurs non-negligible overhead. The number of tasks should match the number of threads (and the tasks are well balanced, so there is no benefit in task stealing). So what I end up doing is a parallel_for over blocked_range<size_t>(0, tasks, 1), and decompose the data inside each task.
I've noticed two problems so far. The first is the one mentioned above: even when tasks is set to N < 8, all 8 CPUs appear busy.
The other problem I noticed relates to scalability tests I ran, similar to the one you suggest above. Running a single request with t=1..N tasks and initializing task_scheduler_init(t) with the number of tasks each time creates "quanta" of scalability: it is no longer a smooth speedup curve but a step function. Is it possible that parallel_for with blocked_range is somehow responsible?
Here's some example data from my scalability run, setting init to # of tasks:
Forgive me for rambling, but I find this very interesting. I may end up concluding, however, that a pthreads model is more appropriate for this particular family of problems, even though I like TBB better :)
Thanks for the detailed description of your situation. Your point about latency versus parallelism is an interesting one. The better algorithm that we have been kicking around would fix the polling issue. It would not fix the number of threads issue.
Your problem is the dual of another problem we've been looking at: blocking tasks. If a task starts a blocking operation (like I/O), it would make sense for TBB to fire up another worker for the duration of the block. You've presented the opposite situation: when one of your threads gets a request (and is thus running), you need a way to decrease the number of TBB workers. You've got me thinking about how to combine the solutions for both.
I'm not sure what is causing the quantization. I saw a similar effect here for a benchmark once, and thought it was a peculiarity of the operating system. But now you've got me wondering if there's a subtlety we've overlooked.
The quantization problem I discussed was caused by a regression (a bug :-) in the way I implemented the parallel_for: instead of blocked_range<size_t>(0, MAX_TASKS, 1), I had something like: blocked_range
As for the more interesting problem of limiting CPU usage, I ran the following experiment: I eliminated all task_scheduler_init calls before the routine that runs the parallel_for. Then, in the routine's scope, I create a task_scheduler_init(MAX_TASKS), which guarantees that CPU usage == number of threads == number of tasks. However, the extra overhead of re-creating the thread pool (about 0.2s, apparently, in my configuration) is not negligible.
Do you have any other suggestions I might try before attempting pthreads?
BTW--it's only the polling issue I'm concerned about, not the threads issue--I don't care if there are 100 threads dormant, as long as they don't create noise in the system. Any thoughts on when that's going to get implemented? :)
I resurrected the thread gating algorithm we had been kicking around. After I repaired a flaw in it, it did fix the polling issue. (Your example revealed the flaw. Thanks!) I ran a task_scheduler_init for P threads and a parallel_for for M tasks, and the "top" command reports only M*100% utilization of the machine when M < P.
The bad news is that the change degrades performance of an affinity_partitioner variation of our seismic wave benchmark, by 2x-4x on Linux. I have not tried other operating systems yet. What seems to happen is that the new algorithm is aggressive about blocking software threads (via a futex) when they are not busy. When a software thread wakes up again, the affinity to its hardware thread seems to be lost. The problem seems to be largely limited to affinity-sensitive benchmarks. I have not tried explicit affinity masks for the threads yet; in the past those have degraded performance for most benchmarks.
Regaining the performance would seem to boil down to the common programming trick of polling for a short while before sleeping. The tricky part is figuring out how to automatically tune the polling duration. Maybe it should even base the duration on whether affinity has been specified for recent tasks.
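The poll-before-sleeping trick can be sketched in plain C++ like this (WorkGate and the tuning knob are invented for illustration; this is not the TBB implementation):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical worker-side wait: spin (poll) for a bounded interval looking
// for work, then block on a condition variable until work is posted.
class WorkGate {
public:
    void post_work() {
        { std::lock_guard<std::mutex> lk(m_); pending_ = true; }
        cv_.notify_one();
    }
    // poll_us is the tunable polling duration the text talks about: long
    // enough that brief serial sections do not force workers to block.
    bool wait_for_work(long poll_us) {
        std::chrono::steady_clock::time_point deadline =
            std::chrono::steady_clock::now() + std::chrono::microseconds(poll_us);
        while (std::chrono::steady_clock::now() < deadline) {
            if (take()) return true;            // found work while polling
            std::this_thread::yield();
        }
        std::unique_lock<std::mutex> lk(m_);    // poll budget spent: block
        cv_.wait(lk, [this] { return pending_; });
        pending_ = false;
        return true;
    }
private:
    bool take() {
        std::lock_guard<std::mutex> lk(m_);
        if (!pending_) return false;
        pending_ = false;
        return true;
    }
    std::mutex m_;
    std::condition_variable cv_;
    bool pending_ = false;
};
```

The open question is exactly how to pick poll_us automatically, possibly based on whether recent tasks carried affinity hints.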
Anyway, we can likely get it into a developer release by the end of January.
I've run another experiment today to measure the effect of this polling. I took two different TBB applications. The first (the one I'm interested in; call it "A") runs a computation that takes about 7s when limited to 4 threads (or about 75s when run 10 times in a wrapper). The other application ("B") is a "stressor" that takes about 80s to compute with 4 threads. Both times were obtained with each application running separately.
First, I ran both applications together (on an 8-way machine), with A declaring task_scheduler_init init (default) and B declaring task_scheduler_init init(4). In this configuration, application B took about 7% longer to run. A exhibited a more irregular pattern, with 8 of the 10 runs taking about the normal 7s, and the other two taking about 11.6s and 13.3s.
There are several other possible sources of interference that have nothing to do with TBB's polling: I/O and memory contention, OS noise, etc. However, while application A is memory-intensive, application B is mostly compute-bound, with very little RAM usage (less than 2M resident). The OS could be a factor, but when measuring either application independently, they scale reasonably well to 8 threads, so I suspect it's ok.
To verify that the source of the interference is TBB's polling in application A, I ran a second experiment: I limited both inits to 4 threads. And indeed, run times for both applications were unaffected compared to running them in isolation.
So the fix to the polling issue is indeed required for predictable performance in our setup. I don't understand the change you made and how it affected affinity, but perhaps I can make a naive suggestion: instead of doing away with polling altogether, perhaps you can just implement an exponential backoff polling delay. Or better yet, poll not on time intervals, but rather whenever a task is added to any run queue (the completion/deletion of a task should probably not trigger a poll, as stealing tasks cannot improve things over the previous stealing attempt).
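The exponential-backoff idea, as a rough standalone sketch (the class name and constants are invented, not a proposal for TBB's actual internals):

```cpp
#include <algorithm>
#include <chrono>
#include <thread>

// Hypothetical backoff: each failed steal attempt doubles the pause, up to
// a cap, so persistently idle workers generate exponentially less noise.
class Backoff {
public:
    explicit Backoff(long max_us = 1000) : pause_us_(1), max_us_(max_us) {}
    void pause() {
        std::this_thread::sleep_for(std::chrono::microseconds(pause_us_));
        pause_us_ = std::min(pause_us_ * 2, max_us_);
    }
    void reset() { pause_us_ = 1; }   // call after a successful steal
    long current_us() const { return pause_us_; }
private:
    long pause_us_, max_us_;
};
```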
I'm back from vacation and reworked the thread gating algorithm today so that it does not foul the affinity partitioner. Here's a summary of the new logic:
A worker tries to steal for a while, and if unsuccessful, it starts taking a global snapshot of all the task pools. If all pools are found to be empty, the thread blocks until another task is added to one of the task pools.
To deal with core-to-thread affinity issues, the "tries to steal for a while" period is extended a little bit beyond the minimum necessary, so that if a program alternates rapidly between serial and parallel sections, the workers do not block, and thus do not lose their core-to-thread affinity (as I was seeing when the blocking was being done via a Linux futex).
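A rough single-threaded model of that gating decision (names and the int task type are invented; the real implementation parks the thread on an OS primitive rather than returning a flag):

```cpp
#include <deque>
#include <vector>

typedef std::deque<int> TaskPool;

// Hypothetical gating check: after stealing has failed for a while, a worker
// snapshots every task pool; only if all pools are empty may it block until
// another task is spawned into one of them.
bool should_block(const std::vector<TaskPool>& pools) {
    for (std::size_t i = 0; i < pools.size(); ++i)
        if (!pools[i].empty())
            return false;   // snapshot found work somewhere: keep stealing
    return true;            // all pools empty: safe to block until a spawn
}
```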
The change should show up in the next development release, likely by mid-January.