Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

query scheduler for currently spawned task estimate

Alexander_Herz
Beginner
Hi,

I haven't found anything in the docs, so: is there any way to query the scheduler for (an estimate of) the number of currently scheduled tasks? Imagine some recursive algorithm (divide-and-conquer style) that can choose either to spawn tasks or to do the computation sequentially. I don't want to flood the scheduler with bazillions of tasks, especially if there aren't enough workers available, so that most tasks will be done sequentially anyway. Also, the task size tends to decrease the deeper the recursion gets.
Imho, a nice way to solve this would be to spawn tasks until the scheduler is "full" (i.e. the number of currently scheduled tasks >= the number of worker threads) and then go on sequentially. That way the early, large-grain recursion levels are done in parallel and the small stuff doesn't flood the scheduler.
Of course, I could count all the tasks (or just the tasks submitted by the recursion) myself (atomic inc/dec at task start/end) and compare that to the maximum number of worker threads. But that would add a highly contended counter to all my tasks.
Probably it would be sufficient to get an estimate of the current queue size of the current task's worker, so that no synchronization would be needed.

Thx,
Alex
10 Replies
RafSchietekat
Valued Contributor III
Providing this information would probably not be scalable, offsetting any benefits or worse, as you already indicate yourself with regard to atomic counters.

What makes you think you can do significantly better than imposing an absolute lower limit on the work for a task and creating tasks recursively, thus never "flooding" the scheduler? Task pool size is an extremely unreliable indicator of pending work because of recursive parallelism, with errors exponentially magnified.

auto_partitioner makes its decisions dynamically based on stealing, so maybe you could use it, or do something similar, if you find that you cannot keep parallel overhead within a reasonable limit any other way, which I doubt. But its rationale is to spare you the burden of tuning minimum task size, not so much to have as many big tasks as possible, which is merely the strategy used to achieve that goal.
anton_pegushin1
Beginner
Hi,
why exactly are you afraid of spawning bazillions of tasks? Is it (a) a memory limitation you need to comply with, or (b) do you want to make sure your worker threads have a chance to "come up for air" every once in a while (rather sooner than later, too)? Because if you think about it, knowing the number of tasks does not really help you with either (a) or (b).

Tasks are just light-weight C++ objects with a very small memory footprint. If your application is doing some real work, then the memory allocated from inside the tasks (work buffers, etc.) should be much larger than the tasks themselves. But in that case the number of tasks running in parallel is equal to the number of CPUs, and if you know the memory footprint of your inside-the-task code, you can easily estimate the total memory footprint.

If it's (b), i.e. you are trying to control the computation time of your algorithm, then (especially if auto_partitioner is used) the number of created tasks does not map onto the amount of work left. In that case, again, knowing the complexity in CPU cycles of your inside-the-task code (or your parallel-algorithm body), you can estimate from the size of the user input the amount of work needed to process all of that data. And if that's too long, you split the user input and process the data gradually.
Alexander_Herz
Beginner
The problem is that I have no general way of knowing the amount of work in a task I'm about to spawn recursively (for non-recursive tasks I have some rough estimates, and maybe some profiling info which is more exact than the estimate). So within the recursion I need some way to decide when to switch from spawning (potentially exponentially) more tasks to just doing the computation sequentially. This cutoff also happens to define how much work a task does (switching from spawning to sequential early gives bigger tasks, as the tasks will then perform the recursion themselves). And I need a generic way, not something tuned for a single problem instance.

I think the fill state of the scheduler (or just of the current worker thread) would be a good indicator here. If the scheduler is very full, it is quite likely that spawning more tasks will bring no benefit (the tasks pending in the scheduler may have been spawned by a previous call of the recursion or by some other part of the program).

If the scheduler is rather empty, then either the previously spawned tasks have already been processed or the recursion is just starting.

Using auto_partitioner sounds interesting; how do you suggest actually implementing that? I guess I could check whether the current task inside the recursion was stolen (indicating the scheduler is not full) and spawn more tasks only inside stolen tasks?

Thx,
Alex
Alexander_Herz
Beginner
Generating a lot more tasks than there are worker threads creates a lot of overhead: spawning a task and waiting for it to complete takes about 3k cycles on my machine (excluding, of course, the actual work done by the task). If the generated tasks (the bazillions, with only e.g. 4 workers) largely cannot be run in parallel, then the overhead cannot be recovered at all; the overall execution time just increases (irrespective of the memory footprint of the TBB tasks).

I forgot to mention that I work with plain tasks; there is no parallel_for or anything of the like.

An illustrative example would be examples/task/tree_sum/SimpleParallelSumTree.cpp

Here a predefined and rather arbitrary cutoff is used to switch between a completely sequential version (node_count < 1000) and a completely parallel version.
I want to switch between the parallel and serial versions dynamically, as soon as a proper number of tasks has been spawned by the parallel version, and the question is how to determine the switch dynamically.
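For reference, the static-cutoff pattern in question can be sketched like this; std::async is used here in place of TBB task spawning so the sketch is self-contained, and the 1000 mirrors the node_count < 1000 cutoff from the tree_sum example (the function name and tree shape are illustrative, not from the TBB sources):

```cpp
#include <future>
#include <numeric>
#include <vector>

// Divide-and-conquer sum over v[lo, hi). Below the cutoff the work is done
// sequentially; above it, the range is halved and one half is "spawned"
// asynchronously while the current thread recurses on the other half.
long parallel_sum(const std::vector<long>& v, std::size_t lo, std::size_t hi,
                  std::size_t cutoff = 1000) {
    if (hi - lo <= cutoff)                        // small: stay sequential
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0L);
    std::size_t mid = lo + (hi - lo) / 2;
    auto right = std::async(std::launch::async,   // "spawn" the upper half
                            parallel_sum, std::cref(v), mid, hi, cutoff);
    long left = parallel_sum(v, lo, mid, cutoff); // recurse on the lower half
    return left + right.get();
}
```

The cutoff is fixed at compile time, which is exactly the static decision Alex wants to replace with a dynamic one.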

Alex
Anton_M_Intel
Employee
What kind of generality do you need? Or in other words, is parallel_for with a custom range implementation enough for you? Parallel_for/reduce represent a big subset of the generic recursive divide-and-conquer approach. And auto_partitioner already implements the decision whether to spawn a task or execute sequentially, based on stealing feedback.
[P.S. to Anton Pegushin]: welcome back ;-) 
Alexander_Herz
Beginner
I just clarified that in my other answer to Anton Pegushin. Can I use the auto_partitioner together with plain tasks?

Thx,
Alex
Anton_M_Intel
Employee
No, but since your tasks form a tree, you can do the same with parallel_for/reduce [and a custom range].
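The custom range mentioned here only needs to satisfy TBB's Range concept: a splitting constructor, empty(), and is_divisible(). A minimal model can be written without any TBB headers (in real code the split tag is tbb::split; a stand-in struct is used so this compiles on its own):

```cpp
#include <cstddef>

struct split_tag {};  // stand-in for tbb::split

// Minimal model of the TBB Range concept. parallel_for/reduce call
// is_divisible(), and while it returns true they may invoke the splitting
// constructor to cut the range in two. The grain size is where the
// algorithm stops splitting and runs a subrange sequentially -- the
// spawn-vs-sequential decision, made per subrange.
class IndexRange {
    std::size_t lo_, hi_, grain_;
public:
    IndexRange(std::size_t lo, std::size_t hi, std::size_t grain = 1000)
        : lo_(lo), hi_(hi), grain_(grain) {}
    IndexRange(IndexRange& r, split_tag)      // take the upper half of r
        : lo_((r.lo_ + r.hi_) / 2), hi_(r.hi_), grain_(r.grain_) {
        r.hi_ = lo_;                          // shrink r to the lower half
    }
    bool empty() const { return lo_ >= hi_; }
    bool is_divisible() const { return hi_ - lo_ > grain_; }
    std::size_t begin() const { return lo_; }
    std::size_t end() const { return hi_; }
};
```

A tree-shaped range would split along child pointers instead of an index midpoint, but the three required operations are the same.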
anton_pegushin1
Beginner
Hi,
if you really don't know how much work your average task does, then, as I said before, knowing how many tasks there are in the system is useless and even dangerous. Consider a scenario where the whole job of your application is just element-wise addition of two vectors of size 1000. You create a task for this job, then you check the number of tasks in the system (it's one) and decide to recursively divide up the work. Let's say you have 24 cores on the machine you're running the app on, and you recursively divide the tasks until there are 10 times as many tasks in the system as there are cores (to give it a chance to balance the load, but still not overwhelm), so 240. This leaves you with 240 tasks, each of which (according to you) costs thousands of CPU cycles to create and spawn, and each one just adds 4-5 pairs of numbers. Now, did the knowledge help?
You brought up the tree_sum example from the TBB distribution. The cutoff parameter value is not as random as it seems. Estimate the cost of task management (creation, spawning, clean-up, etc.); this is your parallelization cost. You can't avoid it, but in order for the application to scale, this cost needs to be some 1-5% of the whole job of the application (10% maybe; the choice is yours). Then you go into your code, look at whatever useful unit of work the leaf task is going to perform, estimate the number of CPU cycles necessary to complete this one atomic computation, and then decide how many of these atomic computations need to be combined into one leaf task for the parallelization cost to be adequate or minimal. That's how you get to the value 1000 for the tree_sum example.
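This estimate can be written out as a couple of lines of arithmetic. The cycle counts below are illustrative assumptions, not measurements (Alex's ~3k-cycle spawn overhead is the one figure taken from the thread):

```cpp
#include <cstddef>

// Back-of-the-envelope cutoff calculator. task_overhead_cycles: cost to
// create/spawn/wait for/destroy one task; cycles_per_unit: cost of one
// atomic computation (e.g. visiting one tree node); max_overhead_percent:
// share of total time you are willing to spend on task management
// (1-5% per the advice above, 10% at the outside).
std::size_t min_units_per_leaf(std::size_t task_overhead_cycles,
                               std::size_t cycles_per_unit,
                               std::size_t max_overhead_percent) {
    // overhead <= (percent/100) * units * cycles_per_unit
    //   =>  units >= 100 * overhead / (percent * cycles_per_unit)
    std::size_t denom = max_overhead_percent * cycles_per_unit;
    return (task_overhead_cycles * 100 + denom - 1) / denom;  // ceiling
}
```

With a 3000-cycle spawn overhead, an assumed ~60 cycles per tree node, and a 5% budget, this yields a leaf of at least 1000 nodes, i.e. the same ballpark as tree_sum's cutoff.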
And lastly, even though you can't use auto_partitioner with plain tasks, that doesn't mean you can't go into the partitioner implementation and look up the criteria the TBB developers use to decide whether it's time to execute or to continue splitting the task - that's a good thing about an open-source library.
jimdempseyatthecove
Honored Contributor III
Here is a partial solution suggestion.

Create a Thread Local Storage variable that maintains a count of pending tasks or nesting levels for your thread. This variable can then be updated without interlocks (i.e. fast/low overhead). When the count exceeds X, you do not spawn/parallel_... (you specify X). This also does not interfere with additionally using an estimated load (e.g. vector size > 1000) in the decision to go parallel or not.
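A minimal sketch of this idea, using C++11 `thread_local` for the TLS variable and std::async in place of TBB task spawning; the limit X, the cutoff, and the work function are placeholders:

```cpp
#include <future>

// Per-thread count of tasks spawned but not yet joined by this thread.
// thread_local means plain increments/decrements with no interlocked
// operations, which is the point of the suggestion.
thread_local int pending_spawns = 0;

const int kSpawnLimit = 4;   // "X": a placeholder, tune per machine

// Recursive divide-and-conquer sum over [lo, hi). A branch stays
// sequential when the range is small OR this thread already has
// kSpawnLimit children in flight: the estimated-load test and the TLS
// test combine, as described above.
long long sum_range(long long lo, long long hi) {
    if (hi - lo <= 1000 || pending_spawns >= kSpawnLimit) {
        long long s = 0;
        for (long long i = lo; i < hi; ++i) s += i;    // sequential leaf
        return s;
    }
    long long mid = lo + (hi - lo) / 2;
    ++pending_spawns;                                  // TLS update only
    auto right = std::async(std::launch::async, sum_range, mid, hi);
    long long left = sum_range(lo, mid);               // recurse locally
    long long result = left + right.get();
    --pending_spawns;
    return result;
}
```

Note that a freshly spawned worker starts with its own zeroed counter, so this bounds spawning per thread rather than globally; in TBB proper one could hold the counter in tbb::enumerable_thread_specific so it follows the worker threads.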

Jim Dempsey
Alexander_Herz
Beginner
Hi,

thx for the suggestion. I'll try that.

Alex