I want to confirm the following guarantee on sibling execution under blocking-style task execution. I think it is implied by the documentation of blocking style, but I'm not sure:
The thread that executed task X cannot start a sibling of X until X is complete.
The relevant scenario: some parent creates task X and one or more siblings. One or more of those siblings do not start at that time due to a shortage of available worker threads. X creates several children and blocks waiting for them, and X's thread works on one or more of those children. Meanwhile, the slowest child of X is stolen by a different thread. Eventually X's original thread runs out of descendants of X to work on, but that slowest child is still in progress on the other thread, so X's thread cannot resume executing X yet.
I'm hoping that thread can steal a child of one of X's siblings if that sibling had been started by a different thread, but cannot take up one of X's siblings that has not started at all. If no tasks other than siblings of X are ready to execute, I would want that thread to spin or sleep.
After resuming and completing X, that thread should again be free to start a sibling of X. It is only because it started X and has not yet completed it that I want it prevented from starting work on a sibling.
If that guarantee is not built into the definition of blocking, is there some reasonable extra step I can take to get that?
I don't see any rigorous definition of "sibling" in TBB, and my question assumes we know what "sibling" means. For my purposes, all siblings are created "together" by one execution of one function by one parent task. All siblings also have the same successor task, and if it helps make this work, that successor could itself be a blocking parent (the fact that X itself is blocking is inherent in my question).
The official answer would probably be that it is not officially guaranteed. But you can also probably rely on it anyway.
It used to be that stealing happened only from greater depth, which would indeed have provided the desired guarantee, but only until the implementation changed. Maybe it'll change again?
Meanwhile, threads don't steal from themselves, so if these are really sibling tasks (and not sibling ranges!) you can probably rely on this. That said, generating many siblings of equal size is usually not the best solution anyway.
However... how does that help you? The spawning thread may happen to observe mutual exclusion between sibling tasks, but other threads definitely will not.
Specifically, if thread T1 spawns X1, X2, ..., X8, and X9, and starts with X9, then X8, etc., then maybe thread T2 steals X1, spawns some descendant tasks, waits for all, does not have any tasks in its local queue because T3 stole a task and is still busy with it, ...and then T2 steals X2? In that situation you have a thread (T2), albeit not the original one (T1), that's executing sibling tasks (X9 and X8) "at the same time" (re-entrantly would be the word, I guess).
So, to conclude: A previous implementation of the scheduler, so long ago that I don't remember the release number off the top of my head, would never have re-entered sibling tasks in any thread, but in the current implementation threads other than the spawning thread have no such qualms. You may not see it happen in your tests, but Murphy guarantees it will happen in the field.
Sorry for my rather careless reply above, where I only dealt with a literal interpretation of the question. The same topic has even come up before, if I'm not just having a déjà vu now.
Raf Schietekat wrote:
Specifically, if thread T1 spawns X1, X2, ..., X8, and X9, and starts with X9, then X8, etc., then maybe thread T2 steals X1, spawns some descendant tasks, waits for all, does not have any tasks in its local queue because T3 stole a task and is still busy with it, ...and then T2 steals X2?
My reason for wanting this arises in managing some of the private storage issues in converting code that was not originally designed to be multi-threaded.
That storage is not exactly thread local. It may end up being used by multiple threads after subtasks are stolen. At the lower level the data use is thread-safe by application design and should not be private. But across a specific set of sibling tasks at a higher level, the storage must be duplicated. I was hoping to manage that in a way that depends on threads not being re-entrant at that specific level.
I thought the design aimed at avoiding the risk of unlimited stack growth in a thread was documented as preventing the re-entrancy I'm worried about.
In your description, I don't see how T2 can steal X2 before X1 finishes. But I am just a beginner at this. If T2 could steal X2 before X1 finishes, where does the protection against unlimited stack depth kick in? T2 could then steal X3 before X2 finishes, etc.
Is there a better way to divide X1 ... Xn? On average they are far more than large enough that the simple division into tasks is not too fine-grained. Because they are highly variable in duration, batching them together before splitting into tasks is harmful (in addition to being unnecessary for any task-overhead concerns). But the storage requirement would be a big problem if the number of these Xi "running" concurrently were larger than the number of threads. The storage management would be simpler if each thread were restricted as I originally asked: it cannot start any other Xi while its stack contains a waiting copy of Xj. But even with more complicated storage management, I need a way to limit the growth, which ultimately means a way to limit starting new Xi when too many other Xi are waiting. I guess I could work out how to do that explicitly with a collection of dummy tasks to manage the set of real tasks. I was just hoping the definition of blocking saved me from needing to do that.
In an earlier implementation, unlimited stack growth was avoided by not allowing a thread to steal except from greater task depth. But that changed, and now threads merely stop stealing when their stack is already about half full.
If the concern is the multiplicity of simultaneous execution rather than the re-entrancy, perhaps the most natural fit is a pipeline, where you can explicitly limit the number of data items in flight. So instead of spawning child tasks, you have objects in a container, and you run a pipeline with a parallel stage that invokes their execute() member function (or whatever you'll call it), still allowing for recursive parallelism.
On the other hand, if re-entrancy is indeed not a problem, what makes you think that multiplicity will explode in your original design? Did you observe this already, or are you just worrying about the possibility? TBB doesn't want threads to steal tasks unless there's no other way to keep busy, because stealing is less efficient than executing local work but more efficient than idling. So the conditions I described above are not the most likely to occur, and should probably only be a concern if there are strict limits that must never ever be exceeded.
Don't overcomplicate things!