Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

thread seemingly executing multiple tasks simultaneously

ed_b_1
Beginner
639 Views

I have two function node body classes that subclass from the same parent. The parent has a thread-local member variable. Each child class sets that variable to its own value at the top of its operator() function, and then prints out this value some time later. Presumably, if a thread were processing one task at a time, everything would be fine with this structure. However, I occasionally observe that the printout made by class A contains class B's version of the member variable. Could somebody offer an explanation for this behavior?

0 Kudos
9 Replies
RafSchietekat
Valued Contributor III

It seems like expected behaviour for a thread to sometimes steal a task, execute it, and later come back to where it was, when given the opportunity by some form of nested parallelism. Each task only gets executed by a single thread (disregarding recycling), not necessarily contiguously, but still strictly nested. So you could get ABA, or ABCA, or ABCBA, etc., on a particular thread, but not ABABA.
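The strictly nested A-B-A pattern described above can be reproduced without TBB at all. The following is a minimal single-threaded sketch, not TBB code: `current_owner`, `pending`, `wait_point` and `run_body` are hypothetical names, and `wait_point` merely stands in for a thread that is waiting inside a nested parallel algorithm and picks up another task in the meantime.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the parent's thread-local member (hypothetical name).
thread_local std::string current_owner;

// Toy task pool for the current thread (not a TBB type).
std::vector<std::function<void()>> pending;

// Mimics a thread waiting inside a nested parallel algorithm:
// it runs one pending task to completion before returning to its caller.
void wait_point() {
    if (pending.empty()) return;
    std::function<void()> next = std::move(pending.back());
    pending.pop_back();
    next();
}

// A body sets the thread-local at the top and reads it back later.
std::string run_body(const std::string& name, bool has_nested_wait) {
    current_owner = name;
    if (has_nested_wait) wait_point();  // another body may run here
    return current_owner;               // may no longer be `name`
}

// Body A waits, body B runs nested inside it on the same thread (A B A).
std::string demo_tls_clobbered() {
    pending.clear();
    pending.push_back([] { run_body("B", false); });
    return run_body("A", true);
}
```

In this toy model `demo_tls_clobbered()` returns B's value from A's body, while a body without a wait point always reads back its own value.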

ed_b_1
Beginner

Could you refer me to a doc page and/or a discussion on task scheduling where these details are mentioned? This would have been hugely helpful when I was designing my flow graph, but I somehow missed it in my reading of the docs. My design assumed that I could use thread-locals to carry state information of a body object, both over the course of execution of a single task and from one task to another involving the same body. However, this assumption is completely invalidated by the scheduling paradigm you describe. Also, and perhaps this will be clarified by the docs, is there any way out of this situation, such as, for example, a way to make certain tasks (or portions of tasks) look atomic to the scheduler?

RafSchietekat
Valued Contributor III

One of the basic principles of TBB is load distribution through task stealing. That means that, whenever a thread is blocked inside the scheduler, perhaps waiting for a subtask to complete in a parallel_for, it is allowed to steal a task from another thread to keep busy. So actually you do get an early indication that the association between a thread and a task is not necessarily contiguous. That said, there has to be a specific opportunity for a thread to steal another task, so if there are no TBB algorithms or so running in a particular piece of code you should get the contiguous association that you want (I'm rather picky about what to call "atomic"...).

ed_b_1
Beginner

In your example with the parallel_for launched from a function node body, the tasks associated with the parallel_for are nested within the flow graph parallelism and so have higher priority to complete. In my case, the tasks are of equal priority, and I would have expected the task stealing to be done more conservatively in such a case (i.e. only by threads that were done with their own tasks). Fwiw, there are no tbb calls in my node body, but there is a blocking call to a function.

I am looking for a workaround that would prevent certain tasks from being interrupted in the way described in previous posts. So far, the best I've come up with is for the body to have a thread-local counter of tasks "in flight", and to transparently recycle the input associated with any task back to the input port of the node if the counter is not at zero. With a sufficient number of threads, the chance that this recycled task would continually hit the same thread is minimal (in my case, only about 10% of the tasks must be thread-exclusive; the other 90% don't care). I would appreciate smarter solutions to this problem.
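The "in flight" counter described above might look roughly like the sketch below. It covers only the detection part and leaves the recycling to the caller. All names (`bodies_in_flight`, `InFlightGuard`, `Outcome`, `run_exclusive`) are hypothetical, and nothing here is a TBB API.

```cpp
#include <cassert>

// Number of guarded bodies currently on this thread's call stack.
thread_local int bodies_in_flight = 0;

// RAII guard (hypothetical): bumps the per-thread counter for its lifetime
// and remembers whether another guarded body was already running.
class InFlightGuard {
public:
    InFlightGuard() : nested_(bodies_in_flight++ > 0) {}
    ~InFlightGuard() { --bodies_in_flight; }
    InFlightGuard(const InFlightGuard&) = delete;
    InFlightGuard& operator=(const InFlightGuard&) = delete;

    // True when this body started inside another guarded body on this thread.
    bool nested() const { return nested_; }

private:
    bool nested_;
};

enum class Outcome { Ran, Recycle };

// Runs `work` only if no other guarded body is in flight on this thread;
// otherwise asks the caller to feed the input back to the node.
template <typename Work>
Outcome run_exclusive(Work&& work) {
    InFlightGuard guard;
    if (guard.nested()) return Outcome::Recycle;
    work();
    return Outcome::Ran;
}
```

A node body would wrap its thread-exclusive part in `run_exclusive(...)` and resubmit its input whenever the result is `Outcome::Recycle`.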

 

RafSchietekat
Valued Contributor III

There is no "higher priority" for tasks associated with a nested parallel_for, that's another mechanism. The scheduler will of course favour executing tasks from its own pool, and so local tasks tend to execute locally, but perhaps another thread stole one of those nested tasks, and then while the local thread is waiting for that other thread to complete it would go out stealing itself. No honour among thieves in TBB land...

What do you mean by "a blocking call to a function"? Even if the thread is basically doing nothing, e.g., literally sleeping, it would not be able to steal other work unless TBB is aware of it being idle. You should then be able to predict that the Body operates without "interruption" (although you should not consider that necessarily "better", because nested parallelism is generally a good thing), because TBB is not preemptive that way.

Your workaround sounds weird, subverting what a flow graph should be doing, but I have no idea what you are trying to do and why you want to avoid nested parallelism where the thread might steal a task executing the Body of another function_node. Is that TLS just for detecting something, or for basic functionality? Why not just use an automatic variable for execution-local state?
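To illustrate the automatic-variable suggestion: state held in the body's own stack frame survives any nested execution on the same thread, because each invocation has its own copy. This is a minimal sketch against a toy single-thread pool rather than TBB itself; `ToyPool`, `run_body_local` and `demo_local_state` are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy single-thread task pool (hypothetical, not a TBB type).
struct ToyPool {
    std::vector<std::function<void()>> pending;

    // Mimics a wait inside a nested algorithm: run pending work first.
    void wait_point() {
        while (!pending.empty()) {
            std::function<void()> task = std::move(pending.back());
            pending.pop_back();
            task();
        }
    }
};

// Execution-local state: `owner` lives in this invocation's stack frame.
std::string run_body_local(ToyPool& pool, const std::string& name) {
    std::string owner = name;  // automatic variable, one per invocation
    pool.wait_point();         // other bodies may run here on this thread
    return owner;              // unaffected by whatever ran in between
}

// Body A waits, body B runs nested inside it; A still sees its own state.
std::string demo_local_state() {
    ToyPool pool;
    pool.pending.push_back([&pool] { run_body_local(pool, "B"); });
    std::string seen_by_a = run_body_local(pool, "A");
    assert(pool.pending.empty());
    return seen_by_a;
}
```

This covers state that is needed only for the duration of a single execution. It does not by itself address state that is meant to be reused from one task to the next.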

(2015-10-17 Edited) queue->pool (it's actually a deque)

ed_b_1
Beginner

Raf Schietekat wrote:

What do you mean by "a blocking call to a function"? Even if the thread is basically doing nothing, e.g., literally sleeping, it would not be able to steal other work unless TBB is aware of it being idle.

This is a call to one of several waiting-type functions in the OpenCL API, for example clWaitForEvents (https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clWaitForEvents.html). I don't know the specifics of how these functions are implemented, but based on empirical observations, TBB seems to be aware when one of these functions is invoked and seems to start on another task.

Raf Schietekat wrote:

Is that TLS just for detecting something, or for basic functionality? Why not just use an automatic variable for execution-local state?

I use TLS to store pointers to OpenCL and CUDA host and device buffers, which take quite a bit of time to allocate. To save this allocation overhead, I reuse the buffers for similar tasks. But this only works if tasks that operate on the same set of buffers don't step on each other. Which, it appears, they do.

I agree that my solution is a messy one. Ideally, one would be able to signal to TBB that "this task must run to completion". Failing that, there could at least be a way to trick TBB into thinking the thread is busy while it's waiting for the device to complete.
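One alternative to keeping the reusable buffers in TLS is to tie them to the execution instead of the thread: each task checks a buffer set out of a shared pool and returns it when done, so a nested task on the same thread simply gets a different set. The following is a sketch under stated assumptions, not something proposed in this thread. `BufferSet` is a hypothetical stand-in for the OpenCL/CUDA host and device buffers, and `BufferPool` and `Lease` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for an expensive-to-allocate set of buffers.
struct BufferSet {
    std::vector<float> host;
    explicit BufferSet(std::size_t n) : host(n) {}
};

class BufferPool {
public:
    explicit BufferPool(std::size_t elems) : elems_(elems), allocations_(0) {}

    // RAII lease: hands the buffers back to the pool on destruction.
    class Lease {
    public:
        Lease(BufferPool& pool, std::unique_ptr<BufferSet> buf)
            : pool_(&pool), buf_(std::move(buf)) {}
        Lease(Lease&&) = default;
        Lease(const Lease&) = delete;
        ~Lease() { if (buf_) pool_->give_back(std::move(buf_)); }
        BufferSet* get() { return buf_.get(); }
    private:
        BufferPool* pool_;
        std::unique_ptr<BufferSet> buf_;
    };

    // Reuses a free buffer set if one exists, otherwise allocates a new one.
    Lease acquire() {
        std::unique_ptr<BufferSet> buf;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (free_.empty()) {
                ++allocations_;
            } else {
                buf = std::move(free_.back());
                free_.pop_back();
            }
        }
        if (!buf) buf.reset(new BufferSet(elems_));  // slow path, outside the lock
        return Lease(*this, std::move(buf));
    }

    // How many buffer sets have been allocated so far.
    std::size_t allocations() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return allocations_;
    }

private:
    void give_back(std::unique_ptr<BufferSet> buf) {
        assert(buf);
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(std::move(buf));
    }

    std::size_t elems_;
    mutable std::mutex mutex_;
    std::vector<std::unique_ptr<BufferSet>> free_;
    std::size_t allocations_;
};
```

A body would hold a `BufferPool::Lease` in an automatic variable for the duration of one execution. The pool then grows to the peak number of simultaneously active executions, nested ones included, and allocates nothing further after that.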

RafSchietekat
Valued Contributor III

That seems rather implausible, except if the OpenCL implementation uses TBB internally. You can even use sleep(3), and TBB won't make any attempt to do anything with that blocked thread.

jimdempseyatthecove
Honored Contributor III

Are you by chance incorporating Cilk Plus into your application? The task implementation in Cilk Plus permits a thread change for the continuing thread across a task-invoking statement (parallel construct). The stack pointers and registers are context switched, but not the TLS context.

if not,

Are you confusing tasks with threads? By this I mean: you start multiple tasks, and at some point each task (thread) allocates buffers and stores the pointers in the TLS buffer pointer of whichever thread is running that task at the time. Assume that, immediately after initializing the TLS buffer pointer, your task (each of the ones that initialized a buffer and stored the pointer in its TLS) issues a parallel_something... Are you expecting each spawned task to have the same TLS buffer pointer as the spawning task?

Are you aware that TLS management is not necessarily global? Meaning that TBB, OpenCL, OpenMP, CUDA, MS C++, Intel C++ (Intel Fortran, ...), when mixed within a single application, may or may not have interfering TLS management. Therefore (re the prior paragraph), upon completion of the parallel_something..., if for example Cilk Plus were involved somewhere deep in the parallel_something, you could expect a thread change upon return. Whether this is the case with OpenCL, OpenMP, CUDA, MS C++, Intel C++ (Intel Fortran), I cannot say.

Jim Dempsey

ed_b_1
Beginner

ed b. wrote:

Fwiw, there are no tbb calls in my node body,

This was a falsehood. One of the node bodies had a parallel_for call, and that's the only task that was ever being interrupted. Should have seen this earlier. I've now taken it out, and the problem described in the original post has not reoccurred. So yes, the execution of a task can only be set aside (aka "interrupted") by TBB if that task is itself spawning TBB tasks. Thanks for sticking to your guns on that, otherwise I'd still be looking for the problem elsewhere.

 

jimdempseyatthecove wrote:

Are you expecting each spawned task(s) to have the same TLS buffer pointer as the spawning task?

No, of course not. But this is a good point that bears repeating to anyone that's using TLS and spawning TBB tasks.

 

 
