Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.
2464 Discussions

Use of pthread locking primitives in concurrent_queue

milom
Beginner
447 Views
I noticed the code for the concurrent_queue uses pthread or Windows synchronization primitives directly. That is, it doesn't use either spin_mutex or even the "mutex" wrapper around the system mutex code.

Was this done intentionally? Or is it just that the code hasn't been updated to use the lower-level primitives?

I guess part of my confusion is that I think of TBB as "stacking blocks" in which TBB provides wrappers for the lowest-level primitives (atomic operations and theads), a layer of synchronization (locks and such), mid-level primitives such as containers and the task scheduler, and then very high-level operations (such as parallel_for). I found that concurrent_queue "skip a layer".

0 Kudos
6 Replies
ARCH_R_Intel
Employee
447 Views

Optimization is the root of all ugliness. We were going after maximum performance, and so tailored the concurrent_queue implementation to the synchronization primitives specific to each platform. In particular, we wanted:

  1. No locking when no blocking had to be done, and keep atomic operations at a minimum, because they are slow.
  2. Get the thread off the OS's run queue when blocking has to be done. The TBB 2.0 concurrent_queue did busy waiting sometimes, which was a problem for some users.

Credit goes to our developer Wooyoung Kim for coming up with an implementation that met the above desired properties. It is complex. Indeed, as for some of our other complex algorithms, Spinwas used for checking during development.

It would be nice to have a layerin TBB for doing this sort of thing (i.e. properties 1 and 2 above) so we do not have to write so much platform-specific code. My preference, if we can figure out how to support itacross all platforms, is to provide futex-like functionality in TBB.

0 Kudos
milom
Beginner
447 Views
I see.

It seems that most of the OS-specific code is for the signal/wait conditional variables and such (to avoid the busy waiting you mentioned). For doing task-based parallel code, it seems that using such conditional variables would be that helpful. Is that the case? Or were these users doing something other than task-based parallelism?

Which I brings up another question? Do all the worker threads continue to busy-wait when no parallel tasks are running (say, during a sequential region of code)? It seems like that would be a bigger worry.

P.S. I'm glad to see that Spin was used for checking the queue. I've actually done some work on model checking lock-free data structures myself.

0 Kudos
ARCH_R_Intel
Employee
447 Views

Yes, condition variables are not good for task-based parallelism, which is why they are currently missing from TBB. The TBB containers, however, are intended to work for both task-based and plain-old-threading parallelism.

The class concurrent_queue class is particulary problematic in task-based parallelism for several reasons:

  1. blocking - consumers have to wait for producers, hence the underlying thread may be tied up doing nothing for a while.
  2. parallelism is always optional, not mandatory in task-based parallelism. So the programmer has to be careful to avoid depending on parallelism for progress (e.g., the producer of an item must run before the consumer of an item!)
  3. [joke] A queue is a data structure that lets your data get optimally cold (evicted fromcache) before touching it again.

Users not using task-based parallelism find concurrent_queue useful.Likewise for the mutex and atomic operations.

We would have loved to have your checker a few years ago when we were debugging our queuing_rw_mutex on Itanium and optimizing our useof fences.

Our handling of idle worker threads has been evolving with various releases. Here's a summary of the evolution:

  1. [original unreleased prototype of TBB] Just spin.
  2. [TBB 1.0] Spin if any worker thread is busy. Sleep if in serial section.
  3. [current] Sleep if work cannot be found. Wake up if any task is spawned.

What caused us to switch from (2) to (3) is programs that lacked parallel slack; e.g. had 2-way parallelism on an 8-core machine. (3) is a little hyperactive in that spawning a single task wakes up all workers, but seems to work well in practice because assuming there is parallel slack, once the number of spawned tasks transitions from 0 to 1, more tasks come quickly to fully subscribe the machine. If there is not parallel slack, the extra workers soon go back to sleep. The reason we do not do more precise sleeping/waking is that we have not found a scalable algorithm for doing such in a task-stealing environment. E.g., keeping a centralized counter of the number of ready tasks would be a bad bottleneck on stock hardware.

0 Kudos
milom
Beginner
447 Views
Thanks for the clarifications. Makes sense to me. Part of the confusion likely is related to the concurrent_queue vs pipeline issue. It seems that the pipeline model of TBB is often the right thing, but I could see why programmers might use concurrent_queue to build their own pipeline-like operations.

However, I'm not sure I agree with the implication that explicit locks and task-based parallelism don't mix. Sure, sometimes tasks are strictly data-parallel and work only on private or their partition of data.

Yet, I'm seeing the use of some parallelism in which the tasks access read/write data structures. One example is a dynamic programming sort of problem in which expensive computations are cached in a shared data structure to prevent redundant computation.

In such a context, the mixing of locks and tasks is fine. The blocking nature of locks are unlikely to be a problem, as any task won't hold an lock for very long.
0 Kudos
ARCH_R_Intel
Employee
447 Views

Shortly-held locks are certain okay for task-based programming. That's what we designed the TBB mutexes for.

Condition variables seem to be inherently a problem in task-based parallelism, particularly if the parallelism is optional, not mandatory. If a task is waiting on a condition to occur, and there is no parallelism, then there is no task to set the condition.

0 Kudos
RafSchietekat
Valued Contributor III
447 Views
Unfortunately I'm only doing some low-level tinkering at the moment, so I hope this is not too naive: what would be the essential difference between a condition variable and a subtask that needs to complete, so that the task scheduler couldn't just assume the administrative burden of translating the signalling of condition variables to the releasing of a task? One difficulty seems to be to decide how to implement waiting: if it is not safe to start a new task on the same stack (probably not), some mechanism has to be available to pass the core to another thread until a good point is found to switch back (blocked, end of task, ...). It may not be safe either to completely suspend a waiting task's thread, but perhaps changing its priority could do the trick. Or is this (mostly) nonsense?
0 Kudos
Reply