Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

why concurrent_queue not use Lock-Free Queue?

softarts
Beginner
1,520 Views
I don't think concurrent_queue/micro_queue is efficient, it probably need to wait if 2 threads access the
same micro_queue,why not consider implement micro_queue as Lock-Free Queue?
0 Kudos
8 Replies
Dmitry_Vyukov
Valued Contributor I
1,520 Views
Quoting - softarts
I don't think concurrent_queue/micro_queue is efficient, it probably need to wait if 2 threads access the
same micro_queue,why not consider implement micro_queue as Lock-Free Queue?

What lock-free implementation do you have in mind?

0 Kudos
softarts
Beginner
1,520 Views
Quoting - Dmitriy Vyukov

What lock-free implementation do you have in mind?


there are many Lock Free queues implemented withCAS(compare and swap).does TBB developer consider that?
0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,520 Views
Quoting - softarts
there are many Lock Free queues implemented withCAS(compare and swap).does TBB developer consider that?

First of all, TBB devs need not ANY lock-free queue, they need fast lock-free queue. Lock-freedom is not about performance and scalability. While TBB is. In some conditions some lock-based algorithms are indeed faster than lock-free algorithms. AFAIR, TBB's queue reduces contention on each micro-queue by a factor of 8 (probably have to be tunable parameter). Have you measured performance/scalability of TBB's queue againts lock-free implementation?
Then, some lock-free queues require element to be a pointer (single-word POD). TBB's queue supports arbitrary element types.
Then, some lock-free queues make additional copies of elements and do not support exceptions thrown from ctor/copy ctor of element. This will also badly cooperate with arbitrary element types.

Having said that, personally I would use different algorithm for mpmc-queue in TBB, not necessary lock-free but just different (with stricten requirements for element types).

0 Kudos
softarts
Beginner
1,520 Views
Quoting - Dmitriy Vyukov

First of all, TBB devs need not ANY lock-free queue, they need fast lock-free queue. Lock-freedom is not about performance and scalability. While TBB is. In some conditions some lock-based algorithms are indeed faster than lock-free algorithms. AFAIR, TBB's queue reduces contention on each micro-queue by a factor of 8 (probably have to be tunable parameter). Have you measured performance/scalability of TBB's queue againts lock-free implementation?
Then, some lock-free queues require element to be a pointer (single-word POD). TBB's queue supports arbitrary element types.
Then, some lock-free queues make additional copies of elements and do not support exceptions thrown from ctor/copy ctor of element. This will also badly cooperate with arbitrary element types.

Having said that, personally I would use different algorithm for mpmc-queue in TBB, not necessary lock-free but just different (with stricten requirements for element types).


in micro_queue,TBB use "pause/yield" to avoid concurrently accessing.
Ido not know well aboutefficiency of this instruction.
probably I can replace the micro_queue by Lock-Free queue to see the performance.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,520 Views
Quoting - softarts
in micro_queue,TBB use "pause/yield" to avoid concurrently accessing.
Ido not know well aboutefficiency of this instruction.
probably I can replace the micro_queue by Lock-Free queue to see the performance.

Not only nature of the operation affects performance, but also frequency of its execution. Spinning must be very infrequent in TBB's queue, because load is uniformly distributed across all micro queues. Nth operation goes to (N % 8) micro queue, so 8 subsequent operations do not produce contention on micro queues at all (once again in the perfect world 8 must be statically/dynamically tunable, because 1024 processors still may produce substantial contention).

If you will implement lock-free queue and benchmark it against TBB queue it will be great, please post results here. Btw, how are going to solve safe memory reclamation problem?

0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,520 Views
Quoting - Dmitriy Vyukov
Not only nature of the operation affects performance, but also frequency of its execution.

Forget to mention, local spinning/blocking is at least perfectly scalable :) as opposed to retrying atomic RMWs.

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,520 Views

Lock free queues typically require DCAS(double word compare and swap)for some portions of the code. Not all platforms support DCAS.

As an example, you can safely add a node to the head of a singly linked list using a CAS, but you cannot safely remove a node from a singly linked list using a single CAS. There is a well known situation called the ABA problem and the recommended solution is to use a DCAS on {pointer, sequence number}. True, you can mangle a pointer with a small-ish counter and get some protection but the risk of undetected ABA increases as you shorten the number of bits in the counter.

Are you experiencing a problem with locksinside the scheduler portion of TBB? If not, then add your own favorite Lock-Free queue or whatnot. TBB in general has diminished lock contention by design. The TBB thread pool, typically does not exceed the number of hardware threads. Lock contention becomes (severely)problematic with over subscription of threads. This typically is not the case with a TBB application. Now if you write a hybrid with TBB +a bazillion threads of your own then do not point the finger at TBB. There are far worse things to worry about such as brain dead scheduler code in the operating system (as opposed to the thread scheduler withing an application).

Jim Dempsey
0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,520 Views

Lock free queues typically require DCAS(double word compare and swap)for some portions of the code. Not all platforms support DCAS.

As an example, you can safely add a node to the head of a singly linked list using a CAS, but you cannot safely remove a node from a singly linked list using a single CAS. There is a well known situation called the ABA problem and the recommended solution is to use a DCAS on {pointer, sequence number}. True, you can mangle a pointer with a small-ish counter and get some protection but the risk of undetected ABA increases as you shorten the number of bits in the counter.


If queue requires DCAS then it's also typically require TSM (type-stable memory). Not all systems provide TSM.
IMVHO, so called IBM-tagging (pointer+counter) is more of a hack, and hack only for part of the problem. What if consumer will free queue node? DCAS will just cause paging-fault.

And how I can organize efficient iteration over linked-list with DCAS?

IMVHO, the real solution is so called PDR (partial-copy-on-write deferred reclamatiom) techniques, which include SMR, RCU, VZOOM, PC, etc. They solve problem with paging-faults/memory-shooting in lock-free algos, and ABA problem fall away.

Btw, Microsoft uses 64-bit CAS for their SList API, on 32-bit platforms it's a DCAS, and on 64-bit platform it's just CAS. However they still pack 25-bit counter along with pointer. But along with DCAS/CAS they also use SEH to catch paging-faults.

0 Kudos
Reply