Intel® oneAPI Threading Building Blocks

Questions on TBB Concurrent Queue Implementation

Shankar1
Beginner
I went through the concurrent_queue class in TBB and I have a few questions:

1. The concurrent_queue_rep class makes sure that head_counter and tail_counter are placed on separate cache lines so that the pushing threads and the popping threads don't contend while accessing them. Also, consecutive operations on the queue are directed to different microqueues so that these operations can proceed largely independently. The size of each micro_queue happens to be 20 bytes, and concurrent_queue_rep stores all 8 microqueues consecutively in memory (starting at the beginning of a cache line).

So my question is: although the 8 concurrent operations are directed to 8 different microqueues, all 8 threads end up accessing memory somewhere within this 160-byte region (8 microqueues * 20 bytes), and I see quite some contention. So why not keep each of the microqueues on a different cache line as well?
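To make concrete what I am suggesting, here is a minimal sketch (made-up names, not the actual TBB source) of padding each microqueue to its own cache line, assuming 64-byte lines:

// Minimal sketch (hypothetical names, not the actual TBB layout) of
// giving each micro-queue a full cache line, assuming 64-byte lines.
#include <atomic>
#include <cstddef>

constexpr std::size_t cache_line_size = 64;

struct micro_queue {            // stand-in for the real ~20-byte struct
    std::atomic<void*> head{nullptr};
    std::atomic<void*> tail{nullptr};
    std::atomic<unsigned> ticket{0};
};

// Padding the element type means array[i] and array[i+1] can never
// share a cache line, at the cost of ~3x the memory for the array.
struct alignas(cache_line_size) padded_micro_queue : micro_queue {};
static_assert(sizeof(padded_micro_queue) == cache_line_size,
              "each micro-queue occupies exactly one cache line");

struct queue_rep {
    alignas(cache_line_size) std::atomic<unsigned> head_counter{0};
    alignas(cache_line_size) std::atomic<unsigned> tail_counter{0};
    padded_micro_queue array[8];
};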

2. I do not understand why the field 'mask' in the page struct is used to decide the success of a popping operation. The operations SpinwaitUntilEq( head_counter, k ) and SpinwaitWhileEq( tail_counter, k ) in the micro_queue::pop() method already make sure that an item is present for the popping thread to pop, so I don't see what further use mask has in determining whether the item is present.
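For reference, the page header I am asking about looks roughly like this (sketch with invented names and sizes; the comments record one plausible role for the mask, which may well be the answer I am missing):

// Sketch of a page header with a validity mask (invented names; the
// real TBB struct differs). One plausible role for the mask: a push
// reserves slot k by bumping tail_counter *before* copy-constructing
// the item, and the copy constructor may throw. The spin waits on the
// counters then only prove that slot k was *reserved*, while the bit
// in mask records that it was actually *filled*.
#include <cstddef>
#include <cstdint>

constexpr std::size_t items_per_page = 32;

struct page {
    page*         next;
    std::uint32_t mask;   // bit i set => slot i holds a fully constructed item
    // item storage follows the header
};

inline bool slot_filled(const page& p, std::size_t k) {
    return (p.mask >> (k % items_per_page)) & 1u;
}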

3. The following is the code in the internal_pop method of concurrent_queue_base:
do {
    k = r.head_counter++;
} while( !r.choose(k).pop(dst,k,*this) );

Why is head_counter incremented in a do-while loop? How will a thread that does not find an item for pop operation N find an item for operation N+1?
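Here is the dispatch pattern as I read it, as a simplified compilable sketch (signatures invented; the real code also spin-waits and handles exceptions):

// Simplified sketch of the ticket-dispatch loop (invented signatures).
#include <atomic>

constexpr unsigned n_queue = 8;

struct micro_queue {
    // Returns false when ticket k's slot turns out to hold no item,
    // e.g. because the matching push never completed. Stubbed here.
    bool pop(void* dst, unsigned k) { (void)dst; (void)k; return true; }
};

struct rep {
    std::atomic<unsigned> head_counter{0};
    micro_queue array[n_queue];
    micro_queue& choose(unsigned k) { return array[k % n_queue]; }
};

void internal_pop(rep& r, void* dst) {
    unsigned k;
    do {
        k = r.head_counter.fetch_add(1);  // claim the next pop ticket
        // On failure the thread abandons ticket k for good and takes
        // a fresh ticket, which maps to a different micro-queue,
        // rather than retrying the same slot.
    } while (!r.choose(k).pop(dst, k));
}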

4. If the threads that push into the concurrent queue and the threads that pop from it are different, then the threads which pop from the queue will be the ones that call deallocate_page(). This again forces a slow path in scalable_free, since the popping threads will never be the owners of the page memory. So why not have something like the LIFOQueue (used in the scalable allocator code) per microqueue to hold the free pages and hand them back when necessary? Would this not reduce contention in the scalable_allocator as well?
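A rough sketch of the per-microqueue free list I have in mind (hypothetical, not what TBB does today; a production version would also need to handle the ABA problem noted in the comment):

// Hypothetical per-micro-queue page cache: consumers recycle pages
// into a LIFO free list and producers reuse them, so pages avoid
// scalable_free's cross-thread slow path.
#include <atomic>

struct page { page* next = nullptr; };

class page_cache {
    std::atomic<page*> top_{nullptr};
public:
    void put(page* p) {                       // called by the popping thread
        page* t = top_.load(std::memory_order_relaxed);
        do { p->next = t; }
        while (!top_.compare_exchange_weak(t, p,
                    std::memory_order_release,
                    std::memory_order_relaxed));
    }
    page* get() {                             // called by the pushing thread
        // NOTE: a real implementation must guard against ABA here,
        // e.g. with a version-counted pointer.
        page* t = top_.load(std::memory_order_acquire);
        while (t && !top_.compare_exchange_weak(t, t->next,
                    std::memory_order_acquire,
                    std::memory_order_relaxed))
            ;
        return t;   // nullptr => fall back to the allocator
    }
};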

Please correct me if my understanding of the above is wrong.
RafSchietekat
Valued Contributor III
#19 "The API is called semaphore and supported by both POSIX and Win32."
Yes, but the details vary: I was thinking of using the number as a hint, where non-coalescing multiple signal() calls would wake up at least the specified number of threads, whereas semaphores wake up exactly that number of threads (conceptually, anyway). The "exactly" part seems potentially more costly, and, not having been there from the beginning (if I remember correctly), wouldn't semaphores perhaps be implemented on top of something else and be subject to the thundering herd problem anyway? But it seems plausible that thinking beyond semaphores and dealing with specific threads is better, and I now think that this is what you meant with "What is better: single call to broadcast() or 4 calls to signal()? I don't know.".

#19 "I don't think signal() calls can be coalesced."
Hey, why did my program suddenly stop working on that new platform? :-)

Jim, you will find in me a captivated audience.
Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
#19 "The API is called semaphore and supported by both POSIX and Win32."
Yes, but the details vary: I was thinking of using the number as a hint, where non-coalescing multiple signal() calls would wake up at least the specified number of threads, whereas semaphores wake up exactly that number of threads (conceptually, anyway). The "exactly" part seems potentially more costly, and, not having been there from the beginning (if I remember correctly), wouldn't semaphores perhaps be implemented on top of something else and be subject to the thundering herd problem anyway?

Right now I do not see why it would be more costly... provided we do not mean "always one" or "always zero" behind "at least N". If the primitive genuinely tries to wake up N threads, then the cost must be the same, I think.
Also note that if a producer wants to wake up 4 threads because he just submitted 4 items to the queue, then a primitive that wakes up only 3 threads may be suboptimal in itself, because 1 item will be delayed. I.e. in some cases you want "exactly N" semantics.
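In current C++ this "exactly N" behavior maps directly onto a counting semaphore; a minimal sketch (C++20, the queue code itself elided):

// Minimal sketch of "exactly N" wakeups with a counting semaphore.
#include <semaphore>

std::counting_semaphore<1024> items{0};

void producer_push_batch(/* ...4 items... */) {
    // ... enqueue 4 items ...
    // Make 4 items claimable: up to 4 blocked consumers wake now,
    // and any remainder stays credited for future acquire() calls.
    items.release(4);
}

void consumer_pop_one(/* ... */) {
    items.acquire();     // returns only when an item is accounted for
    // ... dequeue one item ...
}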

Dmitry_Vyukov
Valued Contributor I
a) Is it more efficient to signal the desired number of waiting threads via individual per-thread Events,
or
b) is it more efficient to have a permutation of Events, e.g.
64 single-thread events
2 x 32-thread events
4 x 16-thread events
or
c) all threads waiting on a single event (as it is currently implemented in the subject library)?
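A rough sketch of option (a), with invented names and std::binary_semaphore standing in for a per-thread Event:

// Rough sketch of option (a): one private event per waiter, so a
// producer can wake exactly the number of threads it wants.
#include <cstddef>
#include <mutex>
#include <semaphore>
#include <vector>

struct waiter {
    std::binary_semaphore event{0};
};

class waiter_list {
    std::mutex m_;
    std::vector<waiter*> waiting_;
public:
    void wait(waiter& w) {
        { std::lock_guard<std::mutex> g(m_); waiting_.push_back(&w); }
        w.event.acquire();               // block on this thread's own event
    }
    void signal(std::size_t n) {         // wake exactly n threads
        std::vector<waiter*> chosen;     // (or all of them, if fewer wait)
        { std::lock_guard<std::mutex> g(m_);
          while (n > 0 && !waiting_.empty()) {
              chosen.push_back(waiting_.back());
              waiting_.pop_back();
              --n;
          }
        }
        for (waiter* w : chosen) w->event.release();
    }
};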
