I went through the concurrent_queue class in TBB and I have a few questions:
1. The concurrent_queue_rep class makes sure that head_counter and tail_counter are placed on separate cache lines so that the pushing threads and popping threads don't contend while accessing them. Also, consecutive operations on the queue are directed to different micro-queues so that these operations can proceed fairly independently. The size of each micro_queue happens to be 20 bytes, and concurrent_queue_rep stores all 8 micro-queues consecutively in memory (starting at the beginning of a cache line).
So my question is: even though 8 concurrent operations are directed to 8 different micro-queues, all 8 threads are accessing memory somewhere within these 160 bytes (8 micro-queues * 20 bytes), and I see quite some contention. So why not keep each of the micro-queues on a different cache line as well?
2. I do not understand why the field 'mask' in the page struct is used to decide the success of a popping operation. The operations SpinwaitUntilEq( head_counter, k ) and SpinwaitWhileEq( tail_counter, k ) in the micro_queue::pop() method already make sure that an item is present for the popping thread to pop, so I don't see why mask is needed on top of that to determine whether the item is present.
3. The following is the code in the internal_pop method of concurrent_queue_base:
do {
    k = r.head_counter++;
} while( !r.choose(k).pop(dst,k,*this) );
Why is head_counter incremented in a do-while loop? How will a thread that does not find an item for pop operation N find an item for operation (N+1)?
4. If the threads that push into the concurrent queue and the threads that pop from it are different, then the popping threads are the ones that call deallocate_page(). This again forces a slow path in scalable_free, since the popping threads will never be the owners of the page memory. So why not have something like the LIFOQueue (used in the scalable allocator code) per micro-queue to park free pages and get them back when necessary? Would this not reduce contention in the scalable_allocator as well?
Please correct me if my understanding of the above is wrong.
#19 "The API is called semaphore and supported by both POSIX and Win32."
Yes, but the details vary: I was thinking of using the number as a hint. Non-coalescing multiple signal() calls would wake up at least the specified number of threads, whereas semaphores wake up exactly that number of threads (conceptually, anyway). The "exactly" part seems potentially more costly, and, not having been there from the beginning (if I remember correctly), wouldn't semaphores perhaps be implemented on top of something else and be subject to the thundering herd problem anyway? But it seems plausible that thinking beyond semaphores and dealing with specific threads is better, and I now think that this is what you meant by "What is better: single call to broadcast() or 4 calls to signal()? I don't know.".
#19 "I don't think signal() calls can be coalesced."
Hey, why did my program suddenly stop working on that new platform? :-)
Jim, you will find in me a captivated audience.
Quoting - Raf Schietekat
#19 "The API is called semaphore and supported by both POSIX and Win32."
Yes, but the details vary: I was thinking of using the number as a hint, non-coalescing multiple signal() calls would wake up at least the specified number of threads, and semaphores wake up exactly that number of threads (conceptually, anyway). The "exactly" part seems potentially more costly, and, not being there from the beginning (if I remember correctly) wouldn't semaphores perhaps be implemented on top of something else and subject to the thundering herd problem anyway?
Just now I do not see why it would be more costly... unless "at least N" really means "always one" or "always zero". If the primitive frankly tries to wake up N threads, then the cost must be the same, I think.
Also note that if a producer wants to wake up 4 threads because it has just submitted 4 items to the queue, then a primitive that wakes up only 3 threads may be suboptimal in itself, because 1 item will be delayed. I.e., in some cases you want "exactly N" semantics.
Quoting - jimdempseyatthecove
a) is it more efficient to signal the desired number of each waiting thread via individual Events
or
b) is it more efficient to have a permutation of Events
e.g.
64 single events
2 x 32 x 32 32-thread events
4 x 16 x 16 x 16 x 16 16 thread events
c) all threads are waiting on a single event (as it is now implemented in the subject)