Hello everybody,
I was wondering if there is any plan to add a concurrent stack to TBB? Indeed, a concurrent_queue is almost always the right way to pass data from one thread to another, but sometimes we would like to fetch the most recently added data.
Thanks!
That's what happens without synchronisation: R asks a question (reading arbitrary memory locations), D sees it, R removes it, D still answers it. :-)
Thanks all the same, but the idea was to directly probe some location out of an unmanageably large number of them (many, many thousands), and I don't suppose any of these techniques would improve over simply using pthread_sigmask()? (The thing was that I had not looked beyond process-global signal-related operations.) But I may still need to dig deeper if that function is too expensive.
Raf_Schietekat:
Hmm... Threads 1 and 2 see A, thread 1 takes it, thread 3 takes B, thread 1 pushes A, and thread 2 wreaks havoc by not allowing for a pretty straightforward ABA? This can even happen with memory reclamation; it would just be less likely that some thread would recycle the memory and push it on the stack at just the wrong time, but not impossible. What am I missing?
You are missing that it's *safe* memory reclamation, not just memory reclamation :)
If thread 2 sees A, then it has acquired A; then A can't be recycled through the memory reclamation system, and then thread 1 can't push A back onto the stack.
Safe memory reclamation gives the same guarantees as GC in modern languages, i.e. a thread can't reallocate an object while it's still in use.
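To make the interleaving concrete, here is a minimal sketch (my own illustration, not TBB code) of a Treiber-style pop() on a raw pointer, with the ABA window marked in comments. Under safe memory reclamation, A cannot be recycled and re-pushed while thread 2 still holds it, so the stale CAS below cannot succeed:

```cpp
#include <atomic>

// Treiber-style stack with NO safe memory reclamation -- illustrates
// the ABA scenario discussed above. Names are illustrative.
struct Node {
    Node* next;
    int value;
};

std::atomic<Node*> head;

Node* pop() {
    for (;;) {
        Node* h = head.load();       // thread 2 reads head == A
        if (!h) return nullptr;
        Node* next = h->next;        // reads A->next == B
        // If thread 2 stalls here while thread 1 pops A, thread 3 pops B
        // (and B's memory is freed/recycled), and thread 1 pushes A back,
        // then...
        if (head.compare_exchange_strong(h, next))
            return h;                // ...this CAS still sees head == A and
                                     // succeeds, installing the stale
                                     // pointer B. That's ABA.
    }
}
```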
Raf_Schietekat:
That's what happens without synchronisation: R asks a question (reading arbitrary memory locations), D sees it, R removes it, D still answers it.
Thanks all the same, but the idea was to directly probe some location out of an unmanageably large number of them (many, many thousands), and I don't suppose any of these techniques would improve over simply using pthread_sigmask()? (The thing was that I had not looked beyond process-global signal-related operations.) But I may still need to dig deeper if that function is too expensive.
They DO improve!
You are missing one more point. You can use pthread_sigmask() (or __try/__except) to solve the problem ONLY IN the STACK algorithm. You can't solve the problem with pthread_sigmask() in, for example, a queue algorithm, or in other algorithms, because in a queue algorithm there are 2 kinds of hazards: (1) a thread ACCESSES potentially freed memory, and (2) a thread MODIFIES potentially freed memory. With pthread_sigmask() you can solve only problem (1), not problem (2). So you will end up with threads sometimes modifying random memory locations! In the stack algorithm there is only problem (1).
Safe memory reclamation techniques (other names are PDR and GC) solve BOTH problems.
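To see the difference between the two hazards in code, here is a sketch (my illustration, not TBB code or a published algorithm) of a Treiber-style stack pop next to a Michael-Scott-style queue enqueue, with the hazard sites marked:

```cpp
#include <atomic>

// Shared node type; 'next' is atomic because the queue CASes on it.
struct Node {
    std::atomic<Node*> next;
    int value;
};

std::atomic<Node*> stack_head;   // Treiber-style stack
std::atomic<Node*> queue_tail;   // Michael-Scott-style queue tail
                                 // (must be initialized to a dummy node)

// Stack pop: the only hazard is a READ of possibly-freed memory.
Node* stack_pop() {
    for (;;) {
        Node* h = stack_head.load();
        if (!h) return nullptr;
        Node* next = h->next.load();   // hazard (1): read-only access.
        // If h was already freed, this read can at worst fault; the fault
        // can be absorbed (signal trickery / SEH), and the CAS below then
        // fails on the changed head, so we just retry. Nothing is corrupted.
        if (stack_head.compare_exchange_strong(h, next))
            return h;
    }
}

// Queue enqueue: hazard (2) is a WRITE to possibly-freed memory.
void queue_push(Node* n) {
    n->next.store(nullptr);
    for (;;) {
        Node* t = queue_tail.load();
        Node* expected = nullptr;
        // hazard (2): this CAS MODIFIES t->next. If t was freed and its
        // memory recycled, the CAS can succeed and silently scribble on
        // whatever lives there now -- no fault is raised, nothing to catch.
        if (t->next.compare_exchange_strong(expected, n)) {
            queue_tail.compare_exchange_strong(t, n);       // swing tail
            return;
        }
        queue_tail.compare_exchange_strong(t, expected);    // help, retry
    }
}
```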
Merging two subthreads in one reply:
If thread 1 has to get clearance first before reusing A (most likely getting a different piece of memory instead), that would indeed work.
I didn't want to show my cards yet before I was reasonably certain they could form a winning hand, but my purpose with random memory probing is to look for magic values in hypothetical headers (relates to "hazard 1") and so amortise any expensive disambiguation operations needed before making a final decision (relates to "hazard 2"), unrestricted to any particular sequence of events. It seems unlikely to pay off, even before timing the calls, but maybe some insight will come out of playing with the idea some more.
Raf_Schietekat:
If thread 1 has to get clearance first before reusing A (most likely getting a different piece of memory instead), that would indeed work.
Indeed. The main question is how to do this.
Raf_Schietekat:
I didn't want to show my cards yet before I was reasonably certain they could form a winning hand, but my purpose with random memory probing is to look for magic values in hypothetical headers (relates to "hazard 1") and so amortise any expensive disambiguation operations needed before making a final decision (relates to "hazard 2"), unrestricted to any particular sequence of events. It seems unlikely to pay off, even before timing the calls, but maybe some insight will come out of playing with the idea some more.
It's difficult to completely eliminate the window of inconsistency between the probe and the final CAS.
STMs can do this, but they have their own problems and limitations.
Btw, you can use Relacy Race Detector:
http://groups.google.com/group/relacy
to validate your algorithm. Relacy is designed exactly for such tasks. I've validated several GC algorithms with Relacy, and it always finds some extremely subtle algorithmic errors and incorrect placements of memory fences.
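For what it's worth, a minimal Relacy test looks roughly like the sketch below. I'm writing this from memory of the project's examples, so treat the header name, the `($)` access instrumentation, and the `rl::test_suite`/`rl::simulate` interface as assumptions to check against the link above:

```cpp
#include <relacy/relacy_std.hpp>  // assumed header name -- check the docs

// A deliberately racy example: relaxed-store publication of a plain
// variable. Relacy exhaustively explores interleavings and should
// report the data race on x.
struct race_test : rl::test_suite<race_test, 2> {
    std::atomic<int> flag;
    rl::var<int> x;                // plain (non-atomic) variable

    void before() {
        flag($).store(0, rl::memory_order_relaxed);
        x($) = 0;
    }

    void thread(unsigned thread_index) {
        if (thread_index == 0) {
            x($) = 1;                                    // plain write
            flag($).store(1, rl::memory_order_relaxed);  // relaxed: no release!
        } else {
            if (flag($).load(rl::memory_order_relaxed) == 1) {
                int r = x($);                            // race detected here
                (void)r;
            }
        }
    }
};

int main() {
    rl::simulate<race_test>();
}
```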
Raf_Schietekat:
But I would be interested in a safe way to read arbitrary memory locations where no such guarantee exists, because I've been struggling with that question this week...
Btw, I'm going to post here in the near future my new proxy-collector algorithm and a concurrent hash map algorithm which uses proxy-collector reclamation.
Preliminary benchmark results on a Q6600 are as follows. On a modest write workload the hash map beats TBB's hash map by a factor of 20; on a mostly read-only workload, by a factor of 50. At least up to 4 cores, scaling is linear, as opposed to TBB's hash map.
One of the reasons for such great scaling is exactly the amortized proxy-collector algorithm.
That's ridiculous. You are hereby forbidden to be that much faster.
Raf_Schietekat:
That's ridiculous. You are hereby forbidden to be that much faster.
Well, until I post the code, you can't refute my point... and I can't prove my point either :)
I have to make some minor changes, conduct alpha tests, and make final benchmarks. Then I will post the code, and we will be able to resolve the dispute ;)
Frankly, until this moment I was testing the hash map w/o memory reclamation, i.e. memory was just not freed at all. Now I am bolting on the proxy-collector, but I hope that it will have only a minor influence on the fast path.
Alexandre, just out of curiosity: how would you use this stack and what are the desired performance characteristics, considering that it may be implemented as non-blocking on most architectures but will never scale well if implemented naively to always push or pop just one item (which may even become a bottleneck for your application)?
Raf_Schietekat:
Alexandre, just out of curiosity: how would you use this stack and what are the desired performance characteristics, considering that it may be implemented as non-blocking on most architectures but will never scale well if implemented naively to always push or pop just one item (which may even become a bottleneck for your application)?
Alexandre, also take into account that many things that are not relevant in a single-threaded context become relevant in a multi-threaded context. For example: how many writer threads are there? How many threads access the 'head' of the structure? How many threads access the 'tail' of the structure? Do you need to know the size of the structure? Etc.
Adding (removing) any such requirement can dramatically decrease (increase) the performance/scalability of the data structure.
randomizer:
Alexandre, also take into account that many things that are not relevant in a single-threaded context become relevant in a multi-threaded context. For example: how many writer threads are there? How many threads access the 'head' of the structure? How many threads access the 'tail' of the structure? Do you need to know the size of the structure? Etc.
Adding (removing) any such requirement can dramatically decrease (increase) the performance/scalability of the data structure.
For example, here you can see an extremely efficient implementation of a single-producer/single-consumer LIFO stack. It uses no atomic RMW operations nor heavy memory fences in push() and pop():
http://groups.google.com/group/lock-free/browse_frm/thread/4a410afb609e2ee1
Well, it's as efficient as a LIFO structure can be in principle. A LIFO structure will always be substantially less scalable than a FIFO structure, just because it by definition concentrates contention on one end.
Raf_Schietekat:
That's ridiculous. You are hereby forbidden to be that much faster.
You are free to test it:
/en-us/forums/showthread.php?t=60494
My needs are quite simple. All I want is a concurrent container that favors a First-In/Last-Out order, as it will avoid some unnecessary work in the project I'm currently working on. So I don't care if it doesn't strictly follow this order.
For now, I can use concurrent_queue and my application behaves correctly, but it could be better.
As for the performance characteristics, to be honest, I'm just an end user: I want it to have minimal overhead and to perform as fast as possible :-)
Is the same set of threads doing the pushing and the popping? Or are there separate pusher and popper threads?
- Arch
alexandre.hamez:
My needs are quite simple. All I want is a concurrent container that favors a First-In/Last-Out order, as it will avoid some unnecessary work in the project I'm currently working on. So I don't care if it doesn't strictly follow this order. For now, I can use concurrent_queue and my application behaves correctly, but it could be better. As for the performance characteristics, to be honest, I'm just an end user: I want it to have minimal overhead and to perform as fast as possible :-)
The devil is in the details!
A full-fledged multi-producer/multi-consumer stack is inherently slow.
If a multi-producer/single-consumer stack is enough, then it can be implemented more efficiently (see the sketch after this list).
If a single-producer/single-consumer stack is enough, then it can be implemented even more efficiently.
If you don't need a size() function, and is_empty() is enough, then it can be implemented even more efficiently.
If you are going to throw a full load of threads at it, then a completely different algorithm must be employed:
http://www.cs.tau.ac.il/~shanir/nir-pubs-web/Papers/Lock_Free.pdf
And so on...
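As an illustration of the multi-producer/single-consumer relaxation (my sketch, not a published algorithm from this thread): if the single consumer may drain the whole stack at once, pop becomes one XCHG, and the consumer-side reclamation hazard disappears, because a detached chain is privately owned:

```cpp
#include <atomic>

// Multi-producer/single-consumer LIFO sketch: producers push with one
// CAS each; the ONE consumer detaches the entire chain with one XCHG
// and then walks it privately, with no per-node pop CAS and no
// reclamation hazard on the consumer side.
struct Node {
    Node* next;
    int value;
};

std::atomic<Node*> head{nullptr};

// Any thread may push (one CAS per push).
void push(Node* n) {
    Node* h = head.load(std::memory_order_relaxed);
    do {
        n->next = h;
    } while (!head.compare_exchange_weak(h, n,
                 std::memory_order_release, std::memory_order_relaxed));
}

// Only ONE thread may call this: detach the whole stack (one XCHG),
// then traverse the returned chain in LIFO order.
Node* pop_all() {
    return head.exchange(nullptr, std::memory_order_acquire);
}
```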
The details:
- Multiple producers/consumers (2-8 threads)
- pop_if_present() and push() are all I need.
- The size will grow to at most 500-1000 elements
I didn't think a concurrent stack was so complicated. I thought it was just a matter of adapting a concurrent queue.
alexandre.hamez:
The details: - Multiple producers/consumers (2-8 threads) - pop_if_present() and push() are all I need. - The size will grow to at most 500-1000 elements. I didn't think a concurrent stack was so complicated. I thought it was just a matter of adapting a concurrent queue.
Hmm... I think that in this situation it's difficult to create something substantially better than a plain old mutex-based stack...
Potentially the producer side can be made lock-free, while the consumer side remains mutex-based. I am not sure whether it's worth doing.
If strict LIFO order is not required, then the stack can be made distributed: for example, 4 different LIFO stacks, with producers and consumers choosing a stack at random on every operation. This will improve scaling under high load. But it's definitely not a general-purpose solution suitable for the library...
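A minimal sketch of that distributed idea, assuming K = 4 shards and the pop_if_present() interface mentioned earlier (the shard layout and random policy are mine):

```cpp
#include <cstddef>
#include <mutex>
#include <optional>
#include <random>
#include <stack>

// K independent mutex-based stacks; each operation picks one at random.
// Ordering is only approximately LIFO, but contention spreads over K locks.
template <typename T, std::size_t K = 4>
class distributed_stack {
    struct alignas(64) shard {          // pad to avoid false sharing
        std::mutex mtx;
        std::stack<T> items;
    };
    shard shards_[K];

    static std::size_t pick() {
        thread_local std::minstd_rand rng{std::random_device{}()};
        return rng() % K;
    }

public:
    void push(T v) {
        shard& s = shards_[pick()];
        std::lock_guard<std::mutex> lock(s.mtx);
        s.items.push(std::move(v));
    }

    // Try the random shard first, then scan the rest so a non-empty
    // shard is not missed just because the dice landed on an empty one.
    std::optional<T> pop_if_present() {
        std::size_t start = pick();
        for (std::size_t i = 0; i < K; ++i) {
            shard& s = shards_[(start + i) % K];
            std::lock_guard<std::mutex> lock(s.mtx);
            if (!s.items.empty()) {
                T v = std::move(s.items.top());
                s.items.pop();
                return v;
            }
        }
        return std::nullopt;
    }
};
```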
Even without being substantially better (under favourable circumstances), a self-respecting concurrent stack building block would use atomic operations instead of a mutex to prevent convoying.
Raf_Schietekat:
Even without being substantially better (under favourable circumstances), a self-respecting concurrent stack building block would use atomic operations instead of a mutex to prevent convoying.
How are you going to solve the problem of safe memory reclamation w.r.t. the stack's pop() function?
You can end up being not "even without being substantially better" but substantially worse, because all current state-of-the-art bullet-proof memory reclamation algorithms require one or more atomic RMW operations or #StoreLoad-style memory barriers per operation. Compare that with a lock-based stack, which has only 1 XCHG per operation and is devilishly simple.
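For reference, the "1 XCHG per operation" structure can be sketched like this: a test-and-set spinlock whose lock() is a single exchange and whose unlock() is a plain store, guarding an ordinary intrusive list (names are mine, not TBB's):

```cpp
#include <atomic>

struct Node { Node* next; int value; };

// Lock-based stack: exactly one XCHG per push/pop (in the uncontended
// case), and no reclamation hazard -- no thread ever dereferences a
// node it does not exclusively own.
class spinlock_stack {
    std::atomic<bool> locked_{false};
    Node* head_ = nullptr;

    void lock() {
        while (locked_.exchange(true, std::memory_order_acquire))  // the XCHG
            /* spin */;
    }
    void unlock() {
        locked_.store(false, std::memory_order_release);           // plain store
    }

public:
    void push(Node* n) {
        lock();
        n->next = head_;
        head_ = n;
        unlock();
    }

    Node* pop_if_present() {
        lock();
        Node* n = head_;
        if (n) head_ = n->next;
        unlock();
        return n;
    }
};
```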
Why would a lock-based stack have any less of a memory management burden than an atomic-based stack?
Anyway, since std::vector is not expected to shrink back from its high-water mark either, I would just push popped elements onto an internal reuse stack.
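A sketch of how that internal reuse stack could look, assuming a fixed pool sized to the high-water mark and a version tag to fend off ABA (pool size and field widths are illustrative); the tag+index packing is exactly the double-word RMW cost that comes up in the next reply:

```cpp
#include <atomic>
#include <cstdint>

// Nodes live in a fixed pool and are never returned to the allocator,
// so pop() can always read a node safely; the version tag packed next
// to the index defeats ABA on each list head.
constexpr uint32_t kNil = 0xFFFFFFFFu;

struct Node {
    std::atomic<uint32_t> next;  // relaxed ops; the tagged head CAS orders
    int value;
};

class pooled_stack {
    static constexpr uint32_t kCapacity = 1024;  // high-water-mark storage

    Node pool_[kCapacity];
    std::atomic<uint64_t> live_;   // tagged head of the live list
    std::atomic<uint64_t> free_;   // tagged head of the reuse list

    static uint64_t pack(uint32_t index, uint32_t tag) {
        return (uint64_t(tag) << 32) | index;
    }
    static uint32_t index_of(uint64_t v) { return uint32_t(v); }
    static uint32_t tag_of(uint64_t v) { return uint32_t(v >> 32); }

    uint32_t pop_from(std::atomic<uint64_t>& head) {
        uint64_t h = head.load();
        for (;;) {
            if (index_of(h) == kNil) return kNil;
            uint32_t next = pool_[index_of(h)].next.load(std::memory_order_relaxed);
            if (head.compare_exchange_weak(h, pack(next, tag_of(h) + 1)))
                return index_of(h);  // a stale 'next' makes the CAS fail
        }
    }

    void push_to(std::atomic<uint64_t>& head, uint32_t n) {
        uint64_t h = head.load();
        do {
            pool_[n].next.store(index_of(h), std::memory_order_relaxed);
        } while (!head.compare_exchange_weak(h, pack(n, tag_of(h) + 1)));
    }

public:
    pooled_stack() : live_(pack(kNil, 0)) {
        for (uint32_t i = 0; i + 1 < kCapacity; ++i)
            pool_[i].next.store(i + 1, std::memory_order_relaxed);
        pool_[kCapacity - 1].next.store(kNil, std::memory_order_relaxed);
        free_.store(pack(0, 0));     // initially all nodes are free
    }

    bool push(int v) {               // two tagged CASes per operation
        uint32_t n = pop_from(free_);
        if (n == kNil) return false; // pool exhausted
        pool_[n].value = v;
        push_to(live_, n);
        return true;
    }

    bool pop_if_present(int& v) {
        uint32_t n = pop_from(live_);
        if (n == kNil) return false;
        v = pool_[n].value;
        push_to(free_, n);           // recycle onto the reuse stack
        return true;
    }
};
```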
Raf_Schietekat:
Why would a lock-based stack have any less of a memory management burden than an atomic-based stack?
Because in a lock-based solution you can't get a SIGSEGV while dereferencing a pointer.
Raf_Schietekat:
Anyway, since std::vector is not expected to shrink back from its high-water mark either, I would just push popped elements onto an internal reuse stack.
Ok, that's a kind of memory management scheme too. So you end up having 2 atomic RMWs... 2 atomic double-word RMWs per operation, whereas in a lock-based stack you would have 1 XCHG per operation.
Also, how are you going to organize the internal reuse stacks? Will they be per user stack, per thread, per processor, or global?
Btw, this solution will work only for the stack algorithm, not for queues and others.
