I am in need of a container that allows fast and frequent iteration, which would suggest concurrent_vector, but that also allows frequent add / delete. All of this would be happening concurrently. It seems apparent why concurrent_vector doesn't support an erase / remove method: to maintain certain performance characteristics (in particular, erase would have to invalidate current iterators). Having said that, it seems an erase method could be improvised, especially if the erasing thread were willing to wait for iterating threads to complete.
Another option for erasing from a concurrent_vector might be to reuse individual elements once they are erased, without moving the rest of the array values around. Let's say you have a vector of 6 integers: 0, 1, 2, 3, 4, 5. And let's say you want to erase element[3], which also holds the value 3. This element's index (3) could be pushed onto a concurrent_queue and element 3 marked "unused". Another method, "push_next" for example, could then attempt to pop an index off of the concurrent_queue and, if one is present, reuse that element. If no reusable element is available, it simply invokes push_back. So, to follow the train of thought, we have a vector:
Initial Values: 0, 1, 2, 3, 4, 5
After "erase(3)": 0, 1, 2, x, 4, 5 -- any iterators would know to skip over the placeholder for 3.
After "push_next(9)": 0, 1, 2, 9, 4, 5 -- notice that the 3rd element was reused
One way I can see this being implemented would be with a wrapper class that contains an additional boolean for each element. The wrapper would indicate whether the element is in use. That obviously increases the memory footprint and potentially decreases performance.
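Something like the following rough sketch is what I have in mind (purely illustrative: the slot wrapper, push_next, and the class name are made up, and it assumes a TBB version whose concurrent_vector::push_back returns an iterator and whose concurrent_queue has try_pop):
[cpp]
#include <tbb/concurrent_vector.h>
#include <tbb/concurrent_queue.h>
#include <atomic>
#include <cstddef>

template<typename T>
struct slot
{
    T value;
    std::atomic<bool> used;   // iterators skip slots where used == false

    slot() : used(false) {}
    slot(const slot& s) : value(s.value), used(s.used.load()) {} // copyable for concurrent_vector
};

template<typename T>
class reusable_vector
{
    tbb::concurrent_vector< slot<T> > items;
    tbb::concurrent_queue<std::size_t> free_indices;   // indices of erased slots
public:
    void erase(std::size_t i)
    {
        items[i].used = false;    // leave the slot in place, just mark it unused
        free_indices.push(i);     // make the index available for reuse
    }

    std::size_t push_next(const T& v)
    {
        std::size_t i;
        if (free_indices.try_pop(i))          // reuse an erased slot if one exists
        {
            items[i].value = v;
            items[i].used = true;
            return i;
        }
        typename tbb::concurrent_vector< slot<T> >::iterator it =
            items.push_back(slot<T>());       // otherwise grow the vector
        it->value = v;
        it->used = true;
        return static_cast<std::size_t>(it - items.begin());
    }
};
[/cpp]
A real implementation would still have to coordinate erase with concurrent readers of the same slot, which is exactly the hard part.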
I am pursuing an option where I am writing a concurrent_list, which uses the same accessor-based semantics as concurrent_hash_map for insert / erase. I'll post my progress here, unless someone else already has one. I'd be more than happy to share progress with you if you do.
Regards,
-Jared
Anton,
Arch's post was unrealistic in the extreme. One would generally not construct a linked list to contain a single integer per node; instead, one would tend to create an array of integers (which is essentially what the container is doing).
Typical nodes in a linked list have a few to many member variables. Arch's example was an apples-to-oranges comparison; I'm surprised he posted that instead of something more realistic. BTW, how would Arch propose to do the RadixSort threading challenge without a linked list? And if it were done without a linked list, what would the performance ratio be compared with the linked list of single integers vs. the vector of integers?
Linked lists are generally NOT used when you allocate a bunch of things at the same time. Linked lists tend to be used when the allocations come about in an uncontrolled manner, at which point cache-alignment issues are a moot point. In my experience with parallel linked lists in QuickThread, when the data within the node exceeds one cache line (e.g. 8 doubles) and the task is floating-point intensive, parallel processing of linked lists can at times be superscalar.
[cpp]
Array size 4
SerialFPU...                  29009763
SerialInt...                   6019047
ParallelFPU(L0$)...           31029255   0.934917
ParallelInt(L0$)...            6329349   0.950974
ParallelFPU(L1$)...           31362606   0.924979
ParallelInt(L1$)...            6353127   0.947415
ParallelFPU(L2$)...           15890499   1.8256
ParallelInt(L2$)...            8049762   0.74773
ParallelFPU(OneEach_L2$)...   16983612   1.7081
ParallelInt(OneEach_L2$)...    5626710   1.06973
ParallelFPU(AllThreads$)...    9536679   3.04191
ParallelInt(AllThreads$)...    6488973   0.927581

Array size 8
SerialFPU...                  62163234
SerialInt...                   6421788
ParallelFPU(L0$)...            8301969   7.48777
ParallelInt(L0$)...            6682041   0.961052
ParallelFPU(L1$)...            8478117   7.3322
ParallelInt(L1$)...            6820056   0.941603
ParallelFPU(L2$)...            4735278  13.1277
ParallelInt(L2$)...            8619426   0.745037
ParallelFPU(OneEach_L2$)...    6094098  10.2006
ParallelInt(OneEach_L2$)...    5939757   1.08115
ParallelFPU(AllThreads$)...    4452543  13.9613
ParallelInt(AllThreads$)...    4425300   1.45115
[/cpp]
The big numbers are tick counts from _rdtsc; the second column is the speedup relative to the serial run.
Note: AllThreads$ is running 4 threads on a Q6600.
L2$ is 2 threads in the same L2 cache.
OneEach_L2$ is 2 threads in separate L2 caches.
L0$ and L1$ are a single thread (run through the task system).
The above is not a comprehensive test, as the "computation" was a bogus expression chosen to ensure the work was not a simple memory read or memory-to-memory copy.
Jim Dempsey
I forgot to point out that using 2 threads (with L2$) gave about the same performance as using 4 threads. Therefore, if you were to see this behavior in your application, it would be recommended to schedule only 2 threads (on a Q6600). I may be running this test soon on a Core i7 (4-core with HT).
Jim Dempsey
I have a hunch that the C++0x guys will tell you that RCU has undefined behaviour, so if you use it your computer may catch fire for all they care (luckily the O.S. will prevent anything much worse than, say, the corruption or deletion of all your files). If my impression is mistaken, I would certainly like to be informed why.
Paul McKenney's page: http://www.rdrop.com/users/paulmck
He also had to testify in that SCO vs IBM lawsuit:
http://www.groklaw.net/articlebasic.php?story=20061028211523142
Anyway, I am wondering why you think that the C++0x guys would say that RCU has undefined behavior?
RCU is very performance-friendly in the sense that it can allow reader threads to traverse some data structures without using any atomic operations and (DEC Alpha aside for a moment) without any expensive memory barriers. It works extremely well for "read-mostly" access patterns. The readers can run concurrently while writers perform mutations on the data structure. The RCU aspect of the algorithm is for writer threads only. A writer performs a mutation by reading a node of a linked list, making a new copy of said node, atomically installing the new one, and then deferring the destruction of the old node until it is in a quiescent state. The writers can use whatever synchronization technique they need to perform this operation. The deferred destruction of the node is handled by the RCU subsystem.
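To make the writer side concrete, here is a rough sketch (illustrative only: defer_free() stands in for whatever deferred-reclamation hook the RCU subsystem provides, and writer_mutex is just one way the writers might serialize among themselves):
[cpp]
#include <atomic>
#include <mutex>

struct node
{
    int key;
    int value;
    std::atomic<node*> next;
};

std::atomic<node*> list_head;   // head of a singly linked list
std::mutex writer_mutex;        // writers serialize among themselves

void defer_free(node* n);       // destroys n only after all readers are quiescent

// Replace the value stored under 'key' without disturbing concurrent readers.
void rcu_update(int key, int new_value)
{
    std::lock_guard<std::mutex> guard(writer_mutex);

    std::atomic<node*>* link = &list_head;
    for (node* cur = link->load(std::memory_order_relaxed);
         cur != 0;
         link = &cur->next, cur = link->load(std::memory_order_relaxed))
    {
        if (cur->key != key)
            continue;

        // Read + copy: build a private replacement for the node being updated.
        node* copy = new node;
        copy->key   = cur->key;
        copy->value = new_value;
        copy->next.store(cur->next.load(std::memory_order_relaxed),
                         std::memory_order_relaxed);

        // Update: publish the copy atomically; a reader sees either the old
        // node or the new one, never a half-written node.
        link->store(copy, std::memory_order_release);

        // Defer: the old node is reclaimed only after a grace period.
        defer_free(cur);
        return;
    }
}
[/cpp]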
Here is some brief information on how efficient non-blocking reader patterns can affect performance:
http://webpages.charter.net/appcore/vzoom/round-1.pdf
(this pre-alpha experimental contest submission won me a brand new T200 server - https://coolthreads.dev.java.net)
http://www.rdrop.com/users/paulmck/RCU/lockperf.2004.01.17a.pdf
http://www.google.com/patents/about?id=c6ybAAAAEBAJ&dq=chris+thomasson
Sorry, but this is total nonsense. What part of RCU can not be implemented in C++0x? I've implemented several RCU-like algorithms in C++0x with 100% compliance with C++0x standard. No problems at all.
It does not matter much that some operations are declared undefined behavior (UB) in C++0x, because those operations are no more than plain programming errors. You just do not have them in a correct program, and C++0x provides the means to implement everything you need (and probably even more, and definitely more than Java or C#) in a correct way such that there is no UB in your program. Why don't you complain that dereferencing a dangling pointer in C++03 causes strange undefined results? Well, just because it's C++, where there is no mommy standing behind you.
The Java approach is definitely "safer"; however, that safety comes at a cost. Every constructor call has to contain a memory barrier "just in case". C++ took a different road years ago, and if you don't like this then you just have to change languages and not complain about C++.
The term RCU is quite misleading, because RCU is not about updates at all. It's about reads.
And I've demonstrated here how it can be performance-friendly regarding reads:
http://software.intel.com/en-us/forums/showthread.php?t=60494
A 50x performance difference even on a quad-core, not to mention larger machines.
Indeed.
Nope. There is std::memory_order_consume, which allows a reader thread to iterate over linked data structures without fences.
I think the first implementations of the C++0x atomics library we see will be quite naive and straightforward. However, I can imagine future compilers doing very aggressive optimizations around fences - allowing code to sink below an acquire fence and hoist above a release fence, allowing code that operates on local data to move in either direction, fence coalescing, etc. So a C++0x implementation can be even faster.
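For illustration, a fence-free reader traversal under std::memory_order_consume could look like this (a minimal sketch; the node type and list are just placeholders):
[cpp]
#include <atomic>

struct node
{
    int data;
    std::atomic<node*> next;
};

std::atomic<node*> head;   // writers publish new nodes with memory_order_release

// The consume loads carry a data dependency into the following dereference,
// so on most architectures they compile down to plain loads with no fences.
int sum_list()
{
    int sum = 0;
    for (node* n = head.load(std::memory_order_consume);
         n != 0;
         n = n->next.load(std::memory_order_consume))
    {
        sum += n->data;
    }
    return sum;
}
[/cpp]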
My comment about how a C++0x implementation would be less efficient than a native one was basically referring to how the readers would indicate to the polling logic that they are in a non-quiescent region. I think you could do it with something like:
[cpp]
struct per_thread
{
    std::atomic<unsigned> m_acquire; // = 0  (counter type assumed here)
    std::atomic<unsigned> m_release; // = 0
};

void rcu_acquire()
{
    per_thread* self = [...];
    self->m_acquire.fetch_add(1, std::memory_order_acquire);
}

void rcu_release()
{
    per_thread* self = [...];
    self->m_release.fetch_add(1, std::memory_order_release);
}

void reader_iterate()
{
    rcu_acquire();
    [...];
    rcu_release();
}
[/cpp]
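The polling side is not shown above; assuming the per_thread counters from the snippet, the quiescence test could be roughly (again, only a sketch):
[cpp]
// A reader that was inside a protected region when the snapshot of m_acquire
// was taken has left it once its m_release counter catches up to that value.
bool has_passed_quiescent_point(const per_thread& t, unsigned acquire_snapshot)
{
    return t.m_release.load(std::memory_order_acquire) >= acquire_snapshot;
}
[/cpp]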
"Sorry, but this is total nonsense. What part of RCU can not be implemented in C++0x? I've implemented several RCU-like algorithms in C++0x with 100% compliance with C++0x standard. No problems at all."
Hey, luckily they were just self-declared "hunches" and "impressions", with requests to correct them if needed (although preferably not in these exact terms). Learning something new once in a while is what we're here for (I'm assuming you've done your homework). I'll let those other remarks slide, because I don't want to redo that discussion.
To the best of my knowledge, memory_order_consume must be a no-op on everything except Alpha, because all data dependencies will be respected by the compiler.
What do you mean by referring to 'read_barrier_depends()'?
There is an "example memory model implementation for POWER" by Paul McKenney:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2745.html
At least there, load-consume maps to a plain 'ld'.
Ah that.
The main problem with coarse-grained user-space PDR, in which readers may traverse a linked list using only plain load instructions, is that a thread may be preempted just before dereferencing a deferred pointer. If this happens, system-wide memory reclamation stalls, even if automatic *hardware-level* quiescent-period detection is used.
So an application that uses such PDR must be designed so that long-lasting preemptions are impossible (i.e. it does not oversubscribe threads, does not use blocking calls, etc). And if this is done, then one may use coarser-grained protected regions. For example, for a server application the infrastructure would protect the whole IO callback; then a worker thread may access any data structure for reading without any overhead. One atomic RMW on a *thread-local* location per IO callback is not too much (~30-50 cycles on modern Intel hardware).
There is another possibility. Assume there is a global epoch counter 'int g_epoch'; the counter is incremented once in a while (when a thread accumulates enough garbage). Worker threads periodically check for epoch promotion: if (g_epoch == thread_local_last_seen_epoch) then do nothing, otherwise do some work regarding epoch promotion. This way the "per IO callback" overhead reduces down to one comparison of cached variables (in the common case when the epoch has not changed).
Yes, this approach requires periodic activity from all threads. But my point is that efficient user-space PDR cannot be built without that anyway.
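For example, the per-callback check could be as small as this (a sketch; 'check_epoch' and the promotion hook are illustrative names, g_epoch and the cached per-thread epoch are as described above):
[cpp]
#include <atomic>

std::atomic<int> g_epoch;                        // bumped when a thread accumulates enough garbage
thread_local int thread_local_last_seen_epoch;   // cached per-worker copy

void do_epoch_promotion_work();                  // flush deferred garbage, etc. (placeholder)

// Called periodically, e.g. once per IO callback.
void check_epoch()
{
    int e = g_epoch.load(std::memory_order_acquire);
    if (e == thread_local_last_seen_epoch)
        return;                          // common case: one comparison of cached variables
    thread_local_last_seen_epoch = e;    // epoch advanced: do the promotion work
    do_epoch_promotion_work();
}
[/cpp]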
Btw, how do you solve this problem in VZOOM? My guess is that you are using something similar to "SMR+RCU", i.e. while acquiring a pointer the thread executes some "critical store-load-check" sequence, but without a #StoreLoad in between:
[cpp]
// Assumed context: hash() maps a pointer to a small index, and
// thread_local_counters is a per-thread array of counters.
void strong_acquire(void** p)
{
    void* x = *p;
    for (;;)
    {
        size_t h = hash(x);
        thread_local_counters[h] += 1;  // STORE: announce the hashed slot
        void* x2 = *p;                  // LOAD
        if (x == x2)                    // CHECK
            return;
        thread_local_counters[h] -= 1;  // pointer changed: undo and retry
        x = x2;
    }
}
[/cpp]
If so, then this cannot be called "zero overhead", only "virtually zero overhead" :)
Otherwise I do not see how you can resolve the problem of a thread preempted just before dereferencing a pointer. The GC thread does not know which pointer a worker thread is going to dereference in the future, so it cannot reclaim any memory at all.