Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Concurrent List

jaredkeithwhite
New Contributor I
1,831 Views
Does anyone know why there isn't an implementation for a concurrent linked list within the TBB library? It seems this is a pretty large oversight.

I need a container that allows fast and frequent iteration, which would suggest concurrent_vector, but that also allows frequent add / delete, all of it happening concurrently. It seems apparent why concurrent_vector doesn't support an erase / remove method: to maintain its performance characteristics (in particular, erase would have to invalidate current iterators). Having said that, it seems an erase method could be improvised, especially if the erasing thread were willing to wait for iterating threads to complete.

Another option for erasing from a concurrent_vector might be to reuse individual elements once they are erased, without moving the rest of the array values around. Let's say you have a vector of 6 integers: 0, 1, 2, 3, 4, 5, and you want to erase element[3], which also holds the value 3. That index (3) could be pushed onto a concurrent_queue and element 3 marked "unused". Another method, say "push_next", could attempt to pop an index off of the concurrent_queue if one is present and reuse that slot. If no reusable element is available, it simply invokes push_back. So, to follow the train of thought, we have a vector:

Initial Values: 0, 1, 2, 3, 4, 5
After "erase(3)": 0, 1, 2, x, 4, 5 -- any iterators would know to skip over the placeholder for 3.
After "push_next(9)": 0, 1, 2, 9, 4, 5 -- notice that the 3rd element was reused

One way I can see this being implemented would be with a wrapper class that adds a boolean to each element, indicating whether the element is in use. That obviously increases the memory footprint, and potentially decreases performance.
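For illustration, a minimal sketch of what I mean by the wrapper (enhanced_wrapper and try_claim are placeholder names, not existing TBB API; the tbb::atomic flag is just one way to mark a slot unused):

[cpp]// Hypothetical per-element wrapper: a slot that can be marked unused and reused.
// The extra flag is the memory overhead mentioned above.
#include "tbb/atomic.h"

template <typename T>
struct enhanced_wrapper {
    T value;
    tbb::atomic<bool> used;   // true while the slot holds a live element

    enhanced_wrapper() { used = false; }

    // Claim a free slot for reuse: flips 'used' from false to true exactly once.
    bool try_claim() { return used.compare_and_swap(true, false) == false; }

    // Mark the slot as erased so iterating threads know to skip it.
    void release() { used = false; }
};
[/cpp]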

I am pursuing an option where I write a concurrent_list that uses the same accessor-based semantics as concurrent_hash_map for insert / erase. I'll post my progress here, unless someone else already has one. I'd be more than happy to share progress with you, if you do.

Regards,
-Jared
0 Kudos
21 Replies
robert_jay_gould
Beginner
1,723 Views
Quoting - jaredkeithwhite
Does anyone know why there isn't an implementation for a concurrent linked list within the TBB library? It seems this is a pretty large oversight.


Although I agree it seems odd that there is no concurrent_list, normally I've been able to solve any problem with either vector, hash or queue. Won't your problem fit in one of these?
0 Kudos
Anton_Pegushin
New Contributor II
1,723 Views
Quoting - jaredkeithwhite
Does anyone know why there isn't an implementation for a concurrent linked list within the TBB library? It seems this is a pretty large oversight.
Hi,

Arch Robison posted a blog entry answering this exact question a while back. Please go ahead and read it along with the discussion in the blog comments; I remember finding it very informative. Here is the link: "Linked Lists - Incompatible With Parallel Programming?"

And I think I like your own proposal of a wrapper around a concurrent_vector, but I'd use a concurrent_queue for the "can be re-used" element indices. Push an index onto the concurrent_queue whenever you "erase" an element, and pop_if_present when trying to re-use one. This should satisfy your problem statement and at the same time be concurrent. What do you think?


0 Kudos
jaredkeithwhite
New Contributor I
1,723 Views

Although I agree it seems odd that there is no concurrent_list, normally I've been able to solve any problem with either vector, hash or queue. Won't your problem fit in one of these?

Robert,

It turns out that for nearly 100% of my issues, I can use a concurrent_hash_map. In cases where it's not going to be ideal, I can still get by currently with a map. For my implementation, order is unimportant. But for ordered sets where you want to be able to concurrently add and remove, concurrent_vector doesn't fit the bill -- there is no remove facility.

The big question is one of iteration. I have an implementation that is FREQUENTLY iterated and occasionally added to / removed from. By frequently, I mean about 15 times per second. My lists are very small (5-10 elements maximum). By occasionally, I mean on the order of every 5 minutes for most, and every 30 seconds for others.

So I would definitely opt for fast iteration over fast add/remove. I'm just hoping that concurrent_hash_map has a quick iterator. If not, I'm going to be stuck writing my own wrapper around std::list or std::vector with read/write mutexes.

0 Kudos
jaredkeithwhite
New Contributor I
1,723 Views
Anton,

I think we're proposing something very similar, if not exact. For the sake of brevity:

template <typename T>
class enhanced_concurrent_vector : public concurrent_vector< enhanced_wrapper<T> >
- concurrent_queue<size_t> reuse_indices;
- push_next(T)
- would call "reuse_indices.pop_if_present()" and, if an index comes back, write into that slot
- if no queue item is present, would invoke push_back or push_front (perhaps a method could be written for each option)
- erase(index)
- would mark the wrapper as empty and push the index onto reuse_indices (would need to mutex lock these two operations)

We would need to override operator* for the concurrent_vector iterator(s), or perform similar operations in the enhanced_wrapper, so that iterators return references to T, not enhanced_wrapper<T>.
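To make that concrete, here is a rough, untested sketch of the idea (push_next, erase, and enhanced_wrapper are my own names, not existing TBB API; try_pop is the newer spelling of pop_if_present on tbb::concurrent_queue):

[cpp]// Rough sketch only -- not an existing TBB container, and the slot-reuse path
// is not fully synchronized against concurrent iteration.
#include "tbb/concurrent_vector.h"
#include "tbb/concurrent_queue.h"
#include "tbb/atomic.h"

template <typename T>
struct enhanced_wrapper {
    T value;
    tbb::atomic<bool> used;
    enhanced_wrapper() { used = false; }
    enhanced_wrapper(const T& v) : value(v) { used = true; }
};

template <typename T>
class enhanced_concurrent_vector
    : public tbb::concurrent_vector< enhanced_wrapper<T> > {
    tbb::concurrent_queue<size_t> reuse_indices;   // indices of erased ("unused") slots
public:
    // Reuse an erased slot if one is available, otherwise grow the vector.
    void push_next(const T& v) {
        size_t idx;
        if (reuse_indices.try_pop(idx)) {
            (*this)[idx].value = v;
            (*this)[idx].used = true;
        } else {
            this->push_back(enhanced_wrapper<T>(v));
        }
    }
    // Mark the slot unused and remember its index; iterating threads are
    // expected to skip slots whose 'used' flag is false.
    void erase(size_t idx) {
        (*this)[idx].used = false;
        reuse_indices.push(idx);
    }
};
[/cpp]

Iteration would then skip wrappers whose used flag is false, which keeps existing iterators valid at the cost of one extra flag per element.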


Perhaps when I get a week or so of free time I'll play around with this. Otherwise if I wind up having one of my guys write it, I'll submit it to the community for review.




0 Kudos
jimdempseyatthecove
Honored Contributor III
1,723 Views
Hi,

Arch Robison posted a blog entry answering this exact question a while back. Please go ahead and read it along with the discussion in the blog comments, I remember I found it very informative. Here is the link"Linked Lists - Incompatible With Parallel Programming?"

And I think I like your own proposal with a wrapper around a concurrent_vector, but I'd use a concurrent_queue for "can be re-used" elements indeces. Push an index to the concurrent_queue, whenever you are "erasing" an element and pop_if_present, when trying to re-use. This should satisfy your problem statement and at the same time be concurrent. What do you think?



Anton,

Arch's post was unrealistic in the extreme. One would generally not construct a linked list to contain a single integer per node. Instead, one would tend to create an array of integers (which is essentially what the container is doing).

Typical nodes in a linked list have a few, to many, member variables. Arch's example was an apples-to-oranges comparison; I'm surprised he posted that instead of something more realistic. BTW, how would Arch propose to do the RadixSort threading challenge without a linked list? And if done without a linked list, then what is the performance ratio compared to the linked list of single integers vs. a vector of integers?

Linked lists are generally NOT used when you allocate a bunch of things at the same time. Linked lists tend to be used when the allocations come about in an uncontrolled manner, at which point cache alignment issues are moot. In my experience with parallel linked lists in QuickThread, when the data within a node exceeds one cache line (e.g. 8 doubles) and the task is floating-point intensive, parallel processing of linked lists can at times be superscalar.
[cpp]Array size 4
SerialFPU... 29009763
SerialInt... 6019047
ParallelFPU(L0$)... 31029255  0.934917
ParallelInt(L0$)... 6329349  0.950974
ParallelFPU(L1$)... 31362606  0.924979
ParallelInt(L1$)... 6353127  0.947415
ParallelFPU(L2$)... 15890499  1.8256
ParallelInt(L2$)... 8049762  0.74773
ParallelFPU(OneEach_L2$)... 16983612  1.7081
ParallelInt(OneEach_L2$)... 5626710  1.06973
ParallelFPU(AllThreads$)... 9536679  3.04191
ParallelInt(AllThreads$)... 6488973  0.927581
Array size 8
SerialFPU... 62163234
SerialInt... 6421788
ParallelFPU(L0$)... 8301969  7.48777
ParallelInt(L0$)... 6682041  0.961052
ParallelFPU(L1$)... 8478117  7.3322
ParallelInt(L1$)... 6820056  0.941603
ParallelFPU(L2$)... 4735278  13.1277
ParallelInt(L2$)... 8619426  0.745037
ParallelFPU(OneEach_L2$)... 6094098  10.2006
ParallelInt(OneEach_L2$)... 5939757  1.08115
ParallelFPU(AllThreads$)... 4452543  13.9613
ParallelInt(AllThreads$)... 4425300  1.45115
[/cpp]

The big number is ticks from _rdtsc.
Note, AllThreads$ is running 4 threads on a Q6600.
L2$ is 2 threads in the same L2 cache.
OneEach_L2$ is 2 threads in separate L2 caches.
L0$ and L1$ are a single thread (run through the task system).

The above is not a comprehensive test, as the "computation" was a bogus expression to ensure the work was not a simple memory read or memory-to-memory copy.

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,723 Views

I forgot to point out, using 2 threads (with L2$) was about the same performance as using 4 threads. Therefore, if you were to experience this with your application, it would be recommended that you schedule only 2 threads (on Q6600). I may be running this test soon on a Core i7 (4-core with HT).

Jim Dempsey


0 Kudos
robert_jay_gould
Beginner
1,723 Views
Quoting - jaredkeithwhite

Robert,

It turns out that for nearly 100% of my issues, I can use a concurrent_hash_map. In cases where it's not going to be ideal, I can still get by currently with a map. For my implementation, order is unimportant. But for ordered sets where you want to be able to concurrently add and remove, concurrent_vector doesn't fit the bill -- there is no remove facility.

The big question is one of iteration. I have an implementation that is FREQUENTLY iterated, and occasionally added/removed. By frequently, I mean about 15 times per second. My lists are very small (5-10 elements maximum). By occasionally add/remove, we would do this on the order of every 5 minutes for most, and every 30 seconds for others.

So I would definitely opt for fast iteration over fast add/remove. I'm just hoping that concurrent_hash_map has a quick iterator. If not, I'm going to be stuck writing my own wrapper around std::list or std::vector with read/write mutexes.


Yeah, I agree that tbb::concurrent_vector's inability to delete is kind of a deal breaker at times. As for iteration, my needs have been biased towards inserts/removes/finds with very little iteration, so I haven't felt the lack of a list; but in an iteration-heavy situation like yours, none of the current concurrent containers will be a good match.

But now that you have provided us with a well-defined example of your use-case, I would definitely use std::vector + a guard. For such small collections there is no point in doing parallel iteration (overhead vs. work), so you won't be needing parallel_for (and friends). Besides, a concurrent container will have at least atomic overheads that a std::vector won't, so overall you'll get better performance with a std::vector. Just be sure that all your inserts/removes get done outside of the iteration phase and you should be set.
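Something along these lines is what I mean by std::vector + guard (just a sketch; guarded_vector is a made-up name, and tbb::spin_rw_mutex is an arbitrary choice of reader/writer lock):

[cpp]// Sketch of "std::vector + guard": many frequent readers, rare writers.
#include <vector>
#include "tbb/spin_rw_mutex.h"

template <typename T>
class guarded_vector {
    std::vector<T> items;
    mutable tbb::spin_rw_mutex guard;
public:
    // Frequent path: iterate under a shared (read) lock.
    template <typename Func>
    void for_each(Func f) const {
        tbb::spin_rw_mutex::scoped_lock lock(guard, /*write=*/false);
        for (typename std::vector<T>::const_iterator it = items.begin();
             it != items.end(); ++it)
            f(*it);
    }
    // Rare path: add/remove under an exclusive (write) lock.
    void add(const T& v) {
        tbb::spin_rw_mutex::scoped_lock lock(guard, /*write=*/true);
        items.push_back(v);
    }
    void remove(const T& v) {
        tbb::spin_rw_mutex::scoped_lock lock(guard, /*write=*/true);
        for (typename std::vector<T>::iterator it = items.begin();
             it != items.end(); ++it)
            if (*it == v) { items.erase(it); break; }
    }
};
[/cpp]

With 5-10 elements, walking the whole vector under the read lock 15 times a second is negligible work, and the write lock is only taken every 30 seconds or so.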

good luck!

0 Kudos
Chris_M__Thomasson
New Contributor I
1,723 Views

[...] Besides a concurrent container will have at least atomic overheads, that a std::vector won't, so overall you'll get better performance with a std::vector. Just be sure that all your inserts/removes get done outside of the iteration phase and you should be set.

good luck!



Actually, you can achieve iteration over a concurrent collection, say a linked list, with virtually zero overhead: no atomic operations and, DEC Alpha aside for a moment, no expensive membars either. Think in terms of RCU...
0 Kudos
RafSchietekat
Valued Contributor III
1,723 Views

I have a hunch that the C++0x guys will tell you that RCU has undefined behaviour, so if you use it your computer may catch fire for all they care (luckily the O.S. will prevent anything much worse than, say, the corruption or deletion of all your files). If my impression is mistaken, I would certainly like to be informed why.

0 Kudos
Chris_M__Thomasson
New Contributor I
1,723 Views
Quoting - Raf Schietekat

I have a hunch that the C++0x guys will tell you that RCU has undefined behaviour, so if you use it your computer may catch fire for all they care (luckily the O.S. will prevent anything much worse than, say, the corruption or deletion of all your files). If my impression is mistaken, I would certainly like to be informed why.


Actually, one of the C++0x guys is an inventor of RCU:

http://www.rdrop.com/users/paulmck

He also had to testify in that SCO vs IBM lawsuit:

http://www.groklaw.net/articlebasic.php?story=20061028211523142

Anyway, I am wondering why you think that the C++0x guys would say that RCU has undefined behavior?
0 Kudos
robert_jay_gould
Beginner
1,723 Views

Actually, one of the C++0x guys is an inventor of RCU:

http://www.rdrop.com/users/paulmck

He also had to testify in that SCO vs IBM lawsuit:

http://www.groklaw.net/articlebasic.php?story=20061028211523142

Anyway, I am wondering why you think that the C++0x guys would say that RCU has undefined behavior?

My personal (hopefully incorrect/misguided) impression, and Raf's too as far as I know, is that the C++0x team has randomly declared 50% of multithreading solutions undefined just for the fun of it :)

In which case there is a 50% chance RCU could make your computer become self-aware (for a sort of positive undefined behavior, once in a while).

Anyway, RCU should be doable in a broad sense, but I'm not sure we'll get a std::rcu facility, and without that, implementations like Linux's could be undefined (maybe/probably?) in C++0x.

Also I honestly can't imagine how RCU is very performance friendly unless the U phase outweighs the RC by a large factor, as it does in database-like transactions.

0 Kudos
Chris_M__Thomasson
New Contributor I
1,723 Views

My personal (hopefully incorrect/misguided) impression, and Raf's too as far as I know, is that the C++0x team has randomly declared 50% of multithreading solutions undefined just for the fun of it :)

In which case there is a 50% chance RCU could make your computer become self-aware (for a sort of positive undefined behavior, once in a while).

Anyways RCU should be doable in a broad sense, but not sure if we'll get a std::rcu operation and without that, implementations like Linux's could be undefined (maybe/probably?), in C++0x.

Also I honestly can't imagine how RCU is very performance friendly unless the U phase outweighs the RC by a large factor, as it does in database-like transactions.



I think I might be able to code a user-space RCU-like algorithm using C++0x and still follow all the rules. However, it would not have automatic quiescent period detection, and it would be less efficient than a native implementation because there would have to be explicit memory barriers for reader threads.

RCU is very performance friendly in the sense that it can allow reader threads to traverse some data structures without using any atomic operations or (DEC Alpha aside for a moment) any expensive memory barriers. It works extremely well for "read-mostly" access patterns. The readers can run concurrently while writers perform mutations on the data structure. The RCU aspect of the algorithm is for writer threads only: a writer performs a mutation by reading a node of a linked list, making a new copy of that node, atomically installing the new one, and then deferring the destruction of the old node until a quiescent state has been reached. The writers can use whatever synchronization technique they need to coordinate this operation; the deferred reclamation of the old node is handled by the RCU subsystem.
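A rough sketch of the writer-side copy-and-publish step in C++0x-style atomics (all names are made up; quiesce_then_delete stands in for whatever deferral mechanism the RCU subsystem provides, and the placeholder body here is NOT safe on its own):

[cpp]// Writer-side "read, copy, update" step for a singly linked list (sketch only).
#include <atomic>

struct node {
    int value;
    std::atomic<node*> next;
};

// Hypothetical hook into the RCU subsystem. A real implementation defers the
// delete until every reader has passed through a quiescent state; this
// placeholder frees immediately and is NOT safe on its own.
void quiesce_then_delete(node* old_node) { delete old_node; }

// Replace the node after 'prev' with an updated copy. Writers are assumed to
// be serialized by some external lock; readers never block.
void rcu_update(node* prev, int new_value) {
    node* old_node = prev->next.load(std::memory_order_relaxed);
    node* new_node = new node;
    new_node->value = new_value;
    new_node->next.store(old_node->next.load(std::memory_order_relaxed),
                         std::memory_order_relaxed);
    // Publish the copy: release ordering makes its contents visible to readers
    // that load 'prev->next' with consume/acquire semantics.
    prev->next.store(new_node, std::memory_order_release);
    quiesce_then_delete(old_node);
}
[/cpp]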

Here is some brief information on how efficient non-blocking reader patterns can affect performance:

http://webpages.charter.net/appcore/vzoom/round-1.pdf
(this pre-alpha experimental contest submission won me a brand new T200 server - https://coolthreads.dev.java.net)

http://www.rdrop.com/users/paulmck/RCU/lockperf.2004.01.17a.pdf

http://www.google.com/patents/about?id=c6ybAAAAEBAJ&dq=chris+thomasson

0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,723 Views
My personal (hopefully incorrect/misguided) impression, and Raf's too as far as I know, is that the C++0x team has randomly declared 50% of multithreading solutions undefined just for the fun of it :)

In which case there is a 50% chance RCU could make your computer become self aware (for a sort of positive undefinedbehavior, once in a while).

Sorry, but this is total nonsense. What part of RCU cannot be implemented in C++0x? I've implemented several RCU-like algorithms in C++0x with 100% compliance with the C++0x standard. No problems at all.
It does not matter that much that some operations are declared as undefined behavior (UB) in C++0x, because those operations are no more than plain programming errors. You just do not have them in a correct program, and C++0x provides the means to implement everything you need (and probably even more, and definitely more than Java or C#) in a correct way, so that there is no UB in your program. Why don't you complain that dereferencing a dangling pointer in C++03 causes strange undefined results? Well, just because it's C++, where there is no mommy standing behind you.
The Java approach is definitely "safer"; however, that safety comes at a cost: every constructor call has to contain a memory barrier "just in case". C++ took a different road years ago, and if you don't like this then you have to change languages rather than complain about C++.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,723 Views
Also I honestly can't imagine how RCU is very performance friendly unless the U phase outweighs the RC by a large factor, as it does in database-like transactions.


The term RCU is quite misleading, because RCU is not about updates at all. It's about reads.
And I've demonstrated here how it can be performance friendly regarding reads:
http://software.intel.com/en-us/forums/showthread.php?t=60494
A 50x performance difference even on a quad-core, to say nothing of larger machines.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,722 Views
I think I might be able to code a user-space RCU-like algorithm using C++0x and still follow all the rules.

Indeed.

and it would be less efficient than a native implementation because there would have to be explicit memory barriers for reader threads.

Nope. There is std::memory_order_consume, which allows a reader thread to iterate over linked data structures w/o fences.
I think the first implementations of the C++0x atomics library we see will be quite naive and straightforward. However, I can imagine future compilers doing very aggressive optimizations around fences - allowing code to sink below an acquire fence and hoist above a release fence, allowing code that operates on local data to move in either direction, fence coalescing, etc. So a C++0x implementation could be even faster.
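For example, a reader traversal could look like this (just a sketch over a made-up list layout):

[cpp]// Reader-side traversal relying on memory_order_consume: on most hardware the
// data dependency from 'next' to the node contents requires no fence at all.
#include <atomic>

struct node {
    int value;
    std::atomic<node*> next;
};

int sum_list(const std::atomic<node*>& head) {
    int sum = 0;
    for (node* n = head.load(std::memory_order_consume);
         n != 0;
         n = n->next.load(std::memory_order_consume)) {
        sum += n->value;   // dependent load, ordered after the consume load
    }
    return sum;
}
[/cpp]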

0 Kudos
Chris_M__Thomasson
New Contributor I
1,722 Views
Quoting - Dmitriy Vyukov

Indeed.

and it would be less efficient than a native implementation because there would have to be explicit memory barriers for reader threads.

Nope. There is std::memory_order_consume which allows reader thread to iterate over linked data structures w/o fences.
I think first implementations of C++0x atomics library we see will be quite naive and straightforward. However I can imagine future compilers which will do very agressive optimizations around fences - allow code to sink below acquire fence and hoist above release fence, allow code that operates on local data to move in any direction, fence coalescing, etc. So that C++0x implementation can be even faster.


Do you know if memory_order_consume boils down to a trailing `MEMBAR #LoadLoad' on a SPARC? Or is it there to mimic `read_barrier_depends()' on Linux?

WRT my comment about how a C++0x implementation would be less efficient than a native one: I was basically referring to how the readers would indicate to the polling logic that they are in a non-quiescent region. I think you could do it with something like:


[cpp]struct per_thread {
  std::atomic<unsigned> m_acquire; // = 0
  std::atomic<unsigned> m_release; // = 0
};


void rcu_acquire() {
  per_thread* self = [...];
  self->m_acquire.fetch_add(1, std::memory_order_acquire);
}


void rcu_release() {
  per_thread* self = [...];
  self->m_release.fetch_add(1, std::memory_order_release);
}


void reader_iterate() {
  rcu_acquire();
    [...];
  rcu_release();
}

[/cpp]

That would give the polling logic some critical information. I am not sure how this could be done in C++0x without using atomic RMW and acquire/release barriers. Perhaps I am missing something vital here. Of course, as you know, this aspect of the algorithm can be virtually free when you're using platform-specific automatic synchronization-epoch detection techniques.
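For completeness, the polling side I have in mind would be something like this (sketch; how the reclaimer obtains every reader's per_thread record is elided):

[cpp]// Reclaimer-side check (sketch): a reader is quiescent when its acquire and
// release counters match, i.e. it is not currently inside reader_iterate().
#include <atomic>
#include <cstddef>
#include <vector>

struct per_thread {
    std::atomic<unsigned> m_acquire; // incremented by rcu_acquire()
    std::atomic<unsigned> m_release; // incremented by rcu_release()
};

bool all_readers_quiescent(const std::vector<per_thread*>& readers) {
    for (std::size_t i = 0; i != readers.size(); ++i) {
        unsigned rel = readers[i]->m_release.load(std::memory_order_acquire);
        unsigned acq = readers[i]->m_acquire.load(std::memory_order_acquire);
        if (acq != rel)          // still inside (or re-entered) a read-side region
            return false;
    }
    return true;                 // every reader has passed a quiescent point
}
[/cpp]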
0 Kudos
RafSchietekat
Valued Contributor III
1,722 Views

"Sorry, but this is total nonsense. What part of RCU can not be implemented in C++0x? I've implemented several RCU-like algorithms in C++0x with 100% compliance with C++0x standard. No problems at all."
Hey, luckily they were just self-declared "hunches" and "impressions", with requests to correct them if needed (although preferably not in these exact terms). Learning something new once in a while is what we're here for (I'm assuming you've done your homework). I'll let those other remarks slide, because I don't want to redo that discussion.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,722 Views
Do you know if memory_order_consume boils down to a trailing `MEMBAR #LoadLoad' on a SPARC? Or is it there to mimic `read_barrier_depends()' on Linux?

To the best of my knowledge, memory_order_consume must be a no-op on everything except Alpha, because all data dependencies will be respected by the compiler.
What do you mean by referring to 'read_barrier_depends()'?

There is "example memory model implementation for POWER" by Paul McKenney:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2745.html
at least there load-consume maps to plain 'ld'.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,722 Views
WRT my comment about how a C++0x implementation would be less efficient than a native one was basically referring to how the readers would indicate to the polling logic that they are in a non-quiescent region. I think you could do it with something like:


Ah that.
The main problem with coarse-grained user-space PDR, in which readers may traverse a linked list using only plain load instructions, is that a thread may be preempted just before dereferencing a deferred pointer. If this happens, system-wide memory reclamation stalls, even if automatic *hardware-level* quiescent-period detection is used.
So an application that uses such PDR must be designed so that long-lasting preemptions are impossible (i.e. it does not oversubscribe threads, does not use blocking calls, etc). And if this is done, then one may use coarser-grained protected regions. For example, in a server application the infrastructure can protect the whole IO callback; the worker thread may then access any data structure for reading w/o any overheads. One atomic RMW on a *thread-local* location per IO callback is not too much (~30-50 cycles on modern Intel hardware).
There is another possibility. Assume there is a global epoch counter 'int g_epoch' which is incremented once in a while (when a thread accumulates enough garbage). Worker threads periodically check for epoch promotion: if (g_epoch == thread_local_last_seen_epoch) then do nothing, otherwise do some work regarding epoch promotion. This way the "per IO callback" overhead reduces down to one comparison of cached variables (in the common case where the epoch has not changed).
Yes, this approach requires periodic activity from all threads. But my point is that efficient user-space PDR cannot be built w/o that anyway.
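Roughly like this (just a sketch; g_epoch and the callback-boundary hook are illustrative names only):

[cpp]// Epoch-promotion check on the worker's fast path (sketch). In the common case
// the epoch has not changed and the cost is one comparison of cached values.
#include <atomic>

std::atomic<int> g_epoch(0);                   // bumped occasionally by the GC logic
thread_local int thread_local_last_seen_epoch = 0;

void on_io_callback_boundary() {
    int e = g_epoch.load(std::memory_order_relaxed);
    if (e == thread_local_last_seen_epoch)
        return;                                // fast path: epoch unchanged
    // Slow path: acknowledge the new epoch; a real implementation would also
    // process this thread's deferred-garbage list here.
    thread_local_last_seen_epoch = e;
}
[/cpp]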
Btw, how do you solve this problem in VZOOM? My guess is that you are using something similar to "SMR+RCU", i.e. while acquiring a pointer the thread performs a "critical store-load-check" sequence, but w/o a #StoreLoad in between:
void strong_acquire(void** p)
{
    void* x = *p;
    for (;;)
    {
        size_t h = hash(x);
        thread_local_counters[h] += 1; // STORE
        void* x2 = *p;                 // LOAD
        if (x == x2)                   // CHECK
            return;
        thread_local_counters[h] -= 1;
        x = x2;
    }
}
If so, then this cannot be called "zero overhead", only "virtually zero overhead" :)

Otherwise I do not see how you can resolve the problem of a thread preempted just before dereferencing a pointer. The GC thread does not know what pointer a worker thread is going to dereference in the future, so it cannot reclaim any memory at all.

0 Kudos
Chris_M__Thomasson
New Contributor I
1,462 Views
Quoting - Dmitriy Vyukov

Ah that.
The main problem with coarse-grained user-space PDR in which readers may traverse linked-list using only plain load instructions is that thread may be preempted just before dereferencing deferred pointer. If this will happen system-wide memory reclamation will be stopped. Even if automatic *hardware-level* quiescent period detection will be used.
So application that uses such PDR must be designed in such a way that long-lasting preemptions must be impossible (i.e. does not oversubscribe threads, does not use blocking calls, etc). And if this is done, then one may use coarser-gained protected regions. For example, for server application infrastructure will protect whole IO callback, then worker thread may access any data-structures for reading w/o any overheads. 1 atomic RMW on *thread-local* location per IO callback is not too much (~30-50 cycles on modern Intel hardware).
There is another possibility. Assume there is global epoch counter 'int g_epoch', the counter is increment once in a while (when thread accomulates enough garbage). Worker threads periodically check for epoch promotion - if (g_epoch==thread_local_last_seen_epoch) then do nothing, otherwise do some work regarding epoch promotion. This way "per IO callback" overhead reduces down to 1 comparation of cached variables (in the common case when epoch is not changed).
Yes, this approach requires periodical activity of all threads. But my point is that efficient user-space PDR can not be built w/o that anyway.
Btw, how do you solve this problem in VZOOM? My guess is that you are using something similar to "SMR+RCU", i.e. while acquiring pointer thread makes some "critical store-load-check" sequence but w/o #StoreLoad in between:
void strong_acquire(void** p)
{
void* x = *p;
for (;;)
{
size_t h = hash(x);
thread_local_counters += 1; // STORE
void* x2 = *p; // LOAD
if (x == x2) // CHECK
return;
thread_local_counters -= 1;
x = x2;
}
}
If so that this can not be called "zero overhead", only "virtually zero overhead" :)

Otherwise I do not see how you can resolve problem with thread preemted just before dereferencing pointer. GC thread does not know what pointer worker thread is going to dereference in the future, so he can not reclaim any memory at all.


That's pretty much it. I use automatic epoch detection on the polling side.



0 Kudos
Reply