Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Concurrent stack?

alexandre_hamez
Beginner
5,735 Views
Hello everybody,

I was wondering if there is any plan to add a concurrent stack to tbb? Indeed, a concurrent_queue is almost always the right way to pass some data from one thread to another one, but sometimes we would like to fetch the most recently added data.

Thanks!
0 Kudos
50 Replies
RafSchietekat
Valued Contributor III
820 Views
Thanks for teaching me that atomic operations are more expensive than locking and that locking implementations can do dynamic memory management for free.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
820 Views
Raf_Schietekat:
Thanks for teaching me that atomic operations are more expensive than locking and that locking implementations can do dynamic memory management for free.


Sorry.
I've just been there done that, so to say. There are 'lock-free' algorithms that are better than locking ones wrt performance/scalability. But it's just not mpmc stack.

Btw! Do you aware of implementation of mpmc lock-free stack in Win32's InterlockedSList API?
It really has good performance... but some people are saying that it's way too dirty :)
It's intrusive stack, which always knows own size. On 32 bit machines it uses double-word CAS, on 64-bit machines it uses single-word CAS. On 32-bit machine anchor layout looks like this:
struct stack_anchor_32
{
 void* head;
 unsigned size : 16;
 unsigned aba_counter : 16;
};
On 64-bit machine it looks this way:
struct stack_anchor_64
{
 uint64_t head : 39; // extremely compressed pointer
 unsigned size : 16;
 unsigned aba_counter : 9;

};
To prevent accesses to freed memory they use following trick:

void* pop()
{
 void* result;
 int av_counter = 0;
retry:
 __try
 {
 result = pop_as_usual();
 }
 __except (if_access_violation)
 {
 if (++av_counter < 2000)
 goto retry;
 else
 rethrow_exception_to_user;
 }
 return result;
}

As practice shows, this algorithm has good performance as compared to other stacks, which not use "SEH trick".
But it's a kind of heavy OS dependent, in Unix environment you have to use signals here.


0 Kudos
RafSchietekat
Valued Contributor III
820 Views
Sorry, but I don't follow you anymore. Now you say CAS is slow, now you don't... BTW, catching SIGSEGV is exactly what I'm proposing with __TBB_USE_MAGIC in the latest memory allocator-related "Additions to atomic" patch (rest of the story pending validation of that technique because it will probably be a decidedly non-trivial amount of work).

0 Kudos
Dmitry_Vyukov
Valued Contributor I
820 Views
Raf_Schietekat:

Sorry, but I don't follow you anymore. Now you say CAS is slow, now you don't...


Sometimes I can't follow myself too :)
Assume we have 2 stack implementations. Former is blocking, but uses 1 atomic RMW per operation. Latter is wait-free, but uses 2 atomic RMW per operation. I can say that former is better, and I can say that latter is better I can say that former has some drawbacks as compared to latter, and I can say that latter has some drawbacks as compared to former. And all statements are true :) Which is better in general-case? I don't know. I don't even sure that such general-case wrt multi-threading exists...
But besides run-time properties lock-free algorithms also have inevitably associated much greater complexity of implementation. This fact further complicates problem with determining "the best" implementation.


0 Kudos
Dmitry_Vyukov
Valued Contributor I
820 Views
Raf_Schietekat:

BTW, catching SIGSEGV is exactly what I'm proposing with __TBB_USE_MAGIC in the latest memory allocator-related "Additions to atomic" patch (rest of the story pending validation of that technique because it will probably be a decidedly non-trivial amount of work).


Provided that it has decidedly non-trivial amount of work and works only for stack, it looks like it's not worth doing. Or you are going to somehow make it work for other data structures (queues, lists, maps etc) too? If so, I am very curious how exactly you are going to make it work for other data structures too. Can you elaborate, please?


0 Kudos
RafSchietekat
Valued Contributor III
820 Views
Just have a look at the latest "Additions to atomic" patch. I'm looking for somebody to validate or dismiss test_malloc_compliance.exe as representative of a real program (I have not yet taken the time myself): if it can be reproduced that __TBB_USE_MAGIC has no significant overhead in a real program (anyone from Intel?), I can be more confident not to be wasting my time tying up the loose ends, which would save a not altogether insignificant amount of real and/or virtual memory on top of any benefit from the other proposed changes.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
820 Views
Raf_Schietekat:
Just have a look at the latest "Additions to atomic" patch. I'm looking for somebody to validate or dismiss test_malloc_compliance.exe as representative of a real program (I have not yet taken the time myself): if it can be reproduced that __TBB_USE_MAGIC has no significant overhead in a real program (anyone from Intel?), I can be more confident not to be wasting my time tying up the loose ends, which would save a not altogether insignificant amount of real and/or virtual memory on top of any benefit from the other proposed changes.


test_malloc_compliance looks more like synthetic benchmark, not as representative of a real program.
But if component is better in synthetic benchmark, then with substantial probability it will be better in real program too. But for memory allocators there are such things as fragmentation. I don't know.
If you want real programs then I think you can take some of the standard benchmarks for memory allocators. I think a number of such benchmarks must be out there. Or you can take benchmarks used for other allocators. Every academic paper on allocator contains description of mini-benchmark. For example:
http://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf

BUT I still don't see how those magic numbers can help lock-free algorithms. What it's all about?

0 Kudos
Alexey-Kukanov
Employee
820 Views

Raf_Schietekat:
I'm looking for somebody to validate or dismiss test_malloc_compliance.exe as representative of a real program

Sorry, I missed this request during initial reading. test_malloc_compliance is just a test; it does not represent a real program, and not even a synthetical microbenchmark.

0 Kudos
RafSchietekat
Valued Contributor III
820 Views
"test_malloc_compliance looks more like synthetic benchmark, not as representative of a real program." It is obvious that a real program will behave very differently, but I don't see any performance penalty whatsoever, which is rather surprising.

"BUT I still don't see how those magic numbers can help lock-free algorithms. What it's all about?" Saving memory.

Oh well, if not right now, soon enough somebody will be interested enough to steal this task from me.

0 Kudos
Reply