Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Question on tbb::concurrent_queue

ksgokul
Beginner
1,975 Views
Hi,
I have a requirement where in a thread has a lot of requests( in hundreds ) to be enqueued. I was just wondering, if i can form an array or something and enqueue it directly, it should be faster than touching the shared resource again and again. But i don't find an API like that in the reference documentation. Even if i enqueue it as an array, it should get dequeued element wise.
Can someone explain what would be the best approach in solving a problem like this?
Thanks in advance,
Gokul.
0 Kudos
28 Replies
Dmitry_Vyukov
Valued Contributor I
1,645 Views
How many producers and consumers do you have?
0 Kudos
ksgokul
Beginner
1,645 Views
Producers and consumers will be in the range of 5-20.
Gokul.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,645 Views
> if i can form an array or something and enqueue it directly, it should be faster than touching the shared resource again and again.

Definitely. Batch enqueue is a useful feature in some situations.
AFAICT, TBB queue does not support it currently. I think that TBB queue can be patched to support batch enqueue.
However real advantage of batch enqueue can be revealed in node-based queues, where a thread can prepare a chain of linked nodes, and pass pointer to head and tail of the chain to a queue. I think it's quite easy to modify my node-based queue algorithm this way:
http://groups.google.com/group/lock-free/browse_frm/thread/c8e3201da4a6a300

Ah, OK, I can do it right now:
[cpp]void enqueue(node_t* head, node_t* tail)
{
tail->next_ = 0;
node_t** prev = (node_t**)
_InterlockedExchange((long*)&tail_, (long)tail);
prev[0] = head;
}


[/cpp]
So, 100 or 1000 nodes enqueued with a single InterlockedExchange. Try to beat it! ;)

0 Kudos
ksgokul
Beginner
1,645 Views
>> However real advantage of batch enqueue can be revealed in node-based queues
Can you explain what is the difference between batch node-based queues and tbb::concurrent_queue?
Thanks,
Gokul.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,645 Views
Quoting ksgokul
>> However real advantage of batch enqueue can be revealed in node-based queues
Can you explain what is the difference between batch node-based queues and tbb::concurrent_queue?

Batch *enqueue* (not batch queue). Well, by batch enqueue I just mean a situation when you pass a batch of nodes to a queue for enqeueue, and the queue somehow optimizes it as compared to a series of single-node enqueues.

0 Kudos
ksgokul
Beginner
1,645 Views
Sorry my question was
"Can you explain what is the difference between your node-based queues and tbb::concurrent_queue?" I read in a post, that tbb::concurrent_queue can take more than 1 producer at a same time?
Thanks,
Gokul.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,645 Views
Quoting ksgokul
Sorry my question was
"Can you explain what is the difference between your node-based queues and tbb::concurrent_queue?" I read in a post, that tbb::concurrent_queue can take more than 1 producer at a same time?
Thanks,
Gokul.

Tbb queue is array based. This means that in order to insert a batch of N elements, you need to do N object copies and hold the mutex all that time. O(N). Not very cool for a batch insert.

That my queue is node based. This means that in order to insert a batch of N elements, you need to update 2 pointers using 1 atomic exchange and 1 plain store. O(1).

0 Kudos
aminer10
Novice
1,645 Views


>That my queue is node based. This means that in order to insert a batch of N elements, you need to >update 2 pointers using 1 atomic exchange and 1 plain store. O(1).

But 'before' inserting the batch ,don't you need to organize itin a linklist form ?

And if you organize it in a link list form - and insert many nodes -,
that still will take O(N) insert... no ?

>[...] and hold the mutex all that time

That's true.


Amine Moulay Ramdane.


0 Kudos
aminer10
Novice
1,645 Views




>Tbb queue is array based...[...]
>That my queue is node based.
>and hold the mutex all that time

Not all the time, you can insert them one by one in my
lockfreeParallelQueue or lockfree_mpmc...

But innode basedqueue, you still have to access before the FREELIST or the
MEMORY MANAGERand you have to use aCASfor that, no ? and that will
add to the 'contention' also, so it's the asmeas the array based, think about it more !


Sincerely,
Amine Moulay Ramdane.

0 Kudos
aminer10
Novice
1,645 Views


But innode basedqueue, you still have to access before the FREELIST or the
MEMORY MANAGERand you have to use aCASfor that, no ? and that will
add to the 'contention' also, so it's the sameas the array based, think about it more !


Sincerely,
Amine Moulay Ramdane
0 Kudos
aminer10
Novice
1,645 Views

Hello,

I know that you are smart Dmitry...

So follow with me please...

If you win5$with theleft hand , and you lose 5$ withthe righthand ...


You have to see the big picture to be able to judge, not just the left hand...


So read carefuly.


But innode basedqueue, you still have to access before the FREELIST or the
MEMORY MANAGERand you have to use aCASfor that, no ? and that will
add to the 'CONTENTION' also, so it's the sameas the array based, think about it more !


So the left hand here is the node based queue , and the right hand is the FREELIST or
the MEMORY MANAGER.

So, to be able to judge you have to seejudge from the BIG picture not just the left hand...

>That my queue is node based. This means that in order to insert a batch of N elements, you need to >update 2 pointers using 1 atomic exchange and 1 plain store. O(1).

But 'before' inserting the batch ,don't you need to organize itin a linklist form ?

And if you organize it in a link list form - and insert many nodes -,
that still will take O(N) insert... no ?




Sincerely,
Amine Moulay Ramdane
0 Kudos
aminer10
Novice
1,645 Views

Hello,


I know that you are smart Dmitry...


As i said , if you win5$with theleft hand , and you lose 5$ withthe righthand,


You have to see the big picture to be able to judge, not just the left hand...


So if you take a look for example at my lock-free ParallelQueue...

Youwill notice thati havemade the following affirmation:


"Because my lock-free ParallelQueue algorithm is using a hash based method and lock striping - with multiple lock-free queues working as one lock-free queue, see the source code - to reduce more the contention and is using just a LockedInc() , so, i am REDUCING the duration for which locks are held AND REDUCING the frequency with which locks are requested, hence i am REDUCING A LOT the contention, so it's very efficient.

So, my ParallelQueue algorithm is lock-free and is using lock striping
and also avoids false sharing, it's why it scored 17 millions of pop() transactions
per second (see the graphs bellow)... better than flqueue and RingBuffer [...]."


But there is still an important information that i forgot to put on the page
(to be able to see the big picture not just the left hand !),

If you take a look at lockfree_mpmc , it's using a

LockedIncLong(temp) +
CAS(tail[0],lasttail,newtemp) on the push() side..


But if you take a look at my lock-free Parallelqueue, i amonly
a single lockinc()

etc.


Welcome:

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm



Sincerely,
Amine Moulay Ramdane.


0 Kudos
aminer10
Novice
1,645 Views

Hello,


As you have noticed Dmitry, my Thread Pool engine is also using my lock-free
ParallelQueue to be able to SCALE better in the future...

http://pages.videotron.com/aminer/threadpool.htm


And as you have noticed , withmy Object Pascal Thread Pool Engine
i have implemented my Parallel Compression Library , my Parallel Sort Library etc.


Give me a smaller and efficientthing and i will design a better and bigger thing !

that was my idea !

It's amazing to allthe things that we can dowith just this efficient Object Pascal
Thread Pool engine that i have designed !

Not too bad! right Dmitry ?

Welcome:

http://pages.videotron.com/aminer/



Sincerely,
Amine Moulay Ramdane.


0 Kudos
ksgokul
Beginner
1,645 Views
>>>[...] and hold the mutex all that time

>That's true.
Actually i am more concerned about this.
Its actually O inserts without holding the mutex and O[1] with the mutex - for a batch based insert
Its actually O inserts with the mutex - if i make the batch insert get inserted one by one.
I think if we take the contention factor together with the Order Of Operations, we might get a fair picture.
Gokul.
0 Kudos
aminer10
Novice
1,645 Views


Hello,

If you ask me a question like:

"Your Thread Pool Engine and lockfree ParallelQueue and lockfree_mpmc,
are working with Windows, Linux , Solaris and MacOSX on x86 , why
not port them also to Sparc Solaris station also ?"

My answer:

That's very easy to port my Thread Pool Engine , lockfree ParallelQueue
and lockfree_mpmc to Sparc Solaris multicores station etc and i will do
it in the near future...

and FreePascal 2.4.2 is already here:

http://www.freepascal.org/download.var

http://www.freepascal.org/

Sincerely,
Amine Moulay Ramdane.


0 Kudos
aminer10
Novice
1,645 Views

>Actually i am more concerned about this.
>Its actually O inserts without holding the mutex and O[1] with the mutex
>- for a batch based insert Its actually O inserts with the mutex - if i make
>the batch insert get inserted one by one.
>I think if we take the contention factor together with the Order Of Operations,
>we might get a fair picture.
>Gokul

You still didn't get the BIG picture Gokul..

In the lock-free Array based queue there is *NO* CONTENTION on the locks
insidea freelist ora memory manager.

In the lock-free node-based one , there is a contention on the locks inside
the freelistor the memory manager...



Sincerely,
Amine Moulay Ramdane.


0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,645 Views
> But innode basedqueue, you still have to access before the FREELIST or the
MEMORY MANAGERand you have to use aCASfor that, no ? and that will
add to the 'CONTENTION' also, so it's the sameas the array based, think about it more !

I'm not insane to use centralized memory manager in a parallel program (most likely I won't use a centralized multi-producer/multi-consumer queue too, but that's aside for a moment). So of course memory comes from a local cache. No contention here.


> And if you organize it in a link list form - and insert many nodes -
that still will take O(N) insert... no ?

It's local operation conducted not under mutex. Moreover it should be combined with node filling, so it's basically zero overhead.


> Because my lock-free ParallelQueue algorithm is using a hash basedmethod and lock striping

Well, you can use it with my queue too... However the better design would be to not use a centralized queue in the first place.
0 Kudos
aminer10
Novice
1,645 Views



Dmitriy wrote
>> Because my lock-free ParallelQueue algorithm is using a hash basedmethod and lock striping

>Well, you can use it with my queue too...


Right andkeep up the good work Dmitriy...


Sincerely,
Amine.




0 Kudos
aminer10
Novice
1,645 Views
I wrote:
>>>Because my lock-free ParallelQueue algorithm is using a hash basedmethod and lock striping

>>Well, you can use it with my queue too...

>Right andkeep up the good work Dmitriy..


I will add this Dmitriy..

Even if my lock-free ParallelQueue - that i am using inside my Object Pascal Thread Pool
Engine - doesn't scale beyong 100 cores or 200 cores ... you can still run many process
applications or threads in parallel that will use my Thread Pool Engine(s)and every process
application or threads will scale to 100 cores. Example my a Parallel Threadthat usemy
Parallel SortLibrary will scale to 100 cores and another thread or process that uses my
Parallel CompressionLibrary will scale to 100 cores etc. etc. so that yourthousand(s) cores
system will still besmartly UTILIZED, that's still fun ! rigth Dmitriy ?



Sincerely,
Amine Moulay Ramdane.



0 Kudos
Dmitry_Vyukov
Valued Contributor I
1,531 Views
Quoting aminer10
you can still run many process
applications or threads in parallel that will use my Thread Pool Engine(s)and every process
application or threads will scale to 100 cores.

That's pretty cool!

Unfortunately, I can't say the same about my code, I don't have a 100-core machine for tests...

0 Kudos
Reply