Yes, concurrent_queue does use a spin lock of sorts to manage access to the queue, but the story is more complicated than that. The concurrent_queue push and pop use calls to SpinwaitUntilEq and SpinwaitWhileEq to control access, but if you dig into them, you'll see that they actually use an exponential backoff algorithm that, past a threshold, requests a yield of the thread: inside that threshold the code spins on a pause instruction; outside it, the code yields, via sched_yield on Linux and macOS, and Sleep(0) on Windows.
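To make that concrete, here is a minimal sketch of that style of spin-wait. This is not the actual TBB source; the LOOPS_BEFORE_YIELD threshold and the machine_pause helper are illustrative stand-ins for whatever the library really uses.

// Minimal sketch of an exponential-backoff spin wait (illustrative only).
#include <thread>

#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)
  #include <immintrin.h>
  inline void machine_pause() { _mm_pause(); }   // x86 pause instruction
#else
  inline void machine_pause() {}                 // no-op elsewhere
#endif

template <typename Predicate>
void spin_wait_until(Predicate ready) {
    const int LOOPS_BEFORE_YIELD = 16;           // illustrative threshold
    int backoff = 1;
    while (!ready()) {
        if (backoff <= LOOPS_BEFORE_YIELD) {
            for (int i = 0; i < backoff; ++i)
                machine_pause();                 // spin on a pause instruction
            backoff *= 2;                        // exponential backoff
        } else {
            // Past the threshold, give up the time slice: this is the
            // sched_yield / Sleep(0) step discussed above.
            std::this_thread::yield();
        }
    }
}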
The Sleep(0) call documentation I found says:
A value of zero causes the thread to relinquish the remainder of its time slice to any other thread of equal priority that is ready to run. If there are no other threads of equal priority ready to run, the function returns immediately, and the thread continues execution.
The documentation for sched_yield gives very similar information. If there are no other threads ready for execution, this might look like spinning.
The intent of the concurrent containers is to operate well in a fine-grained parallel environment, so spin locks are the preferred method of operation. Backing off into the thread yield is intended to ensure TBB cooperates with other running threads. Pop is not meant to be used as a long-duration blocking call, hence this caution in the reference manual:
If the push or pop operations block, they block using user-space locks, which can waste processor resources when the blocking time is long. Class concurrent_queue is designed for situations where the blocking time is typically short relative to the rest of the application time.
If you need a low-contention queue, you might be better served by a lock around a std::queue (presuming that the lock you choose doesn't also spin). Under low contention, misses on the lock are much less likely and the heavier-weight lock is much less of an issue.
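As a rough sketch of that alternative, using a std::mutex (which blocks in the kernel rather than spinning) plus a condition variable for the blocking pop:

#include <condition_variable>
#include <mutex>
#include <queue>

// Sketch only: a plain queue guarded by a non-spinning lock.
template <typename T>
class locked_queue {
    std::queue<T> items_;
    std::mutex mutex_;                    // blocking (non-spinning) lock
    std::condition_variable not_empty_;
public:
    void push(const T& value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(value);
        }
        not_empty_.notify_one();          // wake one waiting consumer
    }
    T pop() {                             // blocks without user-space spinning
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        T value = items_.front();
        items_.pop();
        return value;
    }
};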
I'm glad that I could help. Regarding sched_yield, I haven't had time to test it myself, so I'll take your word for now about the latency of that call. However, the 18.5-second runtime with the pop() versus 9.5 seconds has at least one other explanation than the latency of sched_yield: you could have one thread busy doing multiplies and the other "blocked" on the pop and so unable to steal multiply work from the first thread. If you're running on a two-core machine, try this simple experiment: in your task_scheduler_init call, use task_scheduler_init init(3); i.e., create a thread pool with three threads rather than two. If one thread is blocked on the pop, you should still have two threads left to do the multiplies. Any degradation from the 9.5-second execution time then should be due to pop spin overhead rather than to pulling one thread from the multiplication pool.
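Something along these lines (sketch only; adapt it to however your program currently sets up the scheduler, using the task_scheduler_init API from the TBB version discussed here):

#include "tbb/task_scheduler_init.h"

int main() {
    // Default construction picks one thread per core; on a two-core box,
    // explicitly ask for three so a thread parked in pop() doesn't cost
    // you a multiply worker.
    tbb::task_scheduler_init init(3);
    // ... launch the pop() thread and the multiplication work as before ...
    return 0;
}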
Regarding the implementation of the two-queue solution, yes, it can be tricky. You'd want to make sure to keep latency on the high-contention queue to a minimum WITHOUT starving the tasks that are sitting on the low-contention queue.
Usage: ./qtest threads [iterations]
threads: Each letter causes a thread to start:
  w: Worker thread, does multiplication in a loop
  q: Waits on a pop() from a concurrent_queue
  m: Waits on a condition variable
iterations: How many iterations multiplication loops should do
With any worker threads, runs until they are all complete.
Otherwise, run forever.
For example, to start one concurrent_queue pop thread and two worker threads, run:
./qtest qww
To start just two worker threads, run:
./qtest ww
You can adjust the number of iterations (which defaults to a billion) with a second parameter:
./qtest qww 10000000000
I've been using the OS time command to compare runtimes and CPU usage, and strace -c to count syscalls. For example:
$ strace -c -f time ./qtest qww
The results actually seem to change a bit depending on options to strace and whether strace or time is run first, but the timing is consistent on simply running ./qtest qww vs. ./qtest ww and watching a wall clock.
...
18.62user 2.67system 0:18.38elapsed 115%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+347minor)pagefaults 0swaps
Process 23223 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
49.78 2.676167 2676167 1 wait4
49.70 2.672167 167010 16 futex
0.52 0.028094 0 1125143 sched_yield
...
To #7: if you snooze, you lose (performance).
To #8: it's a tbb_thread for interacting with the database; it does not have anything else to do.
Can someone explain to me how pop differs from pop_if_present + sleep in hardware terms? I was thinking that pop would also make the thread sleep and periodically wake up to check the queue. Is that not true?
Thanks in Advance,
Gokul.
Well, I don't necessarily agree with the current implementation either, but I have not given it much attention yet. Locking well (being reactive without running the whole system into the ground) seems to be a fine art, with little use for fixed-length naps.
Can someone explain to me how pop differs from pop_if_present + sleep in hardware terms? I was thinking that pop would also make the thread sleep and periodically wake up to check the queue. Is that not true?
Pop will block the thread until a new item arrives. This has two advantages:
1. No unnecessary waking and rechecking, which senselessly eats CPU time.
2. Better responsiveness: the OS will be able to schedule the thread faster when a new item arrives.
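For contrast, a pop_if_present + sleep loop amounts to polling, roughly like this sketch (the 1 ms nap is arbitrary; pop_if_present is assumed to return false when the queue is empty, as in the TBB version discussed in this thread):

#include <chrono>
#include <thread>
#include "tbb/concurrent_queue.h"

// Polling consumer: wakes up periodically just to look at the queue.
// Each wake-up that finds the queue empty is wasted work, and a newly
// pushed item can sit unnoticed for up to a full nap.
template <typename T>
T polling_pop(tbb::concurrent_queue<T>& queue) {
    T item;
    while (!queue.pop_if_present(item))
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    return item;
}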
In my opinion, it would be better to completely separate the queue and the blocking logic, i.e. there would be a nonblocking_concurrent_queue and an eventcount (or gate, if you prefer). This would allow the blocking logic to be reused with other containers, the queue to be reused with different implementations of the blocking logic, and blocking on several queues simultaneously (something which is not possible now).
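A rough sketch of what that separation could look like, with a C++20 counting semaphore standing in for the eventcount/gate and pop_if_present on the existing queue standing in for a nonblocking_concurrent_queue (neither nonblocking_concurrent_queue nor gated_queue exists in TBB; this is just an illustration):

#include <semaphore>                      // C++20, used here as the "gate"
#include "tbb/concurrent_queue.h"

template <typename T>
class gated_queue {
    tbb::concurrent_queue<T> queue_;      // nonblocking container
    std::counting_semaphore<> items_{0};  // blocking logic lives here
public:
    void push(const T& value) {
        queue_.push(value);
        items_.release();                 // signal: one more item available
    }
    T pop() {
        items_.acquire();                 // block until something was pushed
        T value;
        queue_.pop_if_present(value);     // succeeds: the acquire reserved an item
        return value;
    }
};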
In my opinion, it would be better to completely separate the queue and the blocking logic, i.e. there would be a nonblocking_concurrent_queue and an eventcount (or gate, if you prefer). This would allow the blocking logic to be reused with other containers, the queue to be reused with different implementations of the blocking logic, and blocking on several queues simultaneously (something which is not possible now).
I agree that letting users pass in the blocking strategy as an object is a more flexible design than letting them make a selection among fixed blocking alternatives. But regardless of how it's implemented, it would be a good thing if concurrent_queue didn't have just one blocking strategy.
Pop will block the thread until a new item arrives. This has two advantages:
1. No unnecessary waking and rechecking, which senselessly eats CPU time.
2. Better responsiveness: the OS will be able to schedule the thread faster when a new item arrives.
Actually I have trouble understanding "Pop will block the thread until a new item arrives". Are you trying to say that there is a hardware interrupt that wakes the thread? To my limited knowledge, the only ways this could be done are by interrupt or by polling. You have already categorically rejected the option of polling, so an interrupt should be the only way. Is there a hardware interrupt to wake up a software thread at the occurrence of an event? I am asking this because of my ignorance in analyzing the source code...
Actually I have trouble understanding "Pop will block the thread until a new item arrives". Are you trying to say that there is a hardware interrupt that wakes the thread? To my limited knowledge, the only ways this could be done are by interrupt or by polling. You have already categorically rejected the option of polling, so an interrupt should be the only way. Is there a hardware interrupt to wake up a software thread at the occurrence of an event? I am asking this because of my ignorance in analyzing the source code...
The consumer thread blocks on a kernel synchronization object (a semaphore or an event). When the producer produces a new item, it signals the kernel synchronization object, and the OS puts the consumer thread back on the OS runnable queue. Eventually the consumer thread will be scheduled for execution. Something like this.
All interaction is done via the kernel synchronization object. I think that on a uniprocessor machine there is really no need for hardware interrupts. On a multiprocessor machine an IPI (inter-processor interrupt) may possibly be involved, if there is an idle processor at the moment the thread is put back on the OS runnable queue. But user-space code (TBB) never works with or mentions hardware interrupts.
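In other words, roughly this sequence, sketched here with a POSIX semaphore as the kernel synchronization object (Linux-flavored, illustration only; build with -pthread):

#include <cstdio>
#include <semaphore.h>
#include <thread>

// Consumer blocks on the kernel object; producer signals it; the OS wakes
// the consumer. No hardware interrupts appear anywhere in user code.
sem_t item_available;
int shared_item = 0;

void consumer() {
    sem_wait(&item_available);           // thread is taken off the CPU here
    std::printf("consumer woke up, item = %d\n", shared_item);
}

void producer() {
    shared_item = 42;                    // "produce" something
    sem_post(&item_available);           // signal: OS moves the consumer back
                                         // to the runnable queue
}

int main() {
    sem_init(&item_available, 0, 0);
    std::thread c(consumer), p(producer);
    p.join();
    c.join();
    sem_destroy(&item_available);
    return 0;
}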
I agree that letting users pass in the blocking strategy as an object is a more flexible design than letting them make a selection among fixed blocking alternatives. But regardless of how it's implemented, it would be a good thing if concurrent_queue didn't have just one blocking strategy.
Actually I was talking about full separation of queue and blocking logic, so that it would be possible to wait on, for example, two queues. But a blocking queue could still be provided as a wrapper around them, as a reasonable compromise between ease of use and flexibility (and also for backward compatibility, of course).