I'm trying out TBB for an application I'd like to add concurrency to. The application has a database component, so an easy way to take advantage of a second core is to write the database requests into a queue, then have another thread pull them out and interact with the database.
This seems to work well and gives a pretty good speedup. However, I've noticed that when there isn't any database work to do, the threads waiting for database requests hog the CPU. The basic loop looks like this (I've edited it a bit, so please excuse typos):
tbb::concurrent_queue<DbRequest *> *reqQ;
while (true) {
    std::auto_ptr<DbRequest> req;
    {
        DbRequest *reqPtr;
        // This will block.
        reqQ->pop(reqPtr);
        req.reset(reqPtr);
    }
    if (!req.get()) {
        // A NULL request indicates the manager is done.
        break;
    }
    processRequest(req.get());
}
Looking at the TBB code a bit, it seems that tbb::concurrent_queue::pop uses a spinlock, so it's very inefficient when idle. Is that correct? Is there anything to be done about it, or, for a queue that isn't highly contended, should I just put a lock around a std::queue?
Thanks!
----Scott.
Yes, concurrent_queue does use a spin lock of sorts to manage access to the queue, but the story is more complicated than that. The concurrent_queue push and pop use calls to SpinwaitUntilEq and SpinwaitWhileEq to control access, but if you dig into them, you'll see that they actually run an exponential backoff algorithm that, past a threshold, requests a yield on the thread: inside that threshold the code spins on a pause instruction; outside it, the code yields, via sched_yield on Linux and Mac OS, or Sleep(0) on Windows.
The Sleep(0) documentation I found says:
A value of zero causes the thread to relinquish the remainder of its time slice to any other thread of equal priority that is ready to run. If there are no other threads of equal priority ready to run, the function returns immediately, and the thread continues execution.
The documentation for sched_yield gives very similar information. If there are no other threads ready for execution, this might look like spinning.
The intent of the concurrent containers is to operate well in a fine-grained parallel environment, so spin locks are the preferred method of operation. Backing off into a thread yield is intended to ensure TBB cooperates with other running threads. Pop is not meant to be used as a long-duration blocking call, hence this caution in the reference manual:
If the push or pop operations block, they block using user-space locks, which can waste processor resources when the blocking time is long. Class concurrent_queue is designed for situations where the blocking time is typically short relative to the rest of the application time.
If you need a low-contention queue, you might be better served by a lock around a std::queue (presuming the lock you choose doesn't also spin). Under low contention, misses on the lock are much less likely, and its heavier weight is much less of an issue.
Looking at the docs again, I see that warning. As you suggested, I tried re-implementing the queue as an STL queue with a boost::mutex (which uses pthreads mutexes underneath). This is much better about giving up the CPU when idle, but Intel's concurrent_queue is quite a bit faster under heavy load.
My application is a server, and I don't know at compile time if it's going to be heavily loaded on a dedicated machine, or lightly loaded on a machine with many other tasks, so neither queue is best under all circumstances. I guess I'll make it a compile-time option to pick which queue implementation to use.
It would be very convenient if the TBB concurrent_queue could back off into an OS mutex if it's idle for a while, providing the best of both worlds. For me at least, it would be very nice to just use it any time I needed a concurrent queue, instead of having to evaluate performance tradeoffs.
Thanks for your help!
---Scott.
A quick comment on sched_yield: Although using sched_yield() is certainly better than a tight busy loop it still consumes resources, especially if it is called many times. While this is by no means a heavyweight operation, it still requires at least a thread context switch and a system call.
For example, I wrote a simple program with two threads to do multiplication in a loop (I have a 2-core machine, so the two threads with work should keep both cores busy). I had each thread do a billion multiply instructions, and that took about 9.5 seconds to complete. If I add in another thread which is just doing a single pop() on an empty queue, it instead takes about 18.5 seconds for the multiplies to complete, nearly twice as long. During this time the spinlock calls sched_yield() over a million times, nearly 60,000 times a second!
By contrast, if I wait on a condition variable, the slowdown is too small to reliably measure, and it calls futex() 20 times.
Unfortunately I have just enough expertise in concurrency to know that implementing this will be much harder to get right than it looks at first glance, so I'll probably put it off until I'm convinced a compile-time option isn't workable.
Thanks again for your comments!
I'm glad that I could help. Regarding sched_yield, I haven't had time to test it myself, so I'll take your word for now about the latency of that call. However, the 18.5-second runtime with the pop() versus 9.5 seconds has at least one other explanation than the latency of sched_yield. You could have one thread busy doing multiplies and the other "blocked" on the pop, and so unable to steal multiply work from the first thread. If you're running on a two-core machine, try this simple experiment: in your task_scheduler_init call, use task_scheduler_init init(3); i.e., create a thread pool with three threads rather than two. If one thread is blocked on the pop, you still have two threads left to do the multiplies. Any degradation from the 9.5-second execution time then should be due to pop spin overhead rather than pulling one thread from the multiplication pool.
Regarding the implementation of the two queue solution, yes it can be tricky. You'd want to make sure to keep latency on the high contention queue to a minimum WITHOUT starving the tasks that are sitting on the low contention queue.
I'm not using TBB's scheduler; I'm explicitly creating boost::thread objects, and I can see with strace that the right number of threads are being created.
Here is the code I'm using; it's just a few dozen straightforward lines. It uses the Boost threading libraries, which are available from boost.org if you don't have them handy.
Run with no args for usage:
Usage: ./qtest threads [iterations]
  threads: Each letter causes a thread to start:
    w: Worker thread, does multiplication in a loop
    q: Waits on a pop() from a concurrent_queue
    m: Waits on a condition variable
  iterations: How many iterations the multiplication loops should do
With any worker threads, runs until they are all complete.
Otherwise, runs forever.
For example, to start one concurrent_queue pop thread and two worker threads, run:
./qtest qww
To start just two worker threads, run:
./qtest ww
You can adjust the number of iterations (which defaults to a billion) with a second parameter:
./qtest qww 10000000000
I've been using the OS time command to compare runtimes and CPU usage, and strace -c to count syscalls. For example:
$ strace -c -f time ./qtest qww
The results actually seem to change a bit depending on options to strace and whether strace or time is run first, but the timing is consistent when simply running ./qtest qww vs. ./qtest ww and watching a wall clock.
...
18.62user 2.67system 0:18.38elapsed 115%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+347minor)pagefaults 0swaps
Process 23223 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
49.78 2.676167 2676167 1 wait4
49.70 2.672167 167010 16 futex
0.52 0.028094 0 1125143 sched_yield
...
// File: qtest.cpp
// Compile: g++ -Wall -lboost_thread -ltbb qtest.cpp -o qtest
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <list>
#include <memory>
#include <boost/thread.hpp>
#include <tbb/concurrent_queue.h>
using std::cerr;
using std::endl;
using std::cout;
using std::auto_ptr;
void usage(const char *prog) {
    cerr << "Usage: " << prog << " threads [iterations]\n"
         << "  threads: Each letter causes a thread to start:\n"
         << "    w: Worker thread, does multiplication in a loop\n"
         << "    q: Waits on a pop() from a concurrent_queue\n"
         << "    m: Waits on a condition variable\n"
         << "  iterations: How many iterations multiplication loops should do\n"
         << "With any worker threads, runs until they are all complete.\n"
         << "Otherwise, runs forever.\n"
         << endl;
}
/*! Contains a function operator that will do a pop on a concurrent_queue in a loop,
 * printing every time it pops.
 */
struct PopQ {
    void operator()() {
        auto_ptr<tbb::concurrent_queue<int> > q(new tbb::concurrent_queue<int>());
        while (1) {
            int i;
            q->pop(i);
            cout << "Popped!" << endl;
        }
    }
};
/*! Contains a function operator that will do a wait on a condition variable in a loop,
 * printing every time it wakes up.
 */
struct CondWait {
    void operator()() {
        auto_ptr<boost::mutex> mutex(new boost::mutex());
        auto_ptr<boost::condition> cond(new boost::condition());
        while (1) {
            boost::mutex::scoped_lock lock(*mutex);
            cond->wait(lock);
            cout << "Woke up!" << endl;
        }
    }
};
/*! Do a series of simple multiplies in a loop.
 */
struct TimeWaster
{
    /*! count says how many times to loop. */
    TimeWaster(long long _count) : count(_count) { }
    void operator()()
    {
        long long num = 1;
        for (long long i = 0; i < count; ++i) {
            num *= 3; // keep the core busy with multiplies
        }
    }
    long long count;
};
int main(int argc, char *argv[])
{
    long long loopSize = 1000000000LL; // Default: 1 billion
    switch (argc) {
    default:
        usage(argv[0]);
        exit(1);
    case 3:
        loopSize = atoll(argv[2]);
        /* Fall through */
    case 2:
        /* OK, expected */
        ;
    }
    std::list<boost::thread *> workThreads;
    std::list<boost::thread *> waitThreads;
    // Parse thread string
    for (unsigned i = 0; i < strlen(argv[1]); ++i) {
        switch (argv[1][i]) {
        case 'q':
            // concurrent_queue waiter thread
            waitThreads.push_back(new boost::thread(PopQ()));
            break;
        case 'm':
            // Condition variable waiter thread
            waitThreads.push_back(new boost::thread(CondWait()));
            break;
        case 'w':
            // Worker thread
            workThreads.push_back(new boost::thread(TimeWaster(loopSize)));
            break;
        }
    }
    if (workThreads.empty()) {
        // This should wait forever.
        (*waitThreads.begin())->join();
    } else {
        // Wait until all worker threads are done
        while (!workThreads.empty()) {
            cout << "Waiting for worker thread to finish" << endl;
            (*workThreads.begin())->join();
            delete *workThreads.begin();
            workThreads.pop_front();
        }
    }
    return 0;
}
To #7: if you snooze, you lose (performance).
To #8: it's a tbb_thread for interacting with the database; it doesn't have anything else to do.
Can someone explain how pop differs from pop_if_present + sleep in hardware terms? I was thinking that pop would also make the thread sleep and check the queue periodically by waking up. Is that not true?
Thanks in Advance,
Gokul.
Well, I don't necessarily agree with the current implementation either, but I have not given it much attention yet. Locking well (being reactive without running the whole system into the ground) seems to be a fine art, with little use for fixed-length naps.
Can someone explain how pop differs from pop_if_present + sleep in hardware terms? I was thinking that pop would also make the thread sleep and check the queue periodically by waking up. Is that not true?
Pop blocks the thread until a new item arrives. This has two advantages:
1. No unnecessary waking and rechecking, which senselessly eats CPU time.
2. Better responsiveness: the OS can schedule the thread faster when a new item arrives.
In my opinion, it would be better to completely separate the queue and the blocking logic. I.e., there would be a nonblocking_concurrent_queue and an eventcount (or gate, if you prefer). This would allow reusing the blocking logic with other containers, reusing the queue with different implementations of blocking logic, and blocking on several queues simultaneously (something that is not possible now).
In my opinion, it would be better to completely separate the queue and the blocking logic. I.e., there would be a nonblocking_concurrent_queue and an eventcount (or gate, if you prefer). This would allow reusing the blocking logic with other containers, reusing the queue with different implementations of blocking logic, and blocking on several queues simultaneously (something that is not possible now).
I agree that letting users pass in the blocking strategy as an object is a more flexible design than letting them make a selection among fixed blocking alternatives. But regardless of how it's implemented, it would be a good thing if concurrent_queue didn't have just one blocking strategy.
Pop blocks the thread until a new item arrives. This has two advantages:
1. No unnecessary waking and rechecking, which senselessly eats CPU time.
2. Better responsiveness: the OS can schedule the thread faster when a new item arrives.
Actually, I have trouble understanding "Pop blocks the thread until a new item arrives." Are you saying this is a hardware interrupt that invokes the thread? To my limited knowledge, the only ways this can be done are by interrupt or by polling. You have already categorically rejected the option of polling, so an interrupt should be the only way. Is there a hardware interrupt to wake up a software thread at the occurrence of an event? I'm asking because of my ignorance in analyzing the source code...
Actually, I have trouble understanding "Pop blocks the thread until a new item arrives." Are you saying this is a hardware interrupt that invokes the thread? To my limited knowledge, the only ways this can be done are by interrupt or by polling. You have already categorically rejected the option of polling, so an interrupt should be the only way. Is there a hardware interrupt to wake up a software thread at the occurrence of an event? I'm asking because of my ignorance in analyzing the source code...
The consumer thread blocks on a kernel synchronization object (a semaphore or event). When the producer produces a new item, it signals the kernel synchronization object. The OS puts the consumer thread back on the runnable queue, and eventually the consumer thread is scheduled for execution. Something like that.
All interaction is done via the kernel synchronization object. I think that on a uniprocessor machine there is really no need for hardware interrupts. On a multiprocessor machine, IPIs (inter-processor interrupts) may possibly be involved, if there is an idle processor at the time the thread is put back on the runnable queue. But user-space code (TBB) never works with, or mentions, hardware interrupts.
I agree that letting users pass in the blocking strategy as an object is a more flexible design than letting them make a selection among fixed blocking alternatives. But regardless of how it's implemented, it would be a good thing if concurrent_queue didn't have just one blocking strategy.
Actually, I was talking about full separation of the queue and the blocking logic, so that it would be possible to wait on, for example, two queues. But a blocking queue can still be provided as a wrapper around them, as a reasonable compromise between usage simplicity and flexibility (and also for backward compatibility, of course).