I'm trying out TBB for an application I'd like to add concurrency to. The application has a database component, so an easy way to take advantage of a second core is to write the database requests into a queue, then have another thread pull them out and interact with the database.
This seems to work well and gives a pretty good speedup. However, I've noticed that when there isn't any database work to do, the threads waiting for database requests hog the CPU. The basic loop looks like this (I've edited it a bit, so please excuse typos):
tbb::concurrent_queue<DbRequest *> *reqQ;
while (true) {
    std::auto_ptr<DbRequest> req;
    {
        DbRequest *reqPtr;
        // This will block.
        reqQ->pop(reqPtr);
        req.reset(reqPtr);
    }
    if (!req.get()) {
        // A NULL request indicates the manager is done.
        break;
    }
    processRequest(req.get());
}
Looking at the TBB code a bit, it seems that tbb::concurrent_queue::pop uses a spinlock, so it's very inefficient when idle. Is that correct? Is there anything to be done about it, or, for a queue that isn't highly contended, should I just put a lock around a std::queue?
Thanks!
----Scott.
Yes, concurrent_queue does use a spin lock of sorts to manage access to the queue, but the story is more complicated than that. The concurrent_queue push and pop use calls to SpinwaitUntilEq and SpinwaitWhileEq to control access, but if you dig into them, you'll see that they actually run an exponential backoff algorithm that, past a threshold, requests a yield on the thread: inside that threshold the code spins on a pause instruction; outside it, the code yields, via sched_yield on Linux and Mac OS, or Sleep(0) on Windows.
The Sleep(0) documentation I found says:
A value of zero causes the thread to relinquish the remainder of its time slice to any other thread of equal priority that is ready to run. If there are no other threads of equal priority ready to run, the function returns immediately, and the thread continues execution.
The documentation for sched_yield gives very similar information. If there are no other threads ready for execution, this might look like spinning.
The intent of the concurrent containers is to operate well in a fine-grained parallel environment, so spin locks are the preferred method of operation. Backing off into a thread yield is intended to ensure TBB cooperates with other running threads. Pop is not meant to be used as a long-duration blocking call, hence this caution in the reference manual:
If the push or pop operations block, they block using user-space locks, which can waste processor resources when the blocking time is long. Class concurrent_queue is designed for situations where the blocking time is typically short relative to the rest of the application time.
If you need a low-contention queue, you might be better served by a lock around a std::queue (presuming the lock you choose doesn't also spin). Under low contention, misses on the lock are much less likely, and its heavier weight is much less of an issue.
Looking at the docs again, I see that warning. As you suggested, I tried re-implementing the queue as an STL queue with a boost::mutex (which uses pthreads mutexes underneath). This is much better about giving up the CPU when idle, but Intel's concurrent_queue is quite a bit faster under heavy load.
My application is a server, and I don't know at compile time if it's going to be heavily loaded on a dedicated machine, or lightly loaded on a machine with many other tasks, so neither queue is best under all circumstances. I guess I'll make it a compile-time option to pick which queue implementation to use.
It would be very convenient if the TBB concurrent_queue could back off into an OS mutex if it's idle for a while, providing the best of both worlds. For me at least, it would be very nice to just use it any time I needed a concurrent queue, instead of having to evaluate performance tradeoffs.
Thanks for your help!
---Scott.
A quick comment on sched_yield: Although using sched_yield() is certainly better than a tight busy loop it still consumes resources, especially if it is called many times. While this is by no means a heavyweight operation, it still requires at least a thread context switch and a system call.
For example, I wrote a simple program with two threads to do multiplication in a loop (I have a 2-core machine, so the two threads with work should keep both cores busy). I had each thread do a billion multiply instructions, and that took about 9.5 seconds to complete. If I add in another thread which is just doing a single pop() on an empty queue, it instead takes about 18.5 seconds for the multiplies to complete, nearly twice as long. During this time the spinlock calls sched_yield() over a million times, nearly 60,000 times a second!
By contrast, if I wait on a condition variable, the slowdown is too small to reliably measure, and it calls futex() 20 times.
Unfortunately I have just enough expertise in concurrency to know that implementing this will be much harder to get right than it looks at first glance, so I'll probably put it off until I'm convinced a compile-time option isn't workable.
Thanks again for your comments!
I'm glad that I could help. Regarding sched_yield, I haven't had time to test it myself, so I'll take your word for now about the latency of that call. However, the 18.5-second runtime with the pop() versus 9.5 seconds has at least one other explanation than the latency of sched_yield. You could have one thread busy doing multiplies and the other "blocked" on the pop, and so unable to steal multiply work from the first thread. If you're running on a two-core machine, try this simple experiment: in your task_scheduler_init call, use task_scheduler_init init(3); i.e., create a thread pool with three threads rather than two. If one thread is blocked on the pop, you still have two threads left to do the multiplies. Any degradation from the 9.5-second execution time then should be due to pop spin overhead rather than pulling one thread from the multiplication pool.
Regarding the implementation of the two queue solution, yes it can be tricky. You'd want to make sure to keep latency on the high contention queue to a minimum WITHOUT starving the tasks that are sitting on the low contention queue.
I'm not using TBB's scheduler; I'm explicitly creating boost::thread objects, and I can see with strace that the right number of threads are being created.
Here is the code I'm using; it's just a few dozen straightforward lines. It uses the Boost threading libraries, which are available from boost.org if you don't have them handy.
Run with no args for usage:
Usage: ./qtest threads [iterations]
  threads: Each letter causes a thread to start:
    w: Worker thread, does multiplication in a loop
    q: Waits on a pop() from a concurrent_queue
    m: Waits on a condition variable
  iterations: How many iterations the multiplication loops should do
With any worker threads, runs until they are all complete.
Otherwise, runs forever.
For example, to start one concurrent_queue pop thread and two worker threads, run:
./qtest qww
To start just two worker threads, run:
./qtest ww
You can adjust the number of iterations (which defaults to a billion) with a second parameter:
./qtest qww 10000000000
I've been using the OS time command to compare runtimes and CPU usage, and strace -c to count syscalls. For example:
$ strace -c -f time ./qtest qww
The results actually seem to change a bit depending on options to strace and whether strace or time is run first, but the timing is consistent when simply running ./qtest qww vs. ./qtest ww and watching a wall clock.
...
18.62user 2.67system 0:18.38elapsed 115%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+347minor)pagefaults 0swaps
Process 23223 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
49.78 2.676167 2676167 1 wait4
49.70 2.672167 167010 16 futex
0.52 0.028094 0 1125143 sched_yield
...
// File: qtest.cpp
// Compile: g++ -Wall -lboost_thread -ltbb qtest.cpp -o qtest
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <list>
#include <memory>
#include <boost/thread.hpp>
#include <tbb/concurrent_queue.h>
using std::cerr;
using std::endl;
using std::cout;
using std::auto_ptr;
void usage(const char *prog) {
    cerr << "Usage: " << prog << " threads [iterations]\n"
         << "  threads: Each letter causes a thread to start:\n"
         << "    w: Worker thread, does multiplication in a loop\n"
         << "    q: Waits on a pop() from a concurrent_queue\n"
         << "    m: Waits on a condition variable\n"
         << "  iterations: How many iterations multiplication loops should do\n"
         << "With any worker threads, runs until they are all complete.\n"
         << "Otherwise, runs forever.\n"
         << endl;
}
/*! Contains a function operator that will do a pop on a concurrent_queue in a loop,
 * printing every time it pops.
 */
struct PopQ {
    void operator()() {
        auto_ptr<tbb::concurrent_queue<int> > q(new tbb::concurrent_queue<int>());
        while (1) {
            int i;
            q->pop(i);
            cout << "Popped!" << endl;
        }
    }
};
/*! Contains a function operator that will do a wait on a condition variable in a loop,
 * printing every time it wakes up.
 */
struct CondWait {
    void operator()() {
        auto_ptr<boost::mutex> mutex(new boost::mutex());
        auto_ptr<boost::condition> cond(new boost::condition());
        while (1) {
            boost::mutex::scoped_lock lock(*mutex);
            cond->wait(lock);
            cout << "Woke up!" << endl;
        }
    }
};
/*! Do a series of simple multiplies in a loop.
 */
struct TimeWaster
{
    /*! count says how many times to loop. */
    TimeWaster(long long _count) : count(_count) { }
    void operator()()
    {
        long long num = 1;
        for (long long i = 0; i < count; ++i) {
            num *= 3; // keep the core busy with multiplies
        }
    }
    long long count;
};
int main(int argc, char *argv[])
{
    long long loopSize = 1000000000LL; // Default: 1 billion
    switch (argc) {
    default:
        usage(argv[0]);
        exit(1);
    case 3:
        loopSize = atoll(argv[2]);
        /* Fall through */
    case 2:
        /* OK, expected */
        ;
    }
    std::list<boost::thread *> workThreads;
    std::list<boost::thread *> waitThreads;
    // Parse thread string
    for (unsigned i = 0; i < strlen(argv[1]); ++i) {
        switch (argv[1][i]) {
        case 'q':
            // concurrent_queue waiter thread
            waitThreads.push_back(new boost::thread(PopQ()));
            break;
        case 'm':
            // Condition variable waiter thread
            waitThreads.push_back(new boost::thread(CondWait()));
            break;
        case 'w':
            // Worker thread
            workThreads.push_back(new boost::thread(TimeWaster(loopSize)));
            break;
        }
    }
    if (workThreads.empty()) {
        // This should wait forever.
        (*waitThreads.begin())->join();
    } else {
        // Wait until all worker threads are done
        while (!workThreads.empty()) {
            cout << "Waiting for worker thread to finish" << endl;
            (*workThreads.begin())->join();
            delete *workThreads.begin();
            workThreads.pop_front();
        }
    }
    return 0;
}
To #7: if you snooze, you lose (performance).
To #8: it's a tbb_thread for interacting with the database; it doesn't have anything else to do.
Can someone explain how pop differs from pop_if_present + sleep in hardware terms? I was thinking that pop would also make the thread sleep and check the queue periodically by waking up. Is that not true?
Thanks in Advance,
Gokul.
Well, I don't necessarily agree with the current implementation either, but I have not given it much attention yet. Locking well (being reactive without running the whole system into the ground) seems to be a fine art, with little use for fixed-length naps.
Can someone explain how pop differs from pop_if_present + sleep in hardware terms? I was thinking that pop would also make the thread sleep and check the queue periodically by waking up. Is that not true?
Pop blocks the thread until a new item arrives. This has two advantages:
1. No unnecessary waking and rechecking, which senselessly eats CPU time.
2. Better responsiveness: the OS can schedule the thread faster when a new item arrives.
In my opinion, it would be better to completely separate the queue and the blocking logic. I.e., there would be a nonblocking_concurrent_queue and an eventcount (or gate, if you prefer). This would allow reusing the blocking logic with other containers, reusing the queue with different implementations of blocking logic, and blocking on several queues simultaneously (something that is not possible now).
In my opinion, it would be better to completely separate the queue and the blocking logic. I.e., there would be a nonblocking_concurrent_queue and an eventcount (or gate, if you prefer). This would allow reusing the blocking logic with other containers, reusing the queue with different implementations of blocking logic, and blocking on several queues simultaneously (something that is not possible now).
I agree that letting users pass in the blocking strategy as an object is a more flexible design than letting them make a selection among fixed blocking alternatives. But regardless of how it's implemented, it would be a good thing if concurrent_queue didn't have just one blocking strategy.
Pop blocks the thread until a new item arrives. This has two advantages:
1. No unnecessary waking and rechecking, which senselessly eats CPU time.
2. Better responsiveness: the OS can schedule the thread faster when a new item arrives.
Actually, I have trouble understanding "Pop blocks the thread until a new item arrives." Are you saying this is a hardware interrupt that invokes the thread? To my limited knowledge, the only ways this can be done are by interrupt or by polling. You have already categorically rejected the option of polling, so an interrupt should be the only way. Is there a hardware interrupt to wake up a software thread at the occurrence of an event? I'm asking because of my ignorance in analyzing the source code...
Actually, I have trouble understanding "Pop blocks the thread until a new item arrives." Are you saying this is a hardware interrupt that invokes the thread? To my limited knowledge, the only ways this can be done are by interrupt or by polling. You have already categorically rejected the option of polling, so an interrupt should be the only way. Is there a hardware interrupt to wake up a software thread at the occurrence of an event? I'm asking because of my ignorance in analyzing the source code...
The consumer thread blocks on a kernel synchronization object (a semaphore or event). When the producer produces a new item, it signals the kernel synchronization object. The OS puts the consumer thread back on the runnable queue, and eventually the consumer thread is scheduled for execution. Something like that.
All interaction is done via the kernel synchronization object. I think that on a uniprocessor machine there is really no need for hardware interrupts. On a multiprocessor machine, IPIs (inter-processor interrupts) may possibly be involved, if there is an idle processor at the time the thread is put back on the runnable queue. But user-space code (TBB) never works with, or mentions, hardware interrupts.
I agree that letting users pass in the blocking strategy as an object is a more flexible design than letting them make a selection among fixed blocking alternatives. But regardless of how it's implemented, it would be a good thing if concurrent_queue didn't have just one blocking strategy.
Actually, I was talking about full separation of the queue and the blocking logic, so that it would be possible to wait on, for example, two queues. But a blocking queue can still be provided as a wrapper around them, as a reasonable compromise between usage simplicity and flexibility (and also for backward compatibility, of course).