Intel® oneAPI Threading Building Blocks

Unusual concurrent_bounded_queue behavior when used with SCHED_FIFO threads at realtime priorities

rishia
Beginner
We have an application in which 10 "receive" threads receive data from a socket, do some processing, and enqueue it into one of 6 TBB concurrent bounded queues by calling push(). There are 6 "worker" threads (one associated with each of the 6 concurrent queues) that retrieve this data by calling pop(). These 6 worker threads don't block on anything at all. The application requires low latency and sees lots of data in short bursts, with relatively long periods of inactivity between the bursts. We thought a concurrent queue was the best way to achieve low latency, and we decided not to use futexes, pthread mutexes, condition variables, etc.
Now the problem: when we run this application with all 16 threads using the default SCHED_OTHER scheduling policy, all is well. In this mode we also let the scheduler decide which of the server's 16 cores to use for these 16 threads. In a different mode, we run the application with all 16 threads using SCHED_FIFO and a realtime priority of 50 for every thread. In this mode, most of the time everything is okay, but every now and then when we restart the application (empirically, about 5% of the time), either the 10 "receive" threads spin at 100% CPU utilization or the 6 "worker" threads spin at 100% CPU utilization. When this happens, the application is essentially hung. If the "worker" threads are spinning, they don't dequeue any data at all; all receive threads continue to call push() on the queues, and eventually the system runs out of memory, causing the app to terminate. If the "receive" threads are spinning, no data is dequeued from the socket buffers and no data is enqueued into the TBB queues. The app doesn't terminate, but since nothing is happening we end up needing to restart it, and usually everything comes up just fine after the restart. Sometimes the entire server is toast and we can't even ssh into it; my guess is that in this case all 16 cores are pegged at 100%, not allowing sshd to run (ssh is the primary way we access this server).
Any ideas? When the "worker" threads are spinning, my guess is that they must be spinning on pop() because the TBB queues are empty at app startup, but they should then retrieve the data once the "receive" threads start push()-ing data onto the queues. Also, it's beyond me why the "receive" threads spin at all. The queues are large enough that it usually takes 30-40 minutes for them to fill up if the "workers" aren't dequeuing any data. The only reason for the "receive" threads to spin should be the queues getting full, but it looks like they spin shortly after the app starts up, when the queues are empty.
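To make the structure concrete, here is a stripped-down sketch of what the threads do. This is not our actual code; Packet, NUM_QUEUES, NUM_RECEIVERS and the socket/processing steps are placeholders, and how a receiver picks a queue is described further down the thread.

#include <tbb/concurrent_queue.h>
#include <cstdint>
#include <thread>
#include <vector>

struct Packet { std::uint64_t widget_id; /* payload omitted */ };

constexpr int NUM_QUEUES    = 6;
constexpr int NUM_RECEIVERS = 10;

static tbb::concurrent_bounded_queue<Packet> queues[NUM_QUEUES];

void receiver_loop() {
    for (;;) {
        Packet p{};              // real app: read and parse one packet from the socket
        int q = 0;               // real app: pick one of the 6 queues for this packet
        queues[q].push(p);       // blocks only if the chosen queue is full
    }
}

void worker_loop(int q) {
    for (;;) {
        Packet p;
        queues[q].pop(p);        // blocks until an item is available
        // real app: process p
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < NUM_RECEIVERS; ++i) threads.emplace_back(receiver_loop);
    for (int q = 0; q < NUM_QUEUES; ++q)    threads.emplace_back(worker_loop, q);
    for (auto& t : threads) t.join();  // the loops never return in this sketch; the real app runs until restarted
}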
Roman
Beginner
Try disabling IPO in the ICC options. I've discovered a similar strange error but can't construct a test case... I'm using single-file IPO for selected .cpp files to work around this error.
Alexey-Kukanov
Employee

In concurrent_bounded_queue, push() sleeps if no slot is available and pop() sleeps if there is no data. There can be short periods of busy-waiting (spinning), but long waits are always passive. On the other hand, TBB was not designed for real-time use, so I would expect surprises.

How does a receiver thread select a queue to push data into - randomly?

Would you be able to show thread stacks when they spin?

rishia
Beginner
Alexey, thanks for your response. A receiver thread selects a queue based on the payload data it receives in the packets from the network. Each multicast packet received contains data that associates it with one of many "widgets". Picking which TBB queue to push the packet to is a simple matter of WidgetID % NUM_QUEUES.
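In code terms (a hypothetical illustration with made-up names, matching the placeholder in the sketch in my first post):

#include <cstddef>
#include <cstdint>

constexpr std::size_t NUM_QUEUES = 6;

// Every packet for a given widget hashes to the same queue, so one worker
// handles all traffic for that widget.
inline std::size_t queue_index(std::uint64_t widget_id) {
    return widget_id % NUM_QUEUES;
}

// In each receiver loop:
//   queues[queue_index(pkt.widget_id)].push(pkt);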
Now, we got lucky and had this problem happen again just today (happens only ~once a week). I have a core file generated from kill -11. Here's some pertinent information:
1. output from ps command listing thread IDs that were spinning. The "FF" in the "CLS" column indicates SCHED_FIFO threads. All threads with the RTPRIO of 75 are Realtime SCHED_FIFO "receive" threads. All threads with the RTPRIO of 50 are the SCHED_FIFO "worker" threads - their %CPU is pegged at 100%.
PSR TID CLS RTPRIO NI PRI %CPU COMMAND
13 582253 TS - 0 19 2.2 HandlerApp
12 582254 TS - 0 19 0.0 HandlerApp
7 582255 TS - 0 19 0.0 HandlerApp
1 582256 TS - 0 19 0.0 HandlerApp
7 582257 TS - 0 19 0.0 HandlerApp
14 582277 TS - 0 19 0.0 HandlerApp
14 582278 FF 50 - 90 99.4 HandlerApp
15 582279 FF 50 - 90 99.5 HandlerApp
16 582280 FF 50 - 90 99.4 HandlerApp
17 582281 FF 50 - 90 99.4 HandlerApp
18 582282 FF 50 - 90 99.5 HandlerApp
19 582283 FF 50 - 90 99.4 HandlerApp
20 582284 FF 50 - 90 99.4 HandlerApp
21 582285 FF 50 - 90 99.4 HandlerApp
22 582286 FF 50 - 90 3.1 HandlerApp
23 582287 FF 50 - 90 99.4 HandlerApp
0 582288 FF 75 - 115 0.0 HandlerApp
1 582289 FF 75 - 115 1.0 HandlerApp
2 582290 FF 75 - 115 0.0 HandlerApp
3 582291 FF 75 - 115 10.9 HandlerApp
4 582292 FF 75 - 115 3.7 HandlerApp
5 582293 FF 75 - 115 5.5 HandlerApp
6 582294 FF 75 - 115 1.0 HandlerApp
7 582295 FF 75 - 115 0.2 HandlerApp
8 582296 FF 75 - 115 1.7 HandlerApp
9 582297 FF 75 - 115 0.6 HandlerApp
2 582298 TS - 0 19 99.6 HandlerApp
2 582299 TS - 0 19 0.0 HandlerApp
0 582300 TS - 0 19 0.0 HandlerApp
2 582308 TS - 0 19 0.0 HandlerApp
0 582309 TS - 0 19 0.0 HandlerApp
2 582310 TS - 0 19 0.0 HandlerApp
10 582311 FF 60 - 100 0.0 HandlerApp
11 582312 FF 60 - 100 0.0 HandlerApp
12 582313 FF 60 - 100 0.0 HandlerApp
13 582314 FF 60 - 100 1.6 HandlerApp
2. From gdb, we identified the threads based on the thread IDs from the output above and got their stack traces. Here are some of them (of the 10 "worker" threads):
(gdb) thread 31
[Switching to thread 31 (process 582278)]#0 0x0000003a6b2cc057 in sched_yield () from /lib64/libc.so.6
(gdb) where
#0 0x0000003a6b2cc057 in sched_yield () from /lib64/libc.so.6
#1 0x00007ffff5393115 in tbb::internal::micro_queue::push (this=0x7fffec5ecca8, item=0x7fffec4f37a0, k=176, base=@0x7ffff039fb60) at ../../include/tbb/tbb_machine.h:163
#2 0x00007ffff53937f9 in tbb::internal::concurrent_queue_base_v3::internal_push (this=0x7ffff039fb60, src=0x7fffec4f37a0) at ../../src/tbb/concurrent_queue.cpp:385
(gdb) thread 30
[Switching to thread 30 (process 582279)]#0 0x0000003a6b2cc057 in sched_yield () from /lib64/libc.so.6
(gdb) where
#0 0x0000003a6b2cc057 in sched_yield () from /lib64/libc.so.6
#1 0x00007ffff5393115 in tbb::internal::micro_queue::push (this=0x7fffec5ecca8, item=0x7fffebcf27a0, k=184, base=@0x7ffff039fb60) at ../../include/tbb/tbb_machine.h:163
#2 0x00007ffff53937f9 in tbb::internal::concurrent_queue_base_v3::internal_push (this=0x7ffff039fb60, src=0x7fffebcf27a0) at ../../src/tbb/concurrent_queue.cpp:385
Thanks again for your help.
RafSchietekat
Valued Contributor III
Hmm, does a futex_wait count as a yield to move a thread to the back of its queue in SCHED_FIFO scheduling? Just a wild guess so far, but I've got to run.

(Added 2011-10-13) Sorry, I shouldn't have tried to rush a half-baked reply: after sleeping and becoming runnable again, the thread would be inserted at the end of the list for its priority. Is Roman's advice helpful?
rishia
Beginner
No worries! I didn't get a chance to try Roman's suggestion, only because we're using gcc and I haven't yet looked into gcc's equivalent for interprocedural optimization. One thing comes to mind though - if my threads are spinning on push(), doesn't that mean the concurrent queue is full? But I'm using an unbounded queue, and my memory utilization was still within the physical RAM limits, and swap was not being used at all. So the queues could not possibly have been "full". Are there other conditions that can cause push() to busy-wait indefinitely? I think the best thing to do is to try Roman's suggestion, although if it works, I'd still want to know why it worked.
RafSchietekat
Valued Contributor III
"One thing comes to mnd though - if my threads are spinning on push(), doesn't that mean the concurrent queue is full? But I'm using an unbounded queue, and my memory utilziation was still within the physical RAM limits, and swap was not being used at all. So, the queues could not possibly have been "full". Are there other conditions that can cause push() to busy wait indefinitely?"
I'll leave it to Alexey to provide authoritative answers about the current implementation.

"I think the best thing to do is try Roman's suggestion. Although if it works, I'd still want to know why it worked."
I don't see which version and package you tried, but perhaps if the _lin package is built with icc you might see an improvement from switching to the _src package instead? Or you could wait for what Alexey has to say about that.
Alexey-Kukanov
Employee
Indeed the threads seem to busy-wait in push(). However, I see a few inconsistencies in what you wrote in different places, so I would like to clarify.

First, do you use a bounded queue or an unbounded queue?
Second, you said that the threads that spin are "worker" threads, but worker threads should pop(), not push(). Did you make a mistake somewhere?
Third, you said there are 10 receiving threads and 6 worker threads, but the ps output shows 10 threads with priority 75 (receiving?) and 10 threads with priority 50 (workers?). And how many HW threads are on the machine, by the way?

By the way, one possible reason for endless busy-waiting is zero capacity of a bounded queue. Could there be a race somewhere between setting the capacity and starting to use the queue?
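To illustrate what I mean (a hypothetical snippet, not code from your application; it uses try_push() so the example itself does not hang):

#include <tbb/concurrent_queue.h>
#include <iostream>

int main() {
    tbb::concurrent_bounded_queue<int> q;
    q.set_capacity(0);          // e.g. a capacity value that was never initialized

    // With zero capacity there is never a free slot, so a blocking push()
    // would wait forever; at SCHED_FIFO priority that can starve other work.
    std::cout << std::boolalpha << q.try_push(1) << '\n';   // expected: false

    q.set_capacity(1000);       // with a sane capacity the same push succeeds
    std::cout << std::boolalpha << q.try_push(1) << '\n';   // expected: true
}

If the receivers can start pushing before set_capacity() runs with the intended value, they could hit exactly this state.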