Intel® oneAPI Threading Building Blocks

sched_yield when pipeline input stage blocks leads to high CPU usage

Vivek_Rajagopalan
Hi everyone,

Thanks to all involved in creating/supporting this fantastic library. I have been using it for a long time and it's been working flawlessly.

The main TBB construct we are using is the pipeline. The input stage listens to network traffic and is therefore susceptible to blocking. In normal or high loads, things hum along smoothly because there is always enough work for the scheduler to map. However, in light or very light load conditions there isn't enough work, and that's where we are running into a little turbulence.

The trouble starts when the input filter blocks on a select/epoll and downstream tasks quickly dry up. This seems to drive the TBB scheduler into the following loop (we are on Linux):

1. Spin for a while looking for tasks to steal
2. If nothing then spin on PAUSE CPU instruction for a little longer
3. If still nothing then toss in a sched_yield to the loop
4. Go to 1

( The above is what I could glean from looking at the source code, could be off a bit )
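In rough C++ terms, my mental model of that wait loop is something like the sketch below. This is just my own illustration of the spin/pause/yield backoff pattern, not the actual TBB source; the names, threshold, and the g_pending_work flag are all made up.

#include <atomic>
#include <sched.h>        // sched_yield()
#include <emmintrin.h>    // _mm_pause()

std::atomic<int> g_pending_work(0);   // stand-in for "is there a task to steal?"

void wait_for_work() {
    int failure_count = 0;
    const int pause_threshold = 16;   // made-up threshold; TBB uses its own
    while( g_pending_work.load(std::memory_order_relaxed) == 0 ) {  // 1. look for work to steal
        if( failure_count < pause_threshold )
            _mm_pause();              // 2. spin on the PAUSE instruction
        else
            sched_yield();            // 3. toss in a sched_yield
        ++failure_count;              // 4. go to 1
    }
}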

The end result is that CPU usage shoots up close to 100% when there is little or no work. Now I don't have much of a problem with this:

1. The PAUSE instruction conserves CPU power (I read about it in an Intel doc somewhere, sorry, I am unable to locate it now). Is this really the case?
2. sched_yield allows any other process to be chosen by the Linux scheduler. It isn't our fault that there is no one around interested in running!

The problem is that the Linux program top shows 99-100%, and I am unable to explain satisfactorily to our users that this is okay and that our software is just executing PAUSE and constantly yielding. This brings me to the point of this post. Can we write a custom scheduler that will add a step 5? I don't want to rewrite the whole scheduler - just to introduce a new step:

5. If after yielding for a while there still isn't anything to steal, go to sleep or wait on a synchronization object (say a condition variable)

The only goal of this hack is to bring CPU usage down under light load. As a test I tried introducing this bit of code in the receive_or_steal_task method in custom_scheduler.h:

if( failure_count>=yield_threshold+100 ) {
    usleep(20000); // <-- added this line: sleep for 20 ms (usleep() comes from <unistd.h>)

This decreased the CPU usage significantly under light load (from 99% to 3%) and everything *seems* okay for now.


However, I am a little wary that this might introduce race conditions. I am wondering if the TBB experts have any comments on this approach. Are there any other options to reduce the CPU usage shown by top?


---

PS: I considered restarting the pipeline when load is low. That would have been cleaner, but the problem is that once the pipeline starts, our software drops all privileges, so it can't restart again.

PS: I don't want to use raw threads, because the TBB pipeline models the problem so well. We have broken up the tasks into about 16 parallel filters - very classy and elegant, thanks to TBB. Except for this one problem, which again I think is not really an issue. But I have a hard time explaining it away.


Thanks in advance,
Anton_M_Intel
Employee
Hello, Vivek,
Thank you very much for the detailed description and for showing yourself as a real customer who is interested in a solution to this problem. We were aware of this so-called idle-spinning problem (EDIT: well, probably this one is related but somewhat different) but were not able to fix it in a way that doesn't harm the hot path under high load. However, there are still some ideas to try which are promising enough. So I hope your request will boost resources to check these ideas.
jimdempseyatthecove
Honored Contributor III
I am going to take you on a little thought experiment. Hold off on any "But that is not what I want to do." until after you digest what I have to say.

Assume your program is a shell of the program you want, whose only purpose is to read the input data and discard it without undue CPU overhead during periods of low packet rates. For reasons not stated here, you choose not to use interrupts/condition variables.

Essentially your main loop would have a polling input section containing the usleep(nnnn) as coded in your post. The usleep would not be used during high activity, only during low activity. The sleep period cannot be longer than the internal buffer overflow time. I cannot advise you as to whether 20000 is too large or too small.

Note, the above code has no thread scheduler. The system does, but your code and the libraries it uses have no task/thread scheduler. Therefore, there is no need for you to incorporate such a device into the TBB task scheduler.

This is what you do:

Configure your input pipe such that it incorporates this usleep stall loop during periods of low packet rates. The input pipe returns/exits only with a data packet or an end-of-program situation (observation of a global shutdown flag during the stall loop).
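Something along these lines, as a rough sketch only (Packet, recv_packet() and g_shutdown are placeholders standing in for your own types and code):

#include <unistd.h>            // usleep()
#include "tbb/pipeline.h"

struct Packet;                           // your packet type
extern volatile bool g_shutdown;         // your global shutdown flag
Packet* recv_packet();                   // your non-blocking read; NULL if nothing arrived

class InputFilter : public tbb::filter {
public:
    InputFilter() : tbb::filter(tbb::filter::serial_in_order) {}
    void* operator()(void*) {
        for(;;) {
            if( g_shutdown ) return NULL;        // end of program: terminate the pipeline
            if( Packet* p = recv_packet() )      // high activity: hand the packet downstream at once
                return p;
            usleep(20000);                       // low activity: stall here, then poll again
        }
    }
};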

Note, the remaining threads (all TBB threads except the one stuck in the input pipe) will eventually suspend, as there are no tasks to run (they are waiting for a token).

Optional: under the circumstance that your app experiences significant usleep sleep time while there is work being done by other threads and work pending, consider oversubscribing by 1 thread over what is available.

Jim Dempsey
RafSchietekat
Valued Contributor III
"PS: I considered restarting the pipeline when load is low. That would have been cleaner but the problem is once the pipeline starts, our software drops all privileges so it cant restart again."
Can you explain that? I don't see how starting and stopping the pipeline should be linked to any privileges, and it seems only polite not to tie up the input stage in unduly long waits, unless you know there's no other work that could be done at the same time (in the same program).

I presume that this 100% is for one core, not the whole machine. That would be the situation, described before on this forum, where a worker thread is trapped in the input stage, and the master thread will never go to sleep for fear it might never be able to wake up again (or maybe it's just paranoid about those workers slacking on the job...). You should find that sometimes, if it is the master that gets trapped, CPU use does go down to 0% (so you wouldn't see this with only one thread).

Even with pause to alleviate the waste of energy (not just relevant on mobile devices!), it will waste machine resources that other programs might be using, but I don't think that brief sleeps do you much good on that front (correct me if I'm wrong), so the result might merely be still better energy use and... cosmetically-only improved machine use.

If you really cannot stop and restart the pipeline, the hack to elongate the pause seems acceptable if it gives you the results you want, because the thread gets itself out of that part of the code after a brief sleep and only gets back there if it again finds no work, so there's no problem with starvation or worse (it has no relation with race conditions in the strict sense). I would put the sleep after the part of the code that allows a worker to retire, though (the conditional code with the return NULL inside it).

Note that a pipeline with nothing but parallel filters, except perhaps for cancellability and interrupt behaviour, is practically equivalent to parallel_while/parallel_do with all those stages called one after the other: it's only when serial stages occur that pipeline gets to show its smarts.
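For example, something like this (a sketch only; Item and the stage functions are placeholders): an all-parallel pipeline could just as well be written as

#include <vector>
#include "tbb/parallel_do.h"

struct Item { /* ... */ };

// Placeholders for what the parallel filters would have done:
inline void stage1(Item&) {}
inline void stage2(Item&) {}
/* ... */

void process(std::vector<Item>& items) {
    tbb::parallel_do( items.begin(), items.end(), [](Item& item) {
        stage1(item);   // the former parallel filters, simply called back to back
        stage2(item);
        /* ... stage3 .. stageN ... */
    } );
}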
Alexey-Kukanov
Employee
The problem must have been discussed here in the past; you may search for pipeline in the forum.

The problem is that if it's not the main thread that blocks in the input filter, then the main thread can't do anything but spin-wait, as you described. All extra worker threads should go to sleep, but the main thread cannot. This is what Anton mentioned as the known issue that we are still thinking about how to address.

As for the hack you mentioned, it seems safe to me. It certainly does not *introduce* any race conditions that were not there before.

Also note that top, on Linux, shows CPU load as a percentage of a single core, unlike the Windows Task Manager, which shows a percentage of the sum of all cores. So if you have 8 cores running full-speed under Linux, top will show 800%. When top shows 100%, it means just a single core is busy, or 12.5% of the whole machine.
Vivek_Rajagopalan
Hi Jim,
Yes, the delay of 20 ms was chosen because we figured the input filter could buffer up to 20 ms of data before returning the work item to the pool. The hope is that the scheduler would either be spinning, or in the sched_yield stage, or would have woken up after the 20 ms delay. But we risk dropping some work if the data rate fluctuates wildly and overwhelms the buffer.
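(For instance, at a hypothetical sustained rate of 1 Gbit/s, a 20 ms nap means roughly 125 MB/s x 0.02 s = 2.5 MB that the ring buffer must absorb before we wake up again, so the buffer has to be sized with the worst expected burst in mind.)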
I am having difficulty with this: "Configure your input pipe such that it incorporates this usleep stall loop". Do you mean I should look at changing how the pipeline class works rather than the scheduler? I tried to do that initially, but I could not find a place in pipeline.h where I could sneak in this behavior. The pipeline spawns the root task and that's it.
Thanks,
Vivek_Rajagopalan
Hi Raf,
Thanks for such a detailed reply,
>>Can you explain that? I don't see how starting and stopping the pipeline should be linked to any privileges, and it seems only polite not to tie up the input stage in unduly long waits, unless you know there's no other work that could be done at the same time (in the same program).
The app is a packet processing app. The input filter is a serial filter that reads packets from a Linux RX_RING socket. The app first starts as root, and the pipeline is born and is expected to last as long as the app runs. The input filter (when running as root) opens and configures the socket but once that is done drops down to a normal user. We can detect low load in the input filter based on timeouts and shut the pipeline down (by returning a NULL token), but restarting won't work because we no longer have the root privileges needed to open the socket. Also if the input filter stops, all tasks downstream dry up fast (in less than 1s).
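For context, the startup sequence is roughly like this (a sketch only; the user name, error handling and ring setup are placeholders, not our real code):

#include <sys/socket.h>
#include <linux/if_packet.h>    // PACKET_RX_RING
#include <linux/if_ether.h>     // ETH_P_ALL
#include <arpa/inet.h>          // htons()
#include <unistd.h>             // setuid(), setgid()
#include <pwd.h>                // getpwnam()

int open_rx_ring_socket() {
    // Needs root (CAP_NET_RAW): open the AF_PACKET socket and set up the RX ring.
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    // ... setsockopt(fd, SOL_PACKET, PACKET_RX_RING, ...) and mmap() of the ring ...
    return fd;
}

bool drop_privileges(const char* user) {          // e.g. "nobody" (placeholder)
    passwd* pw = getpwnam(user);
    if( !pw ) return false;
    // Group first, then user; after this the already-open socket keeps working,
    // but a new RX_RING socket can no longer be opened.
    return setgid(pw->pw_gid) == 0 && setuid(pw->pw_uid) == 0;
}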
One thought was to make the lifetime of the filter longer than the life of the pipeline. This however doesn't feel right and is too hairy due to the way the tokens are set up in our app.
>>I presume that this 100% is for one core, not the whole machine. That would be the situation, described before on this forum, where a worker thread is trapped in the input stage
Absolutely, I see both cases. For example, we have an instance of our app on a cloud, and we get CPU usage alerts alternating: 100% use for 30 minutes, then 0% for the next 30 minutes. I guess a worker was trapped, then a burst of traffic cleared that, then the master got trapped. And when this happens it is always 100% (as observed by top), even when there are 8 hardware threads.
Honestly, I am going for the 'cosmetic' :-) to prevent these kinds of CPU alerts that the cloud providers send you, and also to avoid having to explain to every sysadmin why our app is so hungry!
Lastly,
>> I would put the sleep after the part of the code that allows a worker to retire, though (the conditional code with the return NULL inside it).

Would this (line 288) be the part where this condition is checked? The sleep happens after this, so I guess it might be OK.
281     if( failure_count>=yield_threshold+100 ) {
282         // When a worker thread has nothing to do, return it to RML.
283         // For purposes of affinity support, the thread is considered idle while in RML.
284 #if __TBB_TASK_PRIORITY
285         if( return_if_no_work || my_arena->my_top_priority > my_arena->my_bottom_priority ) {
286             if ( my_arena->is_out_of_work() && return_if_no_work ) {
287 #else /* !__TBB_TASK_PRIORITY */
288         if ( return_if_no_work && my_arena->is_out_of_work() ) {
289 #endif /* !__TBB_TASK_PRIORITY */
290             if( SchedulerTraits::itt_possible )
291                 ITT_NOTIFY(sync_cancel, this);
292             return NULL;
293         }
294 #if __TBB_TASK_PRIORITY
295         }
A bit about the app:
The pipe uses two input serial filters
  1. Acquire data
  2. Stateful reassembly (e.g. of fragmented packets, etc.)
There is also an output serial filter
  1. Write to file
Between them there are about 12 parallel filters (basically for metering of traffic data). Since the serial filters don't do much, I am able to achieve great results. Before TBB I was using raw threads and it wasn't pretty. A hidden benefit is that the TBB pipeline forced us to think in terms of units of work and filters - to identify and split out the work that can be done in parallel. This resulted in a very modular internal architecture which would have required tremendous discipline & complexity to achieve with threads.
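For reference, the shape of the pipe is roughly this (a sketch with placeholder filter bodies, not our real code):

#include "tbb/pipeline.h"

class AcquireFilter : public tbb::filter {                 // serial input filter 1
public:
    AcquireFilter() : tbb::filter(tbb::filter::serial_in_order) {}
    void* operator()(void*) { return NULL; }               // real code: return next packet, NULL to stop
};

class ReassemblyFilter : public tbb::filter {              // serial input filter 2
public:
    ReassemblyFilter() : tbb::filter(tbb::filter::serial_in_order) {}
    void* operator()(void* item) { return item; }          // real code: stateful reassembly
};

class MeterFilter : public tbb::filter {                   // one of the ~12 parallel metering filters
public:
    MeterFilter() : tbb::filter(tbb::filter::parallel) {}
    void* operator()(void* item) { return item; }
};

class WriterFilter : public tbb::filter {                  // serial output filter
public:
    WriterFilter() : tbb::filter(tbb::filter::serial_in_order) {}
    void* operator()(void* item) { return item; }          // real code: write to file
};

void run_pipeline() {
    tbb::pipeline pipe;
    AcquireFilter acquire; ReassemblyFilter reassemble;
    MeterFilter meter;     WriterFilter writer;
    pipe.add_filter(acquire);
    pipe.add_filter(reassemble);
    pipe.add_filter(meter);        // ...plus the other metering filters
    pipe.add_filter(writer);
    pipe.run(16);                  // max number of tokens in flight
}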
The pipeline is such a beautiful fit for this use case. But I wonder if all TBB projects that model data acquisition followed by data processing run into this issue at some point.
PS: If you are interested, this is the app: trisul.org
RafSchietekat
Valued Contributor III
"The input filter (when running as root) opens and configures the socket but once that is done drops down to a normal user."
Even without any need to stop and restart the pipeline it would seem cleaner to keep such setup code outside the pipeline.

"Also if the input filter stops, all tasks downstream dry up fast (in less than 1s)."
I don't know if we're talking about the same thing, but (this time) I did disregard the cost of frequent pipeline stalls (in the form of undersubscription and extra latency), for which I don't know an easy solution or workaround.

"Honesly I am going for the 'cosmetic' :-) to prevent these kinds of CPU alerts the cloud providers send you and also having to explain to every sysadmin why our app is so hungry !"
Hmm, would it be doable to charge clients directly for metered power and cooling costs, technically and commercially... What would a hotel owner do if he had the monitoring tools to identify which guest it is that doesn't bother turning off the faucet when his bath is full?

"Would this (line 288) be the part where this condition is checked, the sleep happens after this. So I guess it might be ok."
I meant the whole block starting at line 285 as numbered above. It doesn't really matter, of course, it just seemed a bit strange to let a worker still get a quick nap before retiring, and it might be even better to only let a master take that nap (but I haven't checked how to do that).
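Roughly like this (a sketch against the excerpt quoted above; the exact guard may well differ):

if( failure_count>=yield_threshold+100 ) {
    // ... the existing block (lines 282-295 above) that may retire a worker
    //     via "return NULL" stays first ...
    usleep(20000);   // nap only after the retirement check; to restrict the nap to
                     // the master, one could perhaps test !return_if_no_work here
}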

"But I wonder if all TBB projects that model a data acquisition followed by data processing run into this issue at some point."
I'm afraid so.