Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

TBB's Unfair Task Scheduling

Shankar1
Beginner
I have a couple of questions regarding TBB's unfair scheduling.

1. I read through the user manual section that describes the task scheduler, and this is what it says:

" Tasks in Intel Threading Building Blocks are efficient too because the scheduler is unfair. Thread schedulers typically distribute time slices in a round-robin fashion. This distribution is called fair, because each logical thread gets its fair share of time. Thread schedulers are typically fair because it is the safest strategy to undertake without understanding the higher-level organization of a program. In task-based programming, the task scheduler does have some higher-level information, and so can sacrifice fairness for efficiency. Indeed, it often delays starting a task until it can make useful progress."

As I understand the above, the TBB scheduler makes sure that the thread executing a task runs until the task completes. Is this correct? If so, how is this done?

2. "Indeed, it often delays starting a task until it can make useful progress."
What does the above line mean?

3. I'm planning to use the TBB scheduler in my application, in which I have a thread (a boost thread) that waits for data packets to arrive. The boost thread creates a tbb::task, spawns it to the task scheduler, and waits for the next data packet. The frequency at which the data packets arrive is not deterministic (they may arrive at a high frequency and sometimes at a low frequency), hence the decision to use a boost thread to wait on the data packets and spawn tasks for processing each packet.

Let's say I initialize the scheduler with the number of physical threads (say 4). Also assume the boost thread reads four data packets, spawns tasks for them to the task scheduler (using task::spawn()), and waits for further data packets to arrive. All four worker threads in the scheduler steal a task each and start executing it. Now suppose a data packet arrives while all the worker threads are busy executing their tasks.

The question is: when will the boost thread get its turn to collect the data packet and spawn a task? Will it get its turn only when one of the worker threads finishes executing its task? How will the overall scheduling work now? Will it be fair or unfair?

4. In the above case the boost thread will always spawn tasks into its own ready pool, and the worker threads will always try to steal from the ready pool of the boost thread. Is it possible to target the spawn into the ready pools of the other worker threads in the scheduler (perhaps randomly)? That makes sense to me because the boost thread will never process tasks in its own ready pool (they are always stolen by the worker threads in the scheduler).

--------
Sankar
Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
"So I said that if tasks are large, there is little sense in distributed queues."
Why is that decision related to size?

Distributed queues/deques are employed in order to minimize per-task overheads. If your tasks are so large that the current per-task overheads are negligible, then you do not need distributed queues/deques.

RafSchietekat
Valued Contributor III
Quoting - Dmitriy Vyukov
Distributed queues/deques are employed in order to minimize per-task overheads. If your tasks are so large that the current per-task overheads are negligible, then you do not need distributed queues/deques.
I will have to take your word on that, because I don't know enough about queue implementation to draw any conclusion about per-task overhead.
Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
"And it's not the case that there is only one thread to steal from. All threads but one may have non-empty queues."
In this case it is in fact just the I/O thread that has (the most eligible) stealable work.

Nope, in my design the IO thread immediately distributes tasks evenly over a pool of worker threads. So all worker threads may have work to do, and they may steal from each other.

And if you switch to discussing another design, then I do not quite see the point anyway. If only the IO thread has stealable work, then what polling and what inefficiencies are you talking about? Other threads will steal work only from the IO thread: no polling, no inefficiencies.
Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
"What exactly do you mean by affinity mailboxes? How do they differ from explicit queues?"
Explicit queues are so... explicit. :-)


Sorry, I do not understand you.

Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
"Do you mean hack with allocate_additional_child_of()? Well... it's one of the options to "hack TBB scheduler" :)"
Seems more like normal use to me.

As far as I understand, any normal use of TBB implies that the thread that submits a root task blocks until it completes. And that's what we are trying to avoid here. Any way around that is a hack, IMHO.

RafSchietekat
Valued Contributor III
#23 "Nope, in my design the IO thread immediately distributes tasks evenly over a pool of worker threads. So all worker threads may have work to do, and they may steal from each other."
Well, "in this case" means (mostly) the original design. In this case (still the same...), the only thing that workers could steal from each other is parts of a packet, but it would be better to steal a full packet from the I/O thread instead.

#23 "And if you switch to discussing another design, then I do not quite see the point anyway. If only the IO thread has stealable work, then what polling and what inefficiencies are you talking about? Other threads will steal work only from the IO thread: no polling, no inefficiencies."
Stealing is by random choice, so with a lot of worker threads there's a lot of opportunity to make a wrong or suboptimal choice, which can lead to a potentially significant waste of performance.

#24 "Sorry, I do not understand you."
Affinity mailboxes seem to offer exactly the qualities that you would look for to help workers find the most useful thing to do: a pointer to the task, and built-in overridability in case another thread wants to have a go. So why reimplement that outside of the scheduler...
Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
I will have to take your word on that, because I don't know enough about queue implementation to draw any conclusion about per-task overhead.

Queue implementation is irrelevant here.
Regardless of the queue implementation, distributed queues (where only one thread accesses each end of a queue most of the time) have lower overhead than a single centralized queue. So if tasks are relatively small, you prefer distributed queues.
And if intra-task parallelization is worth doing, i.e. tasks are at least 1 ms, then queuing overheads are irrelevant, because every sane queue will have negligible overheads.
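The distributed-deque idea being discussed can be illustrated with a toy per-thread work deque: the owner pushes and pops LIFO at one end (good cache locality), while thieves take the oldest task from the other end. This is only a sketch under simplifying assumptions (real work-stealing deques, including TBB's, are lock-free; `WorkDeque` is a made-up name and a mutex stands in for the real synchronization):

```cpp
#include <deque>
#include <mutex>
#include <optional>

// Toy per-thread work deque: owner operates on the back, thieves on the front.
template <typename Task>
class WorkDeque {
    std::deque<Task> d_;
    mutable std::mutex m_;
public:
    void push(Task t) {  // owner adds newly spawned work
        std::lock_guard<std::mutex> lk(m_);
        d_.push_back(std::move(t));
    }
    std::optional<Task> pop() {  // owner takes the newest task (LIFO)
        std::lock_guard<std::mutex> lk(m_);
        if (d_.empty()) return std::nullopt;
        Task t = std::move(d_.back());
        d_.pop_back();
        return t;
    }
    std::optional<Task> steal() {  // thief takes the oldest task (FIFO)
        std::lock_guard<std::mutex> lk(m_);
        if (d_.empty()) return std::nullopt;
        Task t = std::move(d_.front());
        d_.pop_front();
        return t;
    }
};
```

Because the owner and the thieves work on opposite ends, contention is rare when tasks are plentiful, which is where the per-task overhead savings of distributed deques come from; with millisecond-sized tasks, as Dmitriy notes, that saving stops mattering.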

jimdempseyatthecove
Honored Contributor III

Sankar,

Coordinating compute-class tasks with I/O-bound tasks is problematic with TBB. This is not the case with other parallel tasking systems such as QuickThread (www.quickthreadprogramming.com), which has two classes of thread pools (compute and I/O) with an elegant inter-class communication capability. Here is a sketch of your program using QuickThread.

[cpp]// compute class task to process packet
void PacketProcessTask(Packet_T* packet)
{
	// may spawn multiple tasks
	// by using parallel_... templates
	// may include ArbitraryTask() below
}

// qtControl for I/O class task to write packet
qtControl SocketWriteControl;
// I/O class task to write packet
void SocketWrite(Packet_T* packet)
{
	// code to write to socket
	WriteToSocket(packet); // may wait for completion
}

// qtControl for I/O class task to read packets
// until done or until fatal error
qtControl SocketReadControl;
// I/O class task to read packets
// until done or until fatal error
void SocketRead()
{
	bool Done = false;
	while(!Done)
	{
		Packet_T* packet = ReadNextData(); // may wait for packet
		if(packet)
		{
			// schedule compute class task
			parallel_task(PacketProcessTask, packet);
		}
		else
		{
			// check for recoverable error
			// or fatal error
		}
	}
}

// arbitrary compute class task
// which includes enqueue of packet to the SocketWrite I/O class thread
void ArbitraryTask(Packet_T* packet)
{
	// code here
	// ...
	// now write packet using I/O class thread
	parallel_task(IO$, &SocketWriteControl, SocketWrite, packet);
	// code here
	// ...
}

// application
int _tmain(int argc, _TCHAR* argv[])
{
	qtInit(-1, 3);	// compute class = # HW threads
			// I/O class uses 3 threads
	// start "forever" SocketRead I/O class task
	parallel_task(IO$, &SocketReadControl, SocketRead);
	// While above runs
	// run parallel app code here
	// ...
	// end of program cleanup
	// Assure SocketRead task completed (with/without error)
	SocketReadControl.WaitTillDone();
	// SocketWrite may have pending I/O
	SocketWriteControl.WaitTillDone();
	return 0;
}
[/cpp]
The TBB developers need to examine ways to resolve this issue. What is said above also holds for similar parallel programming requirements such as parallel_pipeline; the QuickThread implementation of parallel_pipeline makes use of both classes of thread pools. With occasional prodding from others like me, the TBB development team will address this issue.

Dmitriy's suggestion of multiple dedicated queues is a good solution for certain types of problems. The sketch above is more of a general solution. Optimal performance may require a blend of techniques.

Jim Dempsey
www.quickthreadprogramming.com


RafSchietekat
Valued Contributor III
Quoting - Dmitriy Vyukov
As far as I understand any normal use of TBB implies that thread that submits root task blocks until it completes. And that's what we are trying to avoid here. Any way around that is a hack IMHO.
More like a normal workaround to bridge the blocking-I/O world and the computational world, IMHO. Besides, TBB uses this internally several times.
Shankar1
Beginner
Quoting - Dmitriy Vyukov

I just meant: how does a worker thread handle the absence of work? It may just sleep for some time and then recheck the queues, sleep some more, recheck, and so on (this is called spinning). Or it may block on some kernel primitive (semaphore, condition variable, event) until new work arrives (this is called kernel blocking).

Oh ok. This is clear.
Shankar1
Beginner
Quoting - Dmitriy Vyukov

However, note that your tasks must be large enough for intra-task parallelization to make sense. By large enough I mean, let's say, larger than 1 ms. Otherwise you had better not do any intra-task parallelization at all.


My tasks definitely take more than 1 ms. I can even say for sure that the split tasks would take more than 1 ms.
Shankar1
Beginner
Coordinating compute-class tasks with I/O-bound tasks is problematic with TBB. This is not the case with other parallel tasking systems such as QuickThread (www.quickthreadprogramming.com), which has two classes of thread pools (compute and I/O) with an elegant inter-class communication capability.

Oh ok. Is QuickThread available so that I can try my sample there?
Shankar1
Beginner
Raf Schietekat, Dmitriy Vyukov

Ok, from what I can understand from all of your inputs, the following holds:

1. Since the frequency at which the data packets arrive is not deterministic and it's IO bound, it makes sense to create an IO thread to receive data packets.

2. Since my tasks are sufficiently intra-parallelizable (and also recursively parallelizable), it makes sense to use the TBB scheduler.

3. However, the only caveat with the above solution is that the worker threads in the TBB scheduler would make a number of unsuccessful (or inefficient) stealing attempts in trying to steal from the other worker threads' ready pools. So it makes sense to somehow target the tasks made by the IO thread onto the worker threads.

So to work around this, one way is to make the IO thread create N dummy tasks and allocate additional children of the dummy tasks (as and when a data packet is available) in a round-robin fashion. This way the tasks created by the IO thread under a dummy task would be targeted to the worker thread that stole that dummy task. Is this correct?
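The round-robin targeting described above can be sketched independently of TBB as dealing packets across one logical pool per worker. `RoundRobinDistributor` is a hypothetical name for illustration only; in the actual workaround each "pool" would be a dummy task receiving children via allocate_additional_child_of():

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Sketch: one queue per worker, filled round-robin by the I/O thread so
// that no single ready pool becomes the only stealing target.
template <typename Task>
class RoundRobinDistributor {
    std::vector<std::queue<Task>> pools_;  // stand-ins for per-worker ready pools
    std::size_t next_ = 0;                 // next worker to receive a task
public:
    explicit RoundRobinDistributor(std::size_t n_workers) : pools_(n_workers) {}
    void submit(Task t) {
        pools_[next_].push(std::move(t));
        next_ = (next_ + 1) % pools_.size();  // advance round-robin cursor
    }
    std::size_t pool_size(std::size_t i) const { return pools_[i].size(); }
};
```

With 4 workers and 10 submitted packets, the pools end up holding 3, 3, 2, and 2 tasks: the load is spread without any worker having to guess where the work is.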

What I didn't understand is the affinity stuff that you were talking about. Will it help me in any way?
RafSchietekat
Valued Contributor III
I was just speculating whether this was part of the rationale behind Dmitriy's alternative design. But unless the tasks are smallish (they aren't) and you have dozens of hardware threads (you probably don't), I wouldn't really worry about this yet (seems like premature optimisation at this point, and you know what somebody famous said about that).

(Affinity marks a task as a preferable target for a specific thread to steal instead of a randomly chosen one, and also tells that thread where to find it. Its normal purpose in TBB is to enhance cache locality.)

jimdempseyatthecove
Honored Contributor III
Quoting - Shankar

Oh ok. Is QuickThread available so that I can try my sample there?

First read the documentation on www.quickthreadprogramming.com
This is currently a Windows only solution (Linux is under development).

If your sample program is relatively simple, I can take a look at it for comments.
Use jim at quickthreadprogramming dot com.

Jim Dempsey

Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
#23 "Nope, in my design the IO thread immediately distributes tasks evenly over a pool of worker threads. So all worker threads may have work to do, and they may steal from each other."
Well, "in this case" means (mostly) the original design. In this case (still the same...), the only thing that workers could steal from each other is parts of a packet, but it would be better to steal a full packet from the I/O thread instead.

You answered to my post, and you mentioned N queues, so I was thinking that you were talking about my design. And I do not understand what the original design is.
Anyway, yes, IMHO, it would be better to get the full packet from the IO thread. That's how .NET TPL handles things. It has a separate queue for "root" tasks submitted from external threads. If a thread is out of work, it checks the root queue first. This ensures that if there are enough separate root "tasks", then all the workers will work on separate root "tasks" instead of helping each other. IMHO that's reasonable.
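The lookup order described for TPL can be sketched as a task-finding policy: own deque first, then a whole "root" task from the shared queue, then stealing a sub-task from another worker. All names here are hypothetical, and the victim scan is deterministic only to keep the sketch short (real schedulers pick steal victims randomly):

```cpp
#include <deque>
#include <optional>
#include <queue>
#include <vector>

using Task = int;  // stand-in for a real task object

struct Worker { std::deque<Task> local; };  // per-worker deque

// Find the next task for `self`:
// 1. own deque (newest first, LIFO),
// 2. shared root queue of whole submitted tasks (FIFO),
// 3. steal the oldest sub-task from another worker.
std::optional<Task> find_task(Worker& self,
                              std::queue<Task>& root_queue,
                              std::vector<Worker>& others) {
    if (!self.local.empty()) {
        Task t = self.local.back(); self.local.pop_back();
        return t;
    }
    if (!root_queue.empty()) {
        Task t = root_queue.front(); root_queue.pop();
        return t;
    }
    for (auto& w : others)
        if (!w.local.empty()) {
            Task t = w.local.front(); w.local.pop_front();
            return t;
        }
    return std::nullopt;  // out of work entirely
}
```

Checking the root queue before stealing is exactly what makes idle workers pick up a brand new whole packet in preference to helping with a partially decomposed one.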

Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
#24 "Sorry, I do not understand you."
Affinity mailboxes seem to offer exactly the qualities that you would look for to help workers find the most useful thing to do: a pointer to the task, and built-in overridability in case another thread wants to have a go. So why reimplement that outside of the scheduler...

Ah, I see, you seem to mean TBB's internal affinity mailboxes. Hmm, well, affinity mailboxes may improve the situation somewhat. However, note that other threads can't steal from mailboxes. So if a thread is out of work and its mailbox is empty, then it will probably steal a sub-task (a packet part) from another worker instead of picking a brand new whole packet.
And you still have to get around the IO thread blocking somehow...
Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
More like a normal workaround to bridge the blocking-I/O world and the computational world, IMHO. Besides, TBB uses this internally several times.

Ok, let's call it "workaround" instead of "hack"; I actually do not care too much about that.

RafSchietekat
Valued Contributor III
#36 "Anyway, yes, IMHO, it would be better to get the full packet from the IO thread."
I'm assuming that step 5 in the scheduling algorithm implies that TBB won't see those first, but then it also never says anything about only stealing at a greater depth than where it is now.

#37 "However note that other threads can't steal from mailboxes."
Open Source Reference.pdf 1.16 of 2009-07-03 says in 10.3.8 Affinity: "The id is a hint and may be ignored by the scheduler."

LIFO/FIFO is still a potential concern, however: some source comments say "mailbox queue" (FIFO), but the code itself uses names like push and pop (LIFO).

#38 "Ok, let's call it "workaround" instead of "hack", I actually do not care too much about that."
It was just confusing to read "In such situation you indeed better to somehow hack TBB task scheduler to do this.", which seemed to mean changing the library itself.
Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat
#37 "However note that other threads can't steal from mailboxes."
Open Source Reference.pdf 1.16 of 2009-07-03 says in 10.3.8 Affinity: "The id is a hint and may be ignored by the scheduler."

Sorry, of course these tasks can be stolen. I meant that only the mailbox owner gives priority to affinitized tasks; other threads will ignore the fact that a task is in a mailbox. So I guess that when a thread steals a task, in most cases it will steal a packet part instead of a whole packet.
So mailboxes do not give you global "steal priorities", they give you only local "steal priorities", so to say.

RafSchietekat
Valued Contributor III
Quoting - Dmitriy Vyukov
Sorry, of course these tasks can be stolen. I meant that only the mailbox owner gives priority to affinitized tasks; other threads will ignore the fact that a task is in a mailbox. So I guess that when a thread steals a task, in most cases it will steal a packet part instead of a whole packet.
So mailboxes do not give you global "steal priorities", they give you only local "steal priorities", so to say.
I agree, but, while performance may be suboptimal, it may also be a good enough return for the implementation price, at least initially. While processing is lagging, which is when optimal processing is most important, all workers are likely busy on their own assigned work; it's only when a few have caught up already that efficiency would drop somewhat.

I'm not disputing that, in general, the best solution for infinite workloads will likely have to sidestep the scheduler. Using a queue seems entirely appropriate in this case. But, if nobody exercises the scheduler for infinite workloads, where's the incentive to improve it? :-)