Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Cores are "used up"

uj
Beginner
I've used the Advanced Task Programming example from the TBB book to create tasks that run in parallel alongside the main application task. I have two such tasks (one manages a Direct3D window and the other performs heavy-duty calculations). The main application task (in principle a Win32 GUI) controls the parallel tasks by sending commands via a producer-consumer queue (one tbb::concurrent_queue per parallel task). When a parallel task isn't doing any work it sleeps (it waits on a tbb::mutex).

The problem I'm having is that each parallel task seems to "consume" one processor core. I have 4 cores, but only 2 are available to do actual work in the main application task. I noticed this by first scanning a std::vector serially and then doing the exact same scan in parallel using a tbb::parallel_for. The parallel scan is only twice as fast as the serial one. It should be 4 times as fast, shouldn't it? That's why I suspect the two parallel tasks are "holding on" to one core each, leaving only 2 to do work in the main application task.

To test the above assumption I switched off one of the parallel tasks (the Direct3D window). And now the parallel scan is 3 times faster than the serial one. I then introduced a dummy parallel task so there's a total of 3 parallel tasks. Now the serial and the parallel scanning tasks are equally fast. With two dummy parallel tasks, that is a total of 4 parallel tasks, the application just hangs. There's no assertion or anything (from MS Visual Studio in debug mode). I can note that it's the last parallel task started that fails (because the Direct3D window never shows up, as it should if the parallel task associated with it were working).

This behavior makes me believe each parallel task is holding on to a core. According to the TBB book this isn't what should happen, is it? I've got the impression that all 4 cores would be available in every task.

This is my version of the example from the TBB book:

[cpp]
#ifndef MTHREAD2_MINTONCE
#define MTHREAD2_MINTONCE

#include "tbb/task.h"

namespace mthreads {

	class MThread2 { // to be privately inherited
	protected:
		~MThread2() {}
		MThread2() : process(new(tbb::task::allocate_root()) tbb::empty_task) {}

		virtual void run() = 0; // to be overridden; runs in a separate thread

		void start() {
			// ref count = number of children (1) + 1, so that wait_for_all() in
			// stop() does not return before the spawned child has finished
			process->set_ref_count(2);
			tbb::task* s(new(process->allocate_child()) Task(this));
			process->spawn(*s); // spawn without waiting; run() proceeds concurrently
		}

		void stop() {
			process->wait_for_all();    // blocks until the spawned child has returned
			process->destroy(*process); // release the root empty_task
		}

	private:
		MThread2(const MThread2&);            // non-copyable
		MThread2& operator=(const MThread2&);

		class Task : public tbb::task {
		public:
			~Task() {}
			explicit Task(MThread2* e) : enclosure(e) {}
		private:
			Task();
			Task(const Task&);
			Task& operator=(const Task&);

			tbb::task* execute() { // invoked by the TBB scheduler
				enclosure->run();
				return 0;
			}

			MThread2* const enclosure;
		};

		tbb::empty_task* const process; // root task anchoring the spawned child

	};

}
#endif
[/cpp]
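
For completeness, here is a rough sketch of how I drive such a worker from the GUI through its tbb::concurrent_queue. Everything beyond MThread2 itself (the D3DWorker class, the Command enum, handle(), the "MThread2.h" file name, and the use of concurrent_queue::try_pop) is simplified illustration rather than my actual code:

[cpp]
#include "tbb/concurrent_queue.h"
#include "MThread2.h" // assuming the wrapper above lives in this header

namespace mthreads {

	enum Command { Render, Recalculate, Quit }; // illustrative command codes

	class D3DWorker : private MThread2 {
	public:
		D3DWorker() { start(); }               // spawn the long-lived task
		~D3DWorker() { post(Quit); stop(); }   // ask it to finish, then wait for it
		void post(Command c) { commands.push(c); } // called from the GUI thread

	private:
		void run() {                           // executes inside the spawned task
			Command c;
			for (;;) {
				if (!commands.try_pop(c)) continue; // real code waits on a mutex when idle
				if (c == Quit) return;
				handle(c);                     // e.g. render one frame or run one calculation
			}
		}
		void handle(Command) { /* application-specific work */ }

		tbb::concurrent_queue<Command> commands;
	};

}
[/cpp]
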
uj
Beginner
Quoting - Raf Schietekat

>> "When would you say the running length of such a task becomes a problem?"

> See #3, first paragraph.

It's perfectly clear that for TBB to be useful you need to divide a program into tasks that can be executed in parallel, and the more tasks the better. That's obvious, and that's not what we're discussing.

You are claiming that a task that runs in parallel with the rest of a program for a long time is bad. What's more, you're denying that the Advanced Task Programming example, in opposition to you, holds the opposite view, namely that this is in fact a good thing.

I'm very confident that you're wrong. Contrary to what you claim, the running length of a TBB task is irrelevant to the TBB scheduling mechanism. It works equally well with shorter tasks used to break up an algorithm to run in parallel as it does with longer tasks created to exploit the overall parallelism of a whole program. And contrary to what you claim, this is exactly what the example from the TBB book states.

In short, to fully utilize TBB all kinds of parallelism should be used. There's absolutely no reason why a TBB task cannot run forever (if the rest of the program does that too). In fact a long-running task is much better than the alternative of using an ordinary thread, as long as you realize that the long-running task isn't an ordinary thread and are willing to accept its limitations. In addition, you're encouraged to further parallelize the long-running task by spawning more tasks from it - short-lived, long-lived and anything in-between.

If you feel my position is wrong, please give some technical evidence rather than keep repeating the "TBB tasks must be short-lived because that's the way it is" mantra.
uj
Beginner
Quoting - Alexey Kukanov

> We continue looking for a better solution for this kind of design challenges.

Well, I keep pushing it a little. Something has to give. :)

At least the Advanced Task Programming example should have a disclaimer discussing the pros and cons of this approach in relation to using ordinary threads.

If my position is wrong now maybe it becomes right in the near future? :)
RafSchietekat
Valued Contributor III
"Until anyone is coming up with firm evidence to the contrary I'm going to believe the example is showing good TBB usage."
At least we agree about something... but the devil is in (how you interpret) the details.

"I'm very confident that you're wrong."
How exciting!

"If you feel my position is wrong, please give some technical evidence rather than keep repeating the "TBB tasks must be short-lived because that's the way it is" mantra."
A long-lived task is perfectly fine if it is only long-lived together with its children; just realise that it doesn't take long for task overhead to amortise to zero (a second is already like an eternity). A long-lived task is tolerable if it could not be completely broken up but is doing useful work all the time; it would only be a big stroke of luck if the other hardware threads all find something else to do in the meantime, and one of the central ideas in TBB is to create lots of tasks as opportunities for useful parallelism. It is in breach of contract if it is merely waiting for other things to happen for longer than predictably short pauses; instead, you should use user threads in such a situation, e.g., through the tbb_thread interface. Is that a mantra? Additionally, your design is problematic because it requires concurrency: another reason to use a user thread instead. Of course you are free to ignore this advice if you think you see a good reason to do so, but you should probably add a note about that for the benefit of whoever has to maintain the program after you.
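
To make that concrete, here is a rough sketch of what I mean; the queue of int command codes, the 0-as-quit convention and command_loop are made up for illustration, and under current oneTBB a std::thread can play the same role as tbb_thread:

[cpp]
// Host the mostly-waiting command loop on a user thread instead of a TBB task,
// so that no TBB worker thread is tied up while nothing is happening.
#include "tbb/tbb_thread.h"
#include "tbb/concurrent_queue.h"

typedef tbb::concurrent_queue<int> CommandQueue;

void command_loop(CommandQueue* commands) {
	int c;
	for (;;) {
		if (!commands->try_pop(c)) {      // nothing to do: yield rather than burn a core
			tbb::this_tbb_thread::yield();
			continue;
		}
		if (c == 0) return;               // 0 serves as the quit command in this sketch
		// ... dispatch the command; the handler may still use parallel_for etc. ...
	}
}

int main() {
	CommandQueue commands;
	tbb::tbb_thread worker(command_loop, &commands); // a user thread, not a TBB task
	commands.push(0);                                // tell it to finish
	worker.join();
}
[/cpp]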

I hope this has been useful, but if you still disagree maybe you should ask somebody from Intel to settle the matter.
mtsr
Beginner
Quoting - uj
I'm very confident that you're wrong. Contrary to what you claim, the running length of a TBB task is irrelevant to the TBB scheduling mechanism. It works equally well with shorter tasks used to break up an algorithm to run in parallel as it does with longer tasks created to exploit the overall parallelism of a whole program. And contrary to what you claim, this is exactly what the example from the TBB book states.

The fault in your reasoning is that, unlike threads, tasks aren't timesliced. This means that if you have a never-ending task running, it's going to be started on one thread and stay there, never allowing anything else to happen on that thread and never allowing other threads to contribute to the work that's happening in that thread (unless it's spawning more tasks as well).

Now sure, as long as you have one fewer never-ending task than you have cores in your PC (including 'fake' HyperThreading cores), your program will still work, but as you noticed in your OP you run into trouble as soon as you have one never-ending task more. For example, with 4 continuous tasks running on a system with 4 cores, there will be 4 threads, each running one of these tasks, and any further tasks will just have to wait forever.

You also mentioned something along the lines of long-running tasks leading to more stable scheduling. While technically true (there's 1 less thread to schedule tasks to) there's really no reason not to run these tasks as independent threads and just decrease the number of available threads for the scheduler by one. The scheduler will not be able to schedule anything else on that thread anyway.

The whole point of TBB's approach is to subdivide big algorithms into nicely parallelizable tasks to let the scheduler keep all of the cores busy. One of the most important steps is to avoid blocking calls as much as possible, so that instead of waiting the thread can run another task in between. Your Direct3D window task sounds like the perfect example of what NOT to put in a single task. This task will be doing nothing while waiting for input, while waiting for the buffers to swap, etc. If you could make this into short, periodic tasks, all the waiting time could be filled with actual work.
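
Just to sketch the idea (UpdateBody, pump_messages and present_frame are made-up placeholders, not real APIs): keep the blocking parts in an ordinary loop and hand TBB only the short compute bursts:

[cpp]
// Each frame, the blocking waits (message pump, buffer swap) stay in a normal loop,
// and only the compute-heavy, non-blocking part is handed to TBB, so every TBB task
// is short-lived.
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include <cstddef>
#include <vector>

struct UpdateBody {                        // short, CPU-bound work per chunk
	std::vector<float>* particles;
	void operator()(const tbb::blocked_range<std::size_t>& r) const {
		for (std::size_t i = r.begin(); i != r.end(); ++i)
			(*particles)[i] += 0.016f;
	}
};

bool pump_messages() { return false; }     // placeholder for the Win32 message pump
void present_frame() {}                    // placeholder for the blocking swap/present

void frame_loop(std::vector<float>& particles) {
	UpdateBody body;
	body.particles = &particles;
	while (pump_messages()) {              // blocking waits happen outside any TBB task
		tbb::parallel_for(tbb::blocked_range<std::size_t>(0, particles.size()), body);
		present_frame();                   // TBB workers are free again by this point
	}
}
[/cpp]
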
turks
Beginner
Quoting - mtsr
unlike threads tasks aren't timesliced.
Yes. You might find our experience interesting and perhaps helpful. We had to alleviate a computation bottleneck in a large existing application, and we wound up combining TBB mechanisms with a traditional dedicated Windows thread.
I created a TBB pipeline inside a newly created Windows thread. The input of the pipeline comes from popping lines to be processed off a concurrent_queue object. Existing code, outside the thread, pushes lines to be processed into the queue. I switched to the bounded queue type, so if the queue is at capacity the push operation blocks internally in an efficient manner. Similarly, if processing inside the pipeline gets ahead so that the queue is empty, the pop operation blocks. We limit the maximum number of objects in the pipeline to 8 (actually to the number of hardware cores found), as in Reinders' pipeline example.

Thus, though the TBB task threads do not get timesliced, since the whole pipeline is in a separate thread which DOES get timesliced, other processes run simultaneously.

It did take some programming help to get this running, namely that the task_scheduler_init has to be created inside the special thread even though the concurrent_queue is created outside the thread. Also, the input and output filters of the pipeline (both serial) must access existing code outside the thread. The TBB pipeline ensures that unprocessed lines are received in order, processed fully in parallel, and the results written serially in sequential order.
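
In compressed form, the arrangement looks roughly like the following. I have written it against the newer parallel_pipeline interface rather than the classic tbb::pipeline we actually use, and LineQueue, process_line and emit_result are made-up names, so treat it as a sketch of the structure, not our production code:

[cpp]
#include "tbb/concurrent_queue.h"
#include "tbb/parallel_pipeline.h"
#include <cstddef>
#include <string>
#include <thread>

typedef tbb::concurrent_bounded_queue<std::string> LineQueue;

std::string process_line(const std::string& s) { return s; } // the parallel "crunching" stage
void emit_result(const std::string&) {}                      // serial, in-order output stage

// Runs inside the dedicated thread; recent TBB versions need no explicit
// task_scheduler_init here.
void pipeline_thread(LineQueue* input, std::size_t tokens) {
	tbb::parallel_pipeline(tokens,
		tbb::make_filter<void, std::string>(tbb::filter_mode::serial_in_order,
			[input](tbb::flow_control& fc) -> std::string {
				std::string line;
				input->pop(line);             // blocks efficiently while the queue is empty
				if (line.empty()) fc.stop();  // an empty line serves as the end marker here
				return line;
			}) &
		tbb::make_filter<std::string, std::string>(tbb::filter_mode::parallel,
			process_line) &
		tbb::make_filter<std::string, void>(tbb::filter_mode::serial_in_order,
			emit_result));
}

int main() {
	LineQueue lines;
	lines.set_capacity(8);                    // producer blocks once 8 lines are pending
	std::thread host(pipeline_thread, &lines, std::size_t(8)); // the "pipeline thread"
	lines.push("a line to process");
	lines.push("");                           // end marker
	host.join();
}
[/cpp]
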
Vivek_Rajagopalan
Quoting - turks
Quoting - mtsr
unlike threads tasks aren't timesliced.
Yes. You might find our experience interesting and perhaps helpful. We had to alleviate a computation bottleneck in a large existing application, and we wound up combining TBB mechanisms with a traditional dedicated Windows thread.


Thanks turks, I certainly found your experience interesting. I am also setting up a similar arrangement.

Pipeline 1 : Hosted in a tbb::tbb_thread

Processes packet data and generates messages

Pipeline 2 : Hosted in another tbb::tbb_thread

Processes the messages generated by pipeline 1.

What I found was:
1. I could have merged the two pipelines into one, but the filters inside the two pipelines work on completely different data. So this would break the main advantage the pipeline offers, which is that the work stays in place while the workers move. This is something to consider.

2. There has to be enough work for the filters in each work item, so I batched the input, say 200-500 work packets at a time (see the sketch after point 4 below). I don't know if this helps, but I kept the total size of each work item at about 1/2 the available cache.

>> Thus, though the TBB task threads do not get timesliced, since the whole pipeline is in a separate thread which DOES get timesliced, other processes run simultaneously.>>

3. That is interesting. I thought that hosting a pipeline inside a tbb_thread did not change its behavior significantly. Tasks are still mapped to the same internal TBB threads. Once a task (in our case a tbb::filter) grabs a TBB internal thread, it does not let go until done.

4. I think TBB 2.2 calls task_scheduler_init automatically. Is it better to just let TBB handle the initialization the way it sees fit?
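
Here is roughly what I mean by the batching in point 2; Packet, Batch, read_packet and the figure of 300 are only illustrative:

[cpp]
// Collect a few hundred packets into one work item before pushing it to the
// pipeline's input queue, so that each filter invocation has enough work to do.
#include "tbb/concurrent_queue.h"
#include <cstddef>
#include <vector>

struct Packet { char data[64]; };
typedef std::vector<Packet> Batch;

const std::size_t kBatchSize = 300;            // somewhere in the 200-500 range

bool read_packet(Packet&) { return false; }    // placeholder for the real capture routine

void producer(tbb::concurrent_bounded_queue<Batch>& queue) {
	Batch batch;
	batch.reserve(kBatchSize);
	Packet p;
	while (read_packet(p)) {
		batch.push_back(p);
		if (batch.size() == kBatchSize) {
			queue.push(batch);                 // one queue item = one batch of packets
			batch.clear();
		}
	}
	if (!batch.empty()) queue.push(batch);     // flush the final partial batch
}
[/cpp]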


Thanks again for sharing your experience. This style of programming is so new that it is hard to find help on the internet. Well, maybe in 5 years...



RafSchietekat
Valued Contributor III
"1. I could have merged the two pipelines into one, but the filters inside the two pipelines work on completely different data. So this would break the main advantage the pipeline offers, which is that the work stays in place while the workers move. This is something to consider."
You might do well to challenge that assumption (and then tell us about your findings).

"3. That is interesting. I thought that hosting a pipeline inside a tbb_thread did not change its behavior significantly. Tasks are still mapped to the same internal TBB threads. Once a task (in our case a tbb::filter) grabs a TBB internal thread, it does not let go until done."
I think you thought correctly.
turks
Beginner
First off, Raf is the one with TBB experience. This is my first use of TBB and in fact Raf helped us get our app up and running. (Tnx)

My app was designed using the much earlier TBB 2.0, which I believe does not have "tbb_thread". My app uses a regular CreateThread() for the pipeline thread.

My inclination is that the use of two pipelines is preferable since they are operating on two logically different streams of data. Since they are all in the same process, even though you need a separate task_scheduler_init object in each thread, the internal subthreads are managed as a group by TBB. I think the coding is certainly easier and cleaner with separate pipelines.

Regarding the second point, that hosting a pipeline inside another thread does not change its tendency to not let go, Raf may be right.
1. My case uses a Windows thread, not a tbb_thread, and that may make a difference.
2. We are seeing a reduction in scalability if we start multiple instances of our app. This is another TBB discussion topic ("Many Simultaneous Task Schedulers"). One app never uses less than 40% CPU even on simple jobs where the queue pushes/pops are most often in a blocked state. Four separate apps never achieve more than 2.5x throughput, even on an 8-core machine.

Our big benefit is for a single app running a complex job, for which the pipeline mostly remains full. Inside the TBB pipeline, 8 cores get used instead of one for all the crunching, resulting in a huge speed-up, limited only by Amdahl's law. (The serial part(s) limit overall throughput.)
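
(For reference, Amdahl's law puts that ceiling at speedup = 1 / (s + (1 - s)/N) for a serial fraction s on N cores; with, say, s = 0.1 and N = 8 that works out to about 4.7x. The 0.1 is only an illustrative figure, not a measurement from our app.)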

I am going to investigate whether use of tbb_thread instead of CreateThread has any advantage or disadvantage in our application.

RafSchietekat
Valued Contributor III
"I am going to investigate whether use of tbb_thread instead of CreateThread has any advantage or disadvantage in our application."
A thread by any other API would run as fast.