Re: Sleeping Tasks And Waiting For Another Task To Wakeup

AJ13 · ‎08-06-2007

Hey all,

I've bought the Intel TBB book from O'Reilly, and have read the section on the Task Scheduler a couple of times (preparing to read it once again after this post). I'm somewhat stuck on one point...

My understanding is that a task cannot block, otherwise other tasks won't be able to execute. This makes sense to me. However, in my application I would like to have a task wait for some event to occur (non I/O), i.e. I want the task to sleep until some event occurs to reawaken it. This does not seem possible with TBB.

For example, say I have three classes: A,B,C. I want to say that A is going to "pause" and wait for some signal from C. I don't care what B does... B can continue to execute. My current understanding is that once A "pauses" nothing else is going to execute...

The book somewhat addresses this by saying that the producer/consumer model is best implemented with Pipelines rather than with tasks.... maybe I'm not thinking the right way yet. I don't think Pipelines will solve my problem...

I may be missing something, and I really do think I need to reread the chapter a couple more times to "get it" but are there any suggestions?

AJ

branan · ‎08-06-2007

Here's what I'd do:

1) Split task A at the spot where it should wait. Call the first part A1, and the second A2

2) At the end of A1, create an OS thread (one not managed by TBB) to wait for the signal from thread C

3) When the OS thread gets the signal, create a new TBB task A2

This will still work if you're waiting inside of a loop, it'll just take some creative coding to implement (I'll write up the pseudo-code if you need it, I don't have the exact details worked out yet)

AJ13 · ‎08-06-2007

Thank you for the response.

By "OS Thread" do you mean calling C functions to create a thread directly, in an OS-specific way?

I think that I might have to look at the code for TBB to understand how things work to really answer this question. It seems to me that each "worker thread" is going to have its own stack and instruction pointer, however I don't exactly understand how tasks are implemented... if each task had its own stack and instruction pointer, then that task could be "paused" ... but I suspect that tasks don't work that way. I'll read some more, and perhaps investigate the code with this question in mind... there is something that I don't understand yet.

AJ13 · ‎08-06-2007

Okay, after some code investigation, pondering and tea, I have a better grasp on the real nature of the problem.

I could do something like this:

class A : public task
{

task* execute()
{
// Make a task that just waits
WaitTask& waiter = *new( allocate_child() ) WaitTask();
spawn_and_wait_for_all(waiter);

return NULL;
}

}

class WaitTask : public task
{
task* execute()
{
// Not sure what goes here
}
}

Now, from what I've read I know that when a task is "waiting" for another task to complete, that the worker thread will steal tasks from other worker threads, according to some rules. The trick is, what does WaitTask do that's not just busy waiting, and not OS dependent... once WaitTask completes, the parent A will wake up and I have solved my problem.

The purporse of all this, by the way, is so that I can implement a discrete event simulation framework with TBB... and I basically need to be able to advance the simulation clock forward...

AJ

Alexey-Kukanov · ‎08-07-2007

aj.guillon@gmail.com:
The trick is, what does WaitTask do that's not just busy waiting, and not OS dependent... once WaitTask completes, the parent A will wake up and I have solved my problem.

If the job of WaitTask::execute() is not busy waiting or sleeping on an OS primitive but a sort of real job, and all you actually need is a barrier in A::execute() to separate pre- and post-processing, then yes, the above solution may work for you. Be aware, however,that TBB tasks are not interrupted or preempted by other tasks unless you explicitly call a wait_for_all method like in A (which effectively starts another task dispatch loop in the thread that executes A). So once the WaitTask::execute() method started, a thread that took the WaitTask will spend time running it. It is not quite possible to check some condition from time to time while doing useful work on other tasks in between.

Now I will try to answer some of your general questions above:

aj.guillon@gmail.com:
My understanding is that a task cannot block, otherwise other tasks won't be able to execute. ... I want to say that A is going to "pause" and wait for some signal from C. I don't care what B does... B can continue to execute. My current understanding is that once A "pauses" nothing else is going to execute...

More precisely, if a task blocks, a thread running the task blocks and is unable to execute other tasks. Other worker threads, if any, continue their work while there are spawned tasks. In your case above, B and C will continue execution if only there is another thread to run it. Only on a system with only one single-core CPU it will block forever (unless you explicitly specified more than one thread in tbb initialization call).

aj.guillon@gmail.com:
It seems to me that each "worker thread" is going to have its own stack and instruction pointer, however I don't exactly understand how tasks are implemented... if each task had its own stack and instruction pointer, then that task could be "paused" ... but I suspect that tasks don't work that way.

You are right. Of course, each worker thread is an OS supported thread and it has its stack and an IP. TBB tasks do not have a stack or an IP; they are light-weght, user-level C++ objects that expose a method to do the "job", and this method is called by a thread. See above regarding pausing inside a task.

AJ13 · ‎08-07-2007

Here is an example of exactly what I am attempting to do. Say I'm doing a simulation of a factory, which has an assembly line producing parts and I have a forklift which picks up those parts and does something with them. Here's the kinda code I would like to be able to write:

class AssemblyLine : public SimulationEntity

{

void makePart()

{

for(;;)

{

delay(3);

// Add a made part to a queue

// Signal the forklift

signalForklift();

}

};

class Forklift : public SimulationEntity

{

void signalledPartMade()

{

// When there was a part made, go to the

// producer

travel_to(producer);

// Pick up the part

}

void travel_to(Location x)

{

// Look up how long it takes to go to x,

// and say put it in travelTime

delay(travelTime);

// Now pick up the part, and perform

// some actions.

}

};

The delay() function is what I would like to find a solution for. Ideally I would like that 'delay' function to pause execution (i.e. wait for some child thread to return). After the child thread has returned, it can go on performing normal processing.

There would be a master scheduler, which would be able to wakeup tasks when the simulation clock has advanced far enough.

Any suggestions as to how this can be accomplished with TBB?

AJ13 · ‎08-09-2007

I expected TBB to be a complete threading library wrapper, but it doesn't seem like what I want to do is directly possible with TBB as is. TBB doesn't seem to expose the low level details required to solve my problem. Instead, it looks like the best solution (so far) is to create my own user-space threads for each entity which is to be simulated, and each of those entities will themselves use TBB for their computationally intense tasks... I'm not sure how efficient that would be, but that's the best solution I've come up with so far...

ARCH_R_Intel · ‎08-24-2007

[I was out on an 8 week sabbatical this summer, hence the delayed reply.]

From the outset, TBB was never intended to be a general threading wrapper. The world doesn't need another one of those. I once googled "xThreads" for x in {A,B,C,...}. It's amazing how many thread wrappers are out there :-)

General threading has to serve many purposes, such as hiding I/O latency, not only computational speedup. TBB targets only computational speedup, because that's where we thought we could offer significant advantages over existing threading packages. Furthermore, we only put stuff in TBB if it seems to offer significant benefit. For example, there is no concurrent priority queue class in TBB yet, because every time someone here has implemented one of the algorithms from the literature, they have found it is usually slower than simply using a lock around an STL priority queue. (This situation may change as hardware evolves.)

Your problem is one that I wouldlike to have a TBB solution that is easy to use, scalable, implementable without a special compiler/language, and is a significant improvement over using raw threads.But I don't know of any. The closest involves transforming the problem into a representation dependent on continuation-passing style (whichwhile efficient, is a pain to write!).

Because TBB isnot a general threading package, we've worked hardtoenable users to mix it with other threading packages. So your proposal to use user-space threads and TBB for the computationally intensive parts is what we expect users to do.

Parallel simulation is tricky stuff that a lot of smart people have worked on. There are new languages that might be a better fit for your problem. E.g, PARSEC (http://pcl.cs.ucla.edu/projects/parsec/) is a parallel simulation language. I don't know if the implementation is parallel or not. Another candidate is Erlang (http://www.erlang.org/), which supports lightweight concurrent processes. A commercial package for parallel simulation is offered by WarpIV Technologies (http://www.warpiv.com/). See http://www.warpiv.com/files/11948438.pdffor how that system uses "time warp" (rollbacks and antimessages) to get better speedup. [Disclaimer: I'm not endorsing any of the aforementioned systems.]

bjoernknafla · ‎08-25-2007

There __might__ be a way to use TBB for your problem: rethink your problem. The way you think is that there are entities (the forklift, the queue) that are more or less a thread (or an actor in a different parallel programming term). As already stated, a language like Erlang or the scripting language Io could perhaps help you with your view of the problem.

Current view: there are many independent entities that create or consume events and I want to describe my entities like they are independent objects/actors.

"TBB" view: there are a lot of entities in my simulation and they are build from a set of stages (consume event stage, compute based on events, create event stage, etc). Now you would first run the "consume events" stage for all your entities in parallel (for example using parallel_for). Then You would run the "compute" stage of all your entities in parallel and afterwards you run the "create events" stage in parallel for all entities.

This needs more memory because you have to buffer entity specific stage results after each stage to feed the next stage.

In the event checking stage you are also running a lot of code to check for events though it would be nicer to react to events instead of checking for a lot of events that haven't happened. An "event registry" might help then. Only the registry checks for an event. If an event has happened it adds the event checking functor of all entities that have registered with it into the range to be run in parallel by the next parallel stage.

This was just a quick idea - I am not really sure if this is of help to you.

Cheers,
Bjoern

AJ13 · ‎08-26-2007

Hi,

I've been working on this problem for a while now (since my first post). I did find a solution, and I've actually implemented a lot of the code for it already (I was waiting until I was finished to really explain how I solved the problem). In a couple of days when I've completed my implementation, I'll post a link to the complete code which is available under GPLv3.

Okay, I'm going to be informal to explain this best. Basically, I was really thinking the wrong way, and it is easier to go back to "old thinking" than to really see a problem in a new light...

The real problem, is the stack model of C++. Something like Stackless Python, or even just Python looks like it would be a good solution. In order to be able to get something like coroutines, you could use user-space threads as I initially thought out load. But I'm insistent for a C++ solution.

Suppose that I take some logic that has "delays" (which really would normally wake up the scheduler to pick the next task) like so:

void func() {


// CODE HERE

delay(a);


// CODE HERE

delay(b);


// CODE HERE

delay(c);

Well, the first kinda realization I had, was that I could do this:

void func()
{
// insert blocks with time delays
scheduler.insert(firstCodeBlock(), a);
scheduler.insert(secondCodeBlock(), b);
scheduler.insert(thirdCodeBlock(), c);

}

class firstCodeBlock()
{
void operator()()
{
// CODE HERE
}
};

class secondCodeBlock()
{
void operator()()
{
// CODE HERE
}
};

class thirdCodeBlock()
{
void operator()()
{
// CODE HERE
}
};

So each time there is a delay, or a wait for signal, I could really separate those blocks out. This would cause simulation writers to want to cry. I needed a way to use such a design, and a way that simulation writers could use this and not feel awkward.

The solution I came up with, was to instead have the user represent this logic as a state machine. I imagine it would be easier for a simulation writer to think only in terms of states and transitions, and then encode those states into something that my simulation engine could use. Valid states are pretty much taken from UML, and include: Start State, Final State, State, Fork Node, Join Node, Error State. Transitions from one node to another can either be unguarded (always taken), or guarded (only taken if the guard returns true).

The result, is code that looks like this:

 StateMachine machine("mymachine");

 NodePtr fork1 = machine.createStartNode("A")->addStateNode("B", noop, noop)->addForkNode("Fork1");

 NodePtr nodeC = fork1->addStateNode("C", noop, noop);


 nodeC->addTransitionTo(nodeC, p);


 NodePtr nodeD = nodeC->addStateNode("D", noop, noop, not_p);

 nodeD->addFinalNode("E", f);

 NodePtr join1 = nodeD->addJoinNode("Join1", not_f);


 NodePtr nodeF = join1->addStateNode("F", noop, noop);

 NodePtr join2 = nodeF->addJoinNode("Join2");

 join2->ad
dFinalNode("Final2");


 fork1->addStateNode("H", noop, noop)->addTransitionTo(join1);


 NodePtr nodeI = fork1->addStateNode("I", noop, noop);


 nodeI->addTransitionTo(join1, a);

 nodeI->addStateNode("J", noop, noop, not_a)->addTransitionTo(join2);

As I said, in a couple days I will post links to the code, but basically the second argument to most add node function is the guard, which is a function that must evaluate to true for that transition to occur.

In this design, there are special guards which are interpreted by the state machine execution model to mean "delay" or "wait for signal." As well, state machines which use Fork and Join have threads which execute in parallel (these segments become TBB tasks and are scheduled for execution).

So there is the solution I have in a nutshell, minus a lot of implementation details. After a second look, TBB doesn't seem to need low level interfaces for a good solution. There might be some serious problems with this approach, one of which is that I think some things are not easily represented by a state machine. In that case, a simulation writer may of course always revert back to implementing complex logic in the states themselves.

Let me know what you think. There is definitely higher overhead with this model, and I haven't done a detailed analysis of complexity of some of the algorithms I am using, however others who are smarter than me will have better ideas for those.

AJ

ARCH_R_Intel · ‎08-27-2007

Yes, the execution stack in C++ is part of the problem, and indeed a root limitation of TBB or anything like it.

Your state machine model might be a good solution, but I'm not an expert in parallel simulation. Be aware that the people who invented Time Warp found that straight forward simluation of time often encounters speedup limitations.

AJ13 · ‎08-27-2007

I'm not an expert at parallel simulation either, this is a project out of interests and is something that as I continue research, could become more powerful and useful as a utility.

Realistically, a time-based simulation will in fact encounter a bottle neck, because each time event that is unique has to be processed serially. If multiple events are registered for the same time, benefits of parallelization could be realized, otherwise code becomes serial.

I believe that it is possible to analyze dependencies of states, and execute ones which are not dependent upon each other in parallel. A transactional feature could also be useful, where results could be rolled back. However, these features can be investigated with the parallel state machine base using TBB.

AJ13 · ‎09-11-2007

See www.yetisim.org for posted sources. I have not yet provided anonymous SVN access on the site, and the implementation is not 100% complete yet... however some code is up, and it might help others with similar problems.

AJ