Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Stealing work from a sleeping thread

Jonathan_J_
Beginner
741 Views

I'm fairly new to TBB, although I have done some stuff with the basic parallel_ constructs (parallel_for, parallel_do, etc).  I know TBB operates on a "task-stealing" concept.  I have a list of tasks that I want to be done in parallel, but they must be synced with real-time at certain time frame intervals.  For example, if i have 20 tasks, and the time frame interval is 20 ms, each task will perform its work that needs to be done for the next 20 ms of the system, then go to sleep.  So if i have 12 threads (6 cores plus hyperthreading), i would like the first 12 tasks to fire off, then go to sleep when done, then when they have gone to sleep, for the other 8 tasks to jump in and steal those threads and do their work.  I was hoping the code below would do the trick, but found that the last 8 tasks would not start until the first 12 tasks were done sleeping.  How can i make this work using TBB's task-stealing?

[cpp]

#include "tbb/parallel_for.h"

using namespace tbb;

void doWork(int threadId) {
     for(int i = 0; i < 10; i++) {
          std::cout << "Did work #" << i << " on thread id: "<< threadId << std::endl;
     }
}

class ParallelThreads {
public:
void operator()(const blocked_range<size_t>& r) const {
     for(size_t i = r.begin(); i != r.end(); ++i) {
          doWork(i);
          sleep(10);
          std::cout << "finished work on thread " << i << std::endl;
     }
}
ParallelThreads()
{}
};

int main() {
     tbb::parallel_for(tbb::blocked_range<size_t>(0, 20), ParallelThreads());
     return 0;
}

[/cpp]

0 Kudos
10 Replies
jimdempseyatthecove
Honored Contributor III
741 Views

When you place a Sleep(n) in a task, the thread sleeps and is unusable for other purposes (other tasks). Refrain from using Sleep in tasks. Instead, simply end the task.

Simplified example where your application is only processing data at these 20ms intervals

// in main code
for(;;) { // for ever loop
waitForDataAt20msInterval(0);
tbb::parallel_construct_here(...);
}

Only the main thread waits (maybe with Sleep).

In a different situation, where there is other work to be done, consider spawning a non-TBB thread to collect the data (and wait/sleep) between intervals

// in main code
_beginthread(YourDataReaderFunction,0,NULL);
for(;;) { // for ever loop
if(YourHaveDataIndicator) {
tbb::parallel_construct_here(...);
}
// fall through and do other work here
}

Essentially you insert a polling of your data collection thread. This polling can be done by any task if the design requires it (but make the poling thread-safe).

A third alternative is to oversubscribe by one thread and use a task (that never completes) that collects the data and fires off subsequent tasks. Caution, if the data collection task participates in the processing (e.g. parallel_for) then it may be busy at the time of arival of next data.

Jim Dempsey

0 Kudos
RafSchietekat
Valued Contributor III
741 Views

With a name like "stealing", it is bound to happen that users interpret this as stealing a resource (the thread) from another interested party (a task). But it's the other way around: threads "steal" tasks from other threads.

For that to work, tasks should not attempt to stick around for a long time, except waiting for multiple descendants to finish their work (one way to express optional parallelism), or getting recycled (an optimisation technique).

If you don't find a prepacked higher-level algorithm to do what you want to do, you can use tasks directly of course. But even then they don't have to be used to persist information: you can also use other objects for that, and only let the tasks be the way to invoke those other objects, and push them to the back of a queue for another round later on. A single thread can spawn or enqueue the right number of tasks at regular intervals, and that may suffice for your purpose.

(Added) It seems a race was going on, and Jim won (by a nose). :-)

0 Kudos
Jonathan_J_
Beginner
741 Views

Appreciate the quick responses; i should have explained more about the problem, although your answers did help clarify a few things.  It's possible that what I hoping for is impossible using TBB constructs.

The parallel-running tasks that report data every 20 ms (for example), need to await updates from the main controller before they continue working on their next 20 ms frame.  To simplify the example, lets say I only have 2 threads, and 4 tasks.

  • Task A: 2 ms
  • Task B: 4 ms
  • Task C: 8 ms
  • Task D: 12 ms
  • The program starts; it kicks of Task A and B on Threads 1 and 2, respectively
  • At time 2 ms, Task A finishes and reports its data to the main thread; Task C then kicks off on the same thread
  • At time 4 ms, Task B finishes and reports its data to the main thread, Task D then kicks off on the same thread
  • At time 10 ms, Task C finishes and reports its data to the main thread. The main thread does some bookkeeping and realizes there are no tasks left that have not started processing, but that other threads are still processing. As a result, it puts Thread 1 to sleep for the remaining 10 ms of the 20 ms frame
  • At time 16 ms, Task D finishes and reports its data to the main thread. The main thread does some bookkeeping and realizes all tasks are done. As a result, it puts Thread 2 to sleep for the remaining 4 ms of the 20 ms frame, and updates the task objects with info needed to complete the next frame
  • At time 20 ms, both threads wake up, and the cycle repeats itself given the new information from the main thread

Perhaps it is due to my relative inexperience, but the I don't think the solutions you proposed would allow for that sort of feedback; Tasks A, B, C, and D would all just finish and then keep processing, instead of sleeping for the remainder of the 20 ms.

0 Kudos
jimdempseyatthecove
Honored Contributor III
741 Views

Sketch code:

for(;;) { // your process loop
waitForInput();
parallel_invoke(taskA, taskB);
} // for(;;)

taskA() { // this can be a Lambda function
doWorkTaskA();
parallel_invoke(formerMainThreadActionOnCompletionOfA, taskC);
} // taskA

taskB() { // this can be a Lambda function
doWorkTaskB();
parallel_invoke(formerMainThreadActionOnCompletionOfB, taskD);
} // taskB

taskC() { doWorkTaskC(); }

taskD() { doWorkTaskD(); }

Jim Dempsey

0 Kudos
Jonathan_J_
Beginner
741 Views

Is there any way to do that more generically? If i just have some array of tasks, that need to be completed in no particular order? Is this something I could do with the task_scheduler? 

Sorry if i misled you with my example, i was just trying to simplify my idea of what the execution would look like, not provide the actual model of the software i am attempting to implement.

If it is helpful, here are more exact specifications (i left this out earlier in an attempt to keep things simple):

We have an undefined amount of tasks, call them Simulations, stored in an array. The amount of simulations comes in as a program argument. A time frame is also passed in as a program argument. Each simulation is started at time 0 by the simulation manager (the manager remains on a separate thread, the main thread).  When the simulation has done the amount of work specified by the time frame, it reports back to the simulation manager and says, "I'm done for that frame".  It will then wait until the frame is over. Once the frame is completed (this is determined by a timer on the manager), it finishes waiting, and the simulations continues onward until it finishes the next time-frame's-worth of work, etc etc.

In a pthread implementation, it looks similar to

[cpp]

Manager.cpp
callbackFunc(Simulation_c *simPtr){
  pthread_mutex_lock(&waitingSimsMutex);
  simsLeftToFinish--; //keeps track of number of sims that have not reported yet

  //keep track of who broadcasted so that they dont wait for themselves
  bool broadcaster = false;

  if(simsLeftToFinish == 0) {
   broadcaster = true;
   pthread_cond_broadcast(&waitingSimsThresholdCv);
   simsLeftToFinish = numSims; //reset simsToFinish

   //lose the lock
   pthread_mutex_unlock(&waitingSimsMutex);
  }

  if(!broadcaster) {
    pthread_cond_wait(&waitingSimsThresholdCv, &waitingSimsMutex);
    pthread_mutex_unlock(&waitingSimsMutex);
  } 

  simPtr->updateWithInfo(info); //left out details of how we get info; irrelevant

  //left out details of how we determine time to sleep until next frame
  nanosleep(timeLeftInFrame);
}

main{
  for(int i = 0; i < numSims; i++) {
     Simulation_c *simPtr = simulations;
     pthread_create(&threads, &attr, &simThreadExecute, (void *)simPtr); //simThreadExecute simply says simPtr->Run
  }

Simulation_c.cpp
Run() { //called from simThreadExecute above

while(isRunning) {
  doSomeWorkAndCollectSomeData();
  manager.callbackFunc(this); //this callback is the one above, that provides proper data to Simulation_c for next time frame and puts it to sleep
  //here is where the task/simulation would wake up and loop back for the next time frame
} //while(isRunning) 

[/cpp]

0 Kudos
jimdempseyatthecove
Honored Contributor III
741 Views

Let's see if I have your description strait in my mind:

a) you have n different simulations (determined at run time)
b) each of the simulations runs multiple frames, where each frame of each simulation are in somewhat of a lock-step with the same frame number of the other simulations.
c) all simulations must finish the sequenced frame on the time interval listed.

t0 = yourTimeFunctionNow();
while(!done) {
parallel_for(blocked_range<int>(0,numSims,1), doFrame); // adjust chunk=1
t1 = yourTimeFunctionNow();
while(t1 - t0 < waitTime)
   someSleepFunc(waitTime - (t1 - t0));
t0 += waitTime;
}

Where doFrame is an object with an operator()(const blocked_range<int>& range) const function that processes one frame of one or more simulations.

Jim Dempsey

0 Kudos
RafSchietekat
Valued Contributor III
741 Views

There might be a problem here with not disconnecting the what and the how. Forget about threads: the simulations don't need one for themselves, and even the manager doesn't need one. The simulations don't have to be in an array. That's all technical design. As it happens, the array could be a good choice: it's easy to use that with the prepackaged parallel_for(), without any need to deal with tasks directly, as Jim has shown. Such a program will work on a uniprocessor, or one with infinite parallelism, all without preemptive scheduling, and you wouldn't even need to know.

Maybe there's one thing still to say about this, unless it's just a red herring. If those simulations have widely different timings, and you have to stick to an imposed tight cadence, you may have to find a way to schedule them to a specific number of threads anyway if otherwise there's a danger that one worker thread, e.g., ends up with two long simulations that don't fit together in the duration of a single frame. But that may be as simple as having the simulations time themselves in one round and using that information to negotiate their position in a priority queue for the next round (longest simulation first). It's normally suboptimal to have a single point from which to get the work (the head of that priority queue), but sometimes you've got to do what you've got to do. Unless those long simulations have meaningful internal potential parallelism, because expressing that may help TBB to divide the work more evenly, maybe enough to not have to bother with a queue.

0 Kudos
jimdempseyatthecove
Honored Contributor III
741 Views

>> there's a danger that one worker thread, e.g., ends up with two long simulations that don't fit together in the duration of a single frame

This is why I set the partitioning point down to 1 under the assumption that the computation for each simulation's frame is relatively long and not necessarily the same.

Jim Dempsey

0 Kudos
Jonathan_J_
Beginner
741 Views

Jim/Raf:

Appreciate both of your quick responses and helpful input. I was able to successfully implement jim's pseudocode today and got great results.  Part of the reason it took me so long to come around to this solution was because I was attempting to modify existing software, which was built to operate as a single instance, and therefore was not well-suited to act on a frame-by-frame loop.  Raf, thanks for getting me to disconnect the what and the how; made me realize that i needed restructuring in order to get things to fit in Jim's pseudocode.  

Additionally, I'm attempting to run this on a Xeon Phi and found the sleep times to be wildly off (I tell it to sleep for 5 ms, it sleeps for 20ms).  That's a whole separate issue, but I raised that another forum post for it here: http://software.intel.com/en-us/forums/topic/401479.  Not sure if that is either of your areas of expertise.

Lastly, how can I mark this thread as closed? I don't see any way to do that...

0 Kudos
jimdempseyatthecove
Honored Contributor III
741 Views

Treat Forum threads like old soldiers - "Old soldiers never die, they just fade away." Gen. Douglas McArthur.

The other stumbling block of converting old code is conceptualizing (adapting to) Tasks as opposed to Threads. This becomes relatively simple after you do a few conversions. Don't completely discount threads for Task as seperate threads may be more suitable for I/O.

Jim Dempsey

0 Kudos
Reply