parallel task with several parallel stages

Walter_D_ · ‎03-25-2015

I have an algorithm that can be parallelised using the divide-and-conquer approach, but requires more than one parallel stage per task. In other wordsk, a given task can be split into several consecutive stages which cannot be done in parallel, but each stage can potentially be split into several parallel tasks. I have implemented this as sketced below.

struct staged_task
  : tbb::task
{
  const int num_stages=2;              ///< total number of parallel stages
  int stage=0;                         ///< stage counter
  staged_task(task_data const&);       ///< constructor
  bool little_work();                  ///< is this task best done serially?
  void serial_execution();             ///< do this task serially
  /// add another child to list of children
  void add_child(int&child_count, tbb::task_list&children, task_data const&data)
  {
    ++child_count;
    children.push_back(*new(allocate_child()) staged_task(data));
  }
  /// create child tasks by calling add_child()
  void make_children(int&child_count, tbb::task_list&children);
  /// execute
  tbb::task*execute() final
  {
    tbb::task*child=nullptr;
    // 1     serial execution
    if(0==stage && little_work())
      serial_execution();
    // 2     next non-empty parallel stage
    else if(stage<num_stages) {
      // 2.1 allocate children, skipping empty stages
      int child_count=0;
      tbb::task_list children;
      for(; stage<num_stages && child_count==0; ++stage)
	make_children(child_count,children);
      // 2.2 prepare for continuation and spawn children
      if(child_count) {
	recycle_as_continuation();
	set_ref_count(child_count);
	child = children.pop_front();
	spawn(children);
      }
    }
    return child;
  }
};

The real code is more complex with different types of staged tasks, but this is the essence. The code appears to work fine, but recently it ran into a problem when the number of active threads dropped to 2 (as judged by command top) and wallclock time was very long (much longer than 2 threads should have needed).

I was wondering whether my code above could be at fault. And/or otherwise how I can debug this issue.

RafSchietekat · ‎03-25-2015

I don't see anything pertaining to your question, but why are the stages spawned together (in the make_children for loop) for parallel execution (in "spawn(children);") even though you say that they are to be executed sequentially ("several consecutive stages which cannot be done in parallel")?

Walter_D_ · ‎03-25-2015

No, Raf, you got that quite wrong. The stages are not spawned together. make_children() only generates the (all possible) child tasks for the current stage. Hence, tbb::task_list children only ever contains child tasks of the same stage, or none.

Each call to execute() represents another non-empty stage. The for loop stops as soon as a stage has any child tasks, but skips empty stages (as indicated in the comments).

My question was whether the code in lines 32-37 is okay.

RafSchietekat · ‎03-25-2015

Oops... :-)

jiri · ‎04-02-2015

To me, the code looks OK. In fact, the way the children and the continuation are handled look like a textbook example. You use a continuation, rather than a safe continuation, but one of the children is returned from the task, which means that the reference count of the task cannot go down to 0 before it finishes. The reference count of the task is updated before the children are spawned. If I remember correctly, the recycle_as_continuation(); call basically just sets a flag that the task should be recycled, so the order of that call does not play a role.

The code example obviously leaves out some parts of the task management, so it's not clear how the stage_task tasks are created and spawned. However, this should not be able to limit the parallel execution of the spawned children. I assume there are enough of those children. TBB suspends unused worker threads, so if you only ever had two children (and no other tasks), you would not get more than roughly 200% CPU utilization in top.

Other potential issues limiting the parallelism I can think of:

1. some code called task_scheduler_init with a small number of threads. If this call is made multiple times, only the first one takes effect (until it is destroyed, then the thread count can change again), so even if you think you have set the number of threads to a higher value using the task_scheduler_init, it may in fact not be the case

2. some tasks have blocked in execute, effectively taking a thread from the TBB pool.

3. the CPU affinity mask of the process limits the number of threads that can be used. TBB checks the affinity and limits the number of worker threads accordingly.