Solved: Barrier Synchronization implementation

knmaheshy2k · ‎10-12-2009

I need to implement a template function of this type. Functions 1 to n has to be executed in parallel on multiple cores. Any suggestions on implementing it the most optimized way???

For(i=1 to 100)
{

Execute Function1;

Execute Function2;

Execute Function3;

Execute Function4;

.

Execute Function n-3;

Execute Function n-2;

Execute Function n-1;

Execute Function n;

//Barrier Synchronization
Wait till all n functions are completed
}

I did see parallel_invoke. But they have a limit on number of functions (10) that can be evaluated in parallel (http://www.ddj.com/go-parallel/article/showArticle.jhtml?articleID=216402628).

Alexey-Kukanov · ‎10-13-2009

Other suggestions:
- similarly to the task list approach suggested byRobert, you may use task_group class which has methods to start a task and to wait for all running tasks belonging to the task_group.
- if signatures of all functions are similar (i.e. contain the same number and types of arguments), you might put pointers to functions to an array, and run parallel_for over this array, calling your functions (via pointers) in parallel_for body.

The second suggestion, if possible to implement, could be somewhat more efficient, because the tasks to run will be produced by multiple threads, and stealing will be limited. In the task_group based solution, all tasks will be produced by the same thread and so all other threads will have to constantly steal a single task to execute.

But I also have a concern that parallelizing on the inner level likely has limited potential for scalability & speedup. May be reworking an algorithm to reduce or eliminate the need for barrier synchronization between outer loops would create additional parallelism. I do not know of course whether it is possible in your case.

View solution in original post

robert-reed · ‎10-12-2009

Quoting - knmaheshy2k

I need to implement a template function of this type. Functions 1 to n has to be executed in parallel on multiple cores. Any suggestions on implementing it the most optimized way???

For(i=1 to 100)
{

Execute Function1;

.

Execute Function n;

//Barrier Synchronization
Wait till all n functions are completed
}

I did see parallel_invoke.

How important is it that these functions all execute in parallel? There may be a few machines out there that can supply a hundred HW threads to run so many functions simultaneously, but not many. More likely is that you will see a serial cascade of tasks as threads complete previous function calls and go back to the scheduler for more work. Likely Function1 will complete long before Function100 is spawned.

That said, the simplest way to do this would be with a task_list that spawns all 101 tasks as a block. They'll get dispatched as idle HW threads use the scheduler to peel off tasks. Parallel_invoke tries to avoid this serial launch semantic by having a parallel invocation hierarchy, like a phone tree, that divides the task of spawning to several tasks.

knmaheshy2k · ‎10-12-2009

Thanks for replying Reed.

1. How important is it that these functions all execute in parallel?

A: I need to extract parallelism by executing these functions in parallel.

2.There may be a few machines out there that can supply a hundred HW threads to run so many functions simultaneously, but not many.

A: I get the hint. I know which machines you actually mean. Its one of those green things, isn't it?? Lets say I'm working on something similar but by blue. :)

3.More likely is that you will see a serial cascade of tasks as threads complete previous function calls and go back to the scheduler for more work. Likely Function1 will complete long before Function100 is spawned.

A: I'm fine with it. I don't have much problems with that part.

4.That said, the simplest way to do this would be with a task_list that spawns all 101 tasks as a block. They'll get dispatched as idle HW threads use the scheduler to peel off tasks.

A: Can you explain a bit more plz?

Thanks in advance!

robert-reed · ‎10-12-2009

Quoting - knmaheshy2k

Thanks for replying, Robert.

4.That said, the simplest way to do this would be with a task_list that spawns all 101 tasks as a block. They'll get dispatched as idle HW threads use the scheduler to peel off tasks.

A: Can you explain a bit more plz?

Well, I don't have time today to fully verify this, so this out-of-my-head implementation example is guaranteed not to compile and is probably missing some filigree here or there, but this will convey the basic idea:

tbb::task_list list;
for (int i = 0; i < n ; ++i) {
list.push_back( *new( tbb::task::allocate_child() ) OrderedFunctionTask(i) );
}
set_ref_count(n + 1);
spawn_and_wait_for_all(list);

Where OrderedFunctionTask is a class derived from Task whose function object selects the appropriate function, perhaps to initialize a local function pointer within the object. That should at least get you started.

knmaheshy2k · ‎10-12-2009

Quoting - Robert Reed (Intel)

Well, I don't have time today to fully verify this, so this out-of-my-head implementation example is guaranteed not to compile and is probably missing some filigree here or there, but this will convey the basic idea:

tbb::task_list list;
for (int i = 0; i < n ; ++i) {
list.push_back( *new( tbb::task::allocate_child() ) OrderedFunctionTask(i) );
}
set_ref_count(n + 1);
spawn_and_wait_for_all(list);

Where OrderedFunctionTask is a class derived from Task whose function object selects the appropriate function, perhaps to initialize a local function pointer within the object. That should at least get you started.

Thanks Robert. I got what you meant. Will be looking forward if someone else has an better solution.

RafSchietekat · ‎10-12-2009

Is parallelism required, i.e., would the program fail if run on a single thread? How big is n? Is it fixed or variable?

knmaheshy2k · ‎10-12-2009

Quoting - Raf Schietekat

Is parallelism required, i.e., would the program fail if run on a single thread? How big is n? Is it fixed or variable?

Hi Raf,

Yes, Parallelism is required. It wouldn't fail if it runs on single thread. I'm doing some experimentations.

n can vary from 10 to 100 or more. Where as the for loop iteration 1:100 is just for illustration. It usually would be 1:100000. Do you have any insights on what numbers (n and for loop iternation) would actually faster parallely than serial execution?

Alexey-Kukanov · ‎10-13-2009

Other suggestions:
- similarly to the task list approach suggested byRobert, you may use task_group class which has methods to start a task and to wait for all running tasks belonging to the task_group.
- if signatures of all functions are similar (i.e. contain the same number and types of arguments), you might put pointers to functions to an array, and run parallel_for over this array, calling your functions (via pointers) in parallel_for body.

The second suggestion, if possible to implement, could be somewhat more efficient, because the tasks to run will be produced by multiple threads, and stealing will be limited. In the task_group based solution, all tasks will be produced by the same thread and so all other threads will have to constantly steal a single task to execute.

But I also have a concern that parallelizing on the inner level likely has limited potential for scalability & speedup. May be reworking an algorithm to reduce or eliminate the need for barrier synchronization between outer loops would create additional parallelism. I do not know of course whether it is possible in your case.

RafSchietekat · ‎10-13-2009

"Yes, Parallelism is required. It wouldn't fail if it runs on single thread. I'm doing some experimentations."
The phrase "required parallelism" (or rather "required concurrency", I suppose) means specifically that the program would fail if processing would not at least preemptively switch between the different jobs (e.g., for a producer/consumer arrangement), so you actually don't require concurrency if it runs on a single thread. That's what TBB likes best (otherwise you might very easily run into trouble before you know it).

"n can vary from 10 to 100 or more. Where as the for loop iteration 1:100 is just for illustration. It usually would be 1:100000. Do you have any insights on what numbers (n and for loop iternation) would actually faster parallely than serial execution"
So it can vary at run time, and it's larger than the number of hardware/worker threads, right? You might very well go with #3, but Robert has already indicated the importance of recursive parallelism (here "parallel invocation hierarchy"), so I would instead consider making the functions accessible as range-addressable function objects (perhaps just with a numerical index, i.e., in an array), and using tbb::parallel_for instead of a tbb::task_list. To determine whether parallel is faster than serial, you have to consider both task size (relative to scheduling overhead) and available parallelism (are there enough tasks to distribute around and amortise the wait at the end?), and then I would suggest to just try different arrangements (there may also be other factors that affect the outcome), depending on how crucial optimisation is.

(Added) Hmm, this is what happens if you cannot obtain a lock before answering...

Alexey-Kukanov · ‎10-13-2009

Quoting - Raf Schietekat

(Added) Hmm, this is what happens if you cannot obtain a lock before answering...

Great minds think alike :)

knmaheshy2k · ‎10-13-2009

Thanks Alex and Raf! Good day!

jimdempseyatthecove · ‎10-13-2009

I too agree that if possible use a functor list and a parallel_for or parallel_for_eachfor task distribution.

If you were to use a task list or spawn individual tasks you would have one task spawn worth of overhead per function (n spawns).

However, using parallel_for or parallel_for_each you would incure number of (participating) threads task spawn worth of overhead.

Reduction of task scheduling overhead is important for writing efficient code.

Now your code might look something like the following
Note the following is written in QuickThread as opposed to TBB however nearly the same functionality is available in TBB. You should be able to adapt this relatively easy to TBB. (info on QuickThread available on www.quickthreadprogramming.com)

[cpp]#include "QuickThread.h"
#include "parallel_for.h"

struct YourContext
{
  // define your context
  int dummyToAvoidNullStruct;
};

void YourFn1(YourContext* context) { return; }
void YourFn2(YourContext* context) { return; }
// ...


void YourFn1000(YourContext* context) { return; }

typedef void(*YourFunctorTakingYourContext)(YourContext*);

YourFunctorTakingYourContext YourFunctorList[] =
{
   YourFn1,
   YourFn2,
   //...
   YourFn1000
};

const int YourFunctorListCount = sizeof(YourFunctorList) / sizeof(YourFunctorList[0]);

void SomewhereInYourCode()
{
  for(int i=0; i < 100; ++i)
  {
     YourContext context;
     // initialize context for this iteration
     //...
     parallel_for_each(
	0,YourFunctorListCount,
	[&](int ifn) { (*YourFunctorList[ifn])(&context); });
  }
}
[/cpp]

Jim Dempsey

robert-reed · ‎10-13-2009

Quoting - knmaheshy2k

Thanks Alex and Raf! Good day!

Well, poo! If I had known n could get as high as 100,000, even I would have not suggested the linear process of building task list. ;-) A task list that large would quickly become an in-memory-only list (not cached anywhere), an anchor line dredging up tasks from the depths. Alexey's suggestion of a parallel_for seems the best in that it uses the available HW threads in concert to divide and dispatch the functions in parallel and in scalable blocks that can be worked on independently. I would not suggest parallel_for_each as someone recommended, since it has the same semantics as parallel_do and the same linear access as the task_list I originally suggested.

So whatcha building? Sounds like a possible implementation of a time step in some sort of simulator.

knmaheshy2k · ‎10-13-2009

Quoting - Robert Reed (Intel)

So whatcha building? Sounds like a possible implementation of a time step in some sort of simulator.

Bulls eye!! You got it right Reed!! Cant disclose here anything much, but can let you know offline. mnanjund, you know what it is.. :)

robert-reed · ‎10-13-2009

Quoting - knmaheshy2k

Bulls eye!! You got it right Robert!! Cant disclose here anything much, but can let you know offline. mnanjund, you know what it is.. :)

I don't need to know more details but I'd suggest that with dispatch in hand, the next problem you might encounter is the potential for side effects from the function calls. Presumably each represents a functional block whose inputs have been set at the start of the parallel section and each will have outputs to set before the end of the section. How those steps land and how they might affect cache residency among the threads would be my next area of concern. You can buffer the hell out of everythingbut still those buffers need to reside somewhere in the address space and may be the subject of contention themselves as the outputs from various functions are recombined to form the inputs for the next step.

jimdempseyatthecove · ‎10-13-2009

Robert,

My comments suggested using parallel_for or parallel_for_each, the choice left up to the programmer. I illustrated parallel_for_each. The choice of template depends on the variances in estimated compute time for each function. When the function execution time is relatively the same then parallel_for can be used. When execution time varies greatly between functions then parallel_for_each might be in order. parallel_for_each in QuickThread may have lower overhead than parallel_for_each in TBB. The scheme in QT for parallel_for_each is to divide the iteration space up by the current number of available threads (current thread may be included or excluded from set), schedule an additional parallel_for_each for n-1 partitions and directly call for one of the partitions. The task for the n-1 partitions starts (now having one less available thread) the makes same division choice. At some point all threads (may optionally be subset of all threads on system) are active. At this point the sub ranges functions objectsare directly called (no furthertask enqueue/dequeue). As the sub ranges are processed by a shell function the scheduler is examined for idle threads (peek at one word). When idle thread is noticed by any of the remaining running shell functions then it will split its remaining range as done at the beginning. While parallel_for_each has more overhead than parallel_for it has the benefit of dynamic load balancing which helps whenthe functions in your functor list have indeterminant run times and when the system is running other applications than yours.

The parallel_for syntax would be as follows

[cpp]#include "QuickThread.h"
#include "parallel_for.h"

struct YourContext
{
  // define your context
  int dummyToAvoidNullStruct;
};

void YourFn1(YourContext* context) { return; }
void YourFn2(YourContext* context) { return; }
void YourFn1000(YourContext* context) { return; }

typedef void(*YourFunctorTakingYourContext)(YourContext*);

YourFunctorTakingYourContext YourFunctorList[] =
{
   YourFn1,
   YourFn2,
   //...
   YourFn1000
};

const int YourFunctorListCount = sizeof(YourFunctorList) / sizeof(YourFunctorList[0]);

void SomewhereInYourCode()
{
  for(int i=0; i<100; ++i)
  {
    YourContext context;
    // initialize context for this iteration
    //...
    parallel_for(
      0,YourFunctorListCount,
      [&](int ifnBegin, int ifnEnd) { 
        for(int ifn = ifnBegin; ifn < ifnEnd; ++ifn)
          (*YourFunctorList[ifn])(&context); });
  }
}
[/cpp]

As an alternative you can also use a chunking factor

int iChunk = 2; // your number of functions per slice of iteration space

parallel_for(
iChunk,
0,YourFunctorListCount,
[&](int ifnBegin, int ifnEnd) {
for(int ifn = ifnBegin; ifn < ifnEnd; ++ifn)
(*YourFunctorList[ifn])(&context); });

Jim Dempsey

knmaheshy2k · ‎10-13-2009

Thanks Jim. I will work keeping your comments in mind.

robert-reed · ‎10-13-2009

Quoting - jimdempseyatthecove

My comments suggested using parallel_for or parallel_for_each, the choice left up to the programmer. I illustrated parallel_for_each. The choice of template depends on the variances in estimated compute time for each function. When the function execution time is relatively the same then parallel_for can be used. When execution time varies greatly between functions then parallel_for_each might be in order. parallel_for_each in QuickThread may have lower overhead than parallel_for_each in TBB. The scheme in QT for parallel_for_each is to divide the iteration space up by the current number of available threads (current thread may be included or excluded from set), schedule an additional parallel_for_each for n-1 partitions and directly call for one of the partitions. The task for the n-1 partitions starts (now having one less available thread) the makes same division choice. At some point all threads (may optionally be subset of all threads on system) are active. At this point the sub ranges functions objectsare directly called (no furthertask enqueue/dequeue). As the sub ranges are processed by a shell function the scheduler is examined for idle threads (peek at one word). When idle thread is noticed by any of the remaining running shell functions then it will split its remaining range as done at the beginning. While parallel_for_each has more overhead than parallel_for it has the benefit of dynamic load balancing which helps whenthe functions in your functor list have indeterminant run times and when the system is running other applications than yours.

Sorry, Jim. I was going by the syntax of the TBB parallel_for_each, which matches that of the Microsoftversion. This is after all a forum on Threading Building Blocks, not QT. For each of these interfaces, access to the list is defined by STL-styleInput Iterators, whichsupport sequential,notrandom access lists and thus my previous comments. That thing in QT you call parallel_for_each in the example above is some different beastie altogether.

I also don't follow the logic suggested above regarding the choice of parallel_for versus parallel_for_each. TBB parallel_for does support load balancing, so if a couple of those functions take longer than the rest, the HW threads that call them get busy on those particular tasks and leave the untouched tasks for other HW threads to steal.If the tasks are all very small, I'd probably experiment with different grain sizes in the simple_partitioner to see where the balance point lives.

jimdempseyatthecove · ‎10-14-2009

No problem Robert.

Correct me if I am wrong... The load balancing of parallel_for partitions the iteration space up into a set of tasks with each task receiving a slice of the iteration space. The more slices the more task enqueue/dequeue overhead. And the number of slices is dependent on the thread pool size and/or with the specification of a grain size. Due to the partitioning of the iteration space occuring at the beginning (within template) you can identify a grain size that may keep the majority of threads busy but also potentially at the expense of incurring more tasking overhead. In the original post it was illustrated 1000 functions in the loop. While 1000 tasks would keep all threads busy, it would also incure the greatest amount of overhead for task queuing.

One technique that could be employed in TBB (as well as QuickThread) would be to time each function (the developer can do this during testing). Then using the timing information, create nfunctorlists (n=size of thread pool). Then launch n tasks, one for each functor list. With some finess, the timing can be done at run time with functors moved from one list to another.

The end-game goal being keeping the task scheduler out of the way of performing useful work.

BTW QuickThread has parallel_list for processing linked list iteration spaces (similar to but different from TBB's parallel_while)

Jim Dempsey

robert-reed · ‎10-14-2009

Quoting - jimdempseyatthecove

Correct me if I am wrong... The load balancing of parallel_for partitions the iteration space up into a set of tasks with each task receiving a slice of the iteration space. The more slices the more task enqueue/dequeue overhead. And the number of slices is dependent on the thread pool size and/or with the specification of a grain size. Due to the partitioning of the iteration space occuring at the beginning (within template) you can identify a grain size that may keep the majority of threads busy but also potentially at the expense of incurring more tasking overhead. In the original post it was illustrated 1000 functions in the loop. While 1000 tasks would keep all threads busy, it would also incure the greatest amount of overhead for task queuing.

Well, load balancing is not an attribute specific to the parallel_for but rather, isan attribute of the underlying scheduler. And parallel_for itself does not do the partitioning (there are several partitioners with different traits available) or the mapping of those partitions to the iteration spaces (handled by the blocked_range classes). The number of slices of the iteration space is dependent both on the blocked_range selected and the initial size of the space, though the auto_partitioner, which is the default in TBB 2.2, does also use the thread pool size as a factor in determining when to stop slicing. The slicing itself is much as you described previously for the QT parallel_for_each, a divide and conquer that splits the former iteration space in half. However, the partitioner only splits tasks on demand, not"within the template" as suggested above. Grain size is relevant only for the simple partitioner, where it results in less scheduler overhead, not more (leaves lists of tasks that can be executed in sequence with just a brief pass through the scheduler). The original post used 100 functions (not 1000) but eventuallyspeculated the possibility of 100,000. While there is more scheduling overhead for 1000 (or 100,000) tasks, that overhead is distributed both among the thread pool and in time (you might call it "lazy partitioning"). It can get in the way sometimes-I know ofone case where an OpenMP parallel for with thebasic scheduler operating over an array with very little work to do per element beats the equivalent TBB parallel for by several percent. However, in the context of calling 100 to 100,000 functions with presumably varying amounts of work, it should be a different story.

Quoting - jimdempseyatthecove

One technique that could be employed in TBB (as well as QuickThread) would be to time each function (the developer can do this during testing). Then using the timing information, create n functor lists (n=size of thread pool). Then launch n tasks, one for each functor list. With some finess, the timing can be done at run time with functors moved from one list to another.

I'm not sure what's intended here. Presumably dividing into n "functor" lists is intended to assign one list per thread. Not described is how to form these lists, though my guess based on contextis to balance the loads across the threads. Assuming that the functions in question individually execute with uniform durations (not guaranteed), it seems the goal would be to distribute the functions so as to have a combination of functions in each list that take roughly the same amount of time. That itself seems to be a combinatorial problem, especially if executed dynamically, that would be its own source of overhead.

jimdempseyatthecove · ‎10-14-2009

Robert,

OpenMP has a chunk argument for the parallel for schedule statement. When static scheduling is selected the slicing is fixed, you can elect to use a smaller chunk size than which evenly divides the iteration space in to number of thread number of pieces. When dynamic scheduling is selected the chunk size is reduced each time a thread completes a slice (sub range of iteration space). The art is in picking the correct scheduling method and (initial) chunk size such that the number of slices scheduled is small yet manages to keep all thread busy and/or finishes the complete iteration space in the shortest time. TBB partitioning (iteration space slicing) occurs up front.When a slice task starts its iterator has a fixed slice of the original iteration space. The parallel_for template (combined with internal functionality of your scheduler) will make an up-front decision as to the number of slices (i.e. tasks) to make. The art (for TBB) is in picking the number (size) of slices such that all slices finish in the shortest amount of time. And the programmer may use hints (schedule w/wo chunk in OpenMP, choice of iterator w/wo grain size in TBB). The problem though, which is addressed to some extent with the OpenMP dynamic and chunk is that all cores/HW threads of the system are not always available and dedicated to the application. The iteration space slice-up is purely speculative.

An alternative is to not use a general purpose technique but instead write a specific purposed method (with potential of becoming generalized). A list of functor lists might be one such example for handling a large number of arbitrary functions that can be run in parallel in each iteration of a process loop.

Jim Dempsey