Software Archive
Read-only legacy content

Q about fibonacci example

Victor_E_1
Beginner
738 Views
//From the examples website: https://www.cilkplus.org/download#block-views-code-samples-block-1
int fib(int n)
{
    if (n < 2)
        return n;
    int x = cilk_spawn fib(n-1);
    int y = fib(n-2);
    cilk_sync;
    return x + y;
}

This leads to much recalculation of old fib values. A better implementation would store the value, and return it if it has already been computed. However, that also requires critical sections or locks to serialize access to "has_already_been_computed" variable. Does cilk have a mechanism for that?
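(For concreteness, here is the kind of lock-based memoization I mean — a plain C++ sketch with invented names, not Cilk code:)

```cpp
#include <mutex>
#include <unordered_map>

// Sketch of a memoized fib guarded by a mutex. The lock is released before
// recursing, so the recursive calls cannot deadlock on it. All names here
// are illustrative; this is not a Cilk Plus feature.
static std::unordered_map<int, long long> fib_cache;
static std::mutex fib_cache_mutex;

long long fib_memo(int n) {
    if (n < 2) return n;
    {
        std::lock_guard<std::mutex> g(fib_cache_mutex);
        auto it = fib_cache.find(n);
        if (it != fib_cache.end()) return it->second;  // already computed
    }
    long long v = fib_memo(n - 1) + fib_memo(n - 2);
    std::lock_guard<std::mutex> g(fib_cache_mutex);
    fib_cache.emplace(n, v);  // first writer wins; a duplicate insert is a no-op
    return v;
}
```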

I'm looking to implement a general DAG code, where every node evaluates its predecessors. Each node should be evaluated only once, after which it becomes possible to retrieve its value. Right now it seems to me that cilk lacks the mechanism for that. 

What feature am I missing?

Victor.

6 Replies
Bradley_K_
New Contributor I

If you actually want to compute Fibonacci numbers quickly, you should not use this algorithm, which takes time exponential in n. There is an algorithm that does it in O(log n) time.

The point of this example is to show how to write a simple parallel program with Cilk, not to compute fib efficiently.
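(For reference, one such O(log n) algorithm uses the fast-doubling identities F(2k) = F(k)(2F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2; the function name in this sketch is invented here:)

```cpp
#include <cstdint>
#include <utility>

// Fast-doubling Fibonacci: returns the pair (F(n), F(n+1)) in O(log n)
// recursive steps, using F(2k) = F(k)*(2*F(k+1) - F(k)) and
// F(2k+1) = F(k)^2 + F(k+1)^2. A sketch; the name fib_pair is invented.
std::pair<uint64_t, uint64_t> fib_pair(unsigned n) {
    if (n == 0) return {0, 1};                  // (F(0), F(1))
    std::pair<uint64_t, uint64_t> p = fib_pair(n / 2);
    uint64_t a = p.first, b = p.second;         // a = F(k), b = F(k+1)
    uint64_t c = a * (2 * b - a);               // F(2k)
    uint64_t d = a * a + b * b;                 // F(2k+1)
    if (n % 2 == 0) return {c, d};              // n = 2k
    return {d, c + d};                          // n = 2k+1
}
```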

For DAG codes, take a look at these two papers:

Tim Kaler, William Hasenplaugh, Tao Schardl, and Charles Leiserson. Executing Dynamic Data-Graph Computations Deterministically Using Chromatic Scheduling. In SPAA 2014.

William Hasenplaugh, Tim Kaler, Charles Leiserson, and Tao Schardl. Ordering Heuristics for Parallel Graph Coloring. In SPAA 2014.

Victor_E_1
Beginner

Thanks for replying, but I'm not really interested in graph algorithms: I know how to traverse my graph; I just don't know how to implement the traversal in Cilk.

Here is the code I want to parallelize. Obviously, the "p->execute" call can be prefixed with a spawn, and there has to be a sync before the "local_execute". But how do you prevent a task from being executed more than once? If several threads hit the first conditional and all find it false, they will all do the calculation redundantly. In other systems you would solve this with a lock or a critical section. How do you solve it in Cilk?


void task::execute() {
    if (this->has_been_executed) return;
    for (task *p : this->predecessors())
        p->execute();
    this->local_execute();
    this->has_been_executed = 1;
}
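(For comparison, this is roughly what the lock/critical-section solution looks like in plain C++ — a sketch with invented member names, mirroring the pseudocode above:)

```cpp
#include <mutex>
#include <vector>

// A sketch of the lock-based scheme: a mutex makes the check-and-set of
// has_been_executed a critical section, so only the first caller runs the
// task. Member names mirror the pseudocode above and are all invented.
struct task {
    std::mutex m;
    bool has_been_executed = false;
    std::vector<task*> preds;
    int value = 0;
    void local_execute() { value = 42; }  // placeholder for the real work
    void execute() {
        {
            std::lock_guard<std::mutex> g(m);
            if (has_been_executed) return;
            has_been_executed = true;     // claim the task under the lock
        }
        // Caveat: the flag means "claimed", not "finished" -- a losing
        // caller can return while the work below is still in progress.
        for (task* p : preds) p->execute();
        local_execute();
    }
};
```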


Bradley_K_
New Contributor I

The algorithms in those papers are suited to Cilk. The idea is that you first color the graph; the coloring has the property that if two nodes have the same color, there is no race when you update both at the same time. Then you execute each color class in a parallel loop. The coloring itself can be parallelized too.
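(A serial sketch of the coloring step — greedy coloring standing in for the parallel algorithms in those papers; all names are invented here:)

```cpp
#include <vector>

// Greedy graph coloring: adjacent nodes get different colors, so within one
// color class there are no edges and updates are race-free. This serial
// sketch stands in for the parallel coloring algorithms in the papers cited.
std::vector<int> greedy_color(const std::vector<std::vector<int>>& adj) {
    int n = (int)adj.size();
    std::vector<int> color(n, -1);
    for (int v = 0; v < n; v++) {
        std::vector<bool> used(n + 1, false);
        for (int u : adj[v])
            if (color[u] >= 0) used[color[u]] = true;
        int c = 0;
        while (used[c]) c++;  // smallest color not used by any neighbor
        color[v] = c;
    }
    return color;
}
// To run the computation: for each color c in turn, process every node with
// color[v] == c in a parallel loop (a cilk_for), then move on to color c+1.
```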

As to your proposed approach: you need something like locks, but locks are too strong, since they block a worker from running. If someone is already executing a task, there is no reason for us to wait on its lock. So I've reformulated the idea so that each node counts how many of its predecessors have finished, using a class like this:

class node {
  std::atomic<int> count;          // how many predecessors have finished
  int degree;                      // in-degree: number of predecessors
  std::vector<node*> successors;
  void local_execute();
  void execute() {
    // Only the caller that finishes the last predecessor runs this node.
    if (count.fetch_add(1) + 1 == degree) {
      local_execute();
      cilk_for (size_t i = 0; i < successors.size(); i++)
        successors[i]->execute();
    }
  }
};

Rather than starting at the end of the graph and pulling in the predecessors, I start at the beginning and push forward. Maybe you can figure out how to reformulate it to go back to front.

This code requires some way to reset all the counts to zero before running execute on the root of the graph, and some way to build the graph.

This code might run OK, but I'd be pleasantly surprised if it actually gives good performance.
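(Here is a self-contained serial rendition of the counting scheme; the loop over successors marks where the cilk_for would go, and the order-recording field is invented purely for illustration:)

```cpp
#include <atomic>
#include <vector>

// Forward, count-based DAG execution: a node runs once all of its
// predecessors have finished, i.e. when its counter reaches its in-degree.
// Roots take degree 1, so the single driver call triggers them. Serial here;
// the loop over successors is where a cilk_for or cilk_spawn would go.
struct dag_node {
    std::atomic<int> count{0};
    int degree = 1;                      // in-degree (1 for roots)
    std::vector<dag_node*> successors;
    std::vector<int>* order = nullptr;   // records execution order (illustration only)
    int id = 0;

    void execute() {
        if (count.fetch_add(1) + 1 == degree) {
            order->push_back(id);        // stands in for local_execute()
            for (dag_node* s : successors)
                s->execute();
        }
    }
};
```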

-Bradley

Victor_E_1
Beginner

Isn't the whole point of DAG systems like Cilk that you don't have to do the colouring? You just insert tasks into a queue with their dependencies, and the runtime executes them in the right order. Since the runtime will not devote a thread to a task that still has unfulfilled dependencies, you get maximal parallel speedup. I'm not sure what colouring would buy you.

Besides, both in the Fibonacci case and in my application there are very few independent tasks, since the whole DAG is just a long list of wavefronts, each of fairly small size. Say a million nodes in the graph, but never more than a few dozen ready to execute.

Anyway, "std::atomic"... I didn't know that existed. So Cilk by itself does not have enough expressive power for such algorithms? Sounds like I should explore the multithreading of C++11/14 instead.

V.

Bradley_K_
New Contributor I

There are several reasons that Cilk is the way it is. The system has a provably good runtime, unlike the other approaches: if you measure the work and span, you can predict the running time as roughly W/P + S, where W is the work, S is the span, and P is the number of processors. The system has tools to measure work and span. It has a provably good race detector. And Cilk can handle the situation where one processor is getting less work done than the others (perhaps because another process is running on it), and automatically load-balances.

The point isn't that you don't have to do the coloring.    You still need to come up with some way to deal with races for the situation where you cannot express all the dependencies as control dependencies.

To say that Cilk doesn't have enough power to do something is a little off the point.  Cilk is part of C++, so you have the full power of C++: you can vectorize.  You can write races.  You can use locks.

The real point of Cilk is to enable you to write high-performance reliable programs.  

You may be able to code your application in C++11 threads, but there are usually no performance guarantees from the scheduler, and the tools for finding races aren't accurate.

If you have only a dozen things to do at a time, then your program will get at most a dozen-fold speedup, in the best case. If the cost of local_execute is less than several thousand instructions, the scheduling overheads are going to cause difficulty no matter what you do, and I will be surprised if your program runs faster on multiple cores than as single-threaded code without any parallel overheads (e.g., your original code with no std::atomic). It doesn't matter whether it's Cilk or pthreads. If you want your graph traversal to run fast, the best you may be able to do is to order the nodes with a topological sort, and then run them one after another with no recursion, simply as

for (node *n : topologically_sorted_nodes) n->local_execute();

This assumes that you want to execute the graph many times after constructing it.
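(A sketch of that preprocessing step, using Kahn's algorithm; the names here are invented:)

```cpp
#include <queue>
#include <vector>

// Kahn's algorithm: repeatedly emit a node with no remaining unfinished
// predecessors. The result is an order in which a flat serial loop can run
// every node exactly once with all of its dependencies already satisfied.
std::vector<int> topo_order(const std::vector<std::vector<int>>& succ) {
    int n = (int)succ.size();
    std::vector<int> indeg(n, 0);
    for (const auto& s : succ)
        for (int v : s) indeg[v]++;
    std::queue<int> ready;                     // nodes with no pending predecessors
    for (int v = 0; v < n; v++)
        if (indeg[v] == 0) ready.push(v);
    std::vector<int> order;
    while (!ready.empty()) {
        int u = ready.front(); ready.pop();
        order.push_back(u);
        for (int v : succ[u])
            if (--indeg[v] == 0) ready.push(v);
    }
    return order;                              // size < n means the graph had a cycle
}
```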

Jim_S_Intel
Employee

Victor,

 There are a lot of ways I might answer your questions and address some of these issues.   I may have missed something, but here are a few points:

  1. If I wear my Intel hat, the short answer to your original question is: there is not currently an official language feature (as far as I know) in the Cilk Plus product that does exactly what I think you have in mind, right out of the box. That doesn't mean there aren't ways to execute this kind of code in Cilk Plus (e.g., see some of my later comments below), but not everything that people know how to do with Cilk-style code has made it into the product. It also doesn't mean that new features couldn't be added in the future, if something turns out to be useful and in demand. One of the challenges for new features is making sure that they play nicely with existing constructs and don't hurt performance on the commonly used Cilk constructs. Another challenge is that a lot of features that sound like good ideas at first turn out to be less exciting than people expect. In any case, any additional information you can share about your use case would be interesting.
    There are also some other product options available today that might work for you. (The flow graph in Intel Threading Building Blocks might be feasible, although I don't know much about it myself.)

  2. Cilk is not a system for executing DAGs. It is a language that lets you express parallelism, and using it you can conceptually execute anything that matches a certain large class of DAGs, using spawn and sync. But the spawn and sync keywords were never designed to express and execute an arbitrary DAG, i.e., one with arbitrary edges going from any node to any other node. Instead, Cilk limits itself in the kinds of DAGs it can express. (That class of DAGs goes by many names, such as strict fork-join, series-parallel, etc., with the names not all meaning exactly the same thing and not always being completely precise, but that's a different discussion.)
     
  3. Why do we impose a limit at all / why is the Cilk language currently the way it is?   One can dive really deep into the technical reasons and various tradeoffs, but there are two fundamental intuitions that I find useful to keep in mind:
    1. Structured DAGs can be optimized more easily and efficiently. Arbitrary structure in a DAG means more bookkeeping (and thus more expensive synchronization operations), because you have to keep track of all the edges in the DAG. With the structure imposed by spawn/sync, many more edges can be implicit, and there are more optimizations, etc. that the Cilk runtime scheduler can rely on.
    2. Arbitrary DAGs can require a lot of space, and can be difficult to express.  Short of building the entire graph and storing all the edges, I don't know of a lot of compact ways to express an arbitrary DAG.   
      In contrast, consider the fib(n) example: conceptually, the number of tasks in fib(n) is exponential in n. The Cilk runtime couldn't actually construct the entire DAG ahead of time for a computation like fib(43) before starting, or keep it around for the entire duration, because that would fill up the memory on the machine. Something more complicated is going on underneath, to reuse space for "tasks", etc.

      (Now, in your example, it sounds like this latter part may be less of an issue, if you already need to build the entire graph for other reasons. But you will also need to allocate extra space for bookkeeping.)
  4. Comparing C++ 11/14 features for multithreading and Cilk Plus is a whole different can of worms that I won't even try to touch here. :)


Finally, if I remove my Intel hat and engage in some shameless self-promotion, I can offer one last suggestion. In the past, I worked on Nabbit, a research prototype of a Cilk library that may do what you have in mind for task-graph computations. The source for that library is available online, and should work with the Intel compiler:

https://github.com/jsukha/nabbit

Note that I wrote this code several years ago, when I was arguably young and naive. There are a bunch of places in the code that make me cringe now. :) In particular, the interface in this library probably isn't as nice as it could be. The way I had things set up was to require the user to subclass a special type of node and define a "compute" method for it. Would this kind of library interface work for your application, or do you need something different?

I haven't worked on Nabbit for a while because I hadn't come across a good application to drive it. But if you have a use case that would make a good example, let me know, and I can try to make improvements.


As Bradley discussed, though, granularity and scheduling overheads are still key. Having a lot of small fine-grained tasks is going to be more painful than executing larger coarse-grained tasks, no matter which language/platform you use, and Nabbit won't be an exception. If your problem is a DAG with a lot of tiny tasks, then you may eventually need to find some way to group tasks together (e.g., exploit structure in the DAG to cluster tasks, or do graph coloring, or impose extra dependencies, etc.).
That is where some of Bradley's suggestions might come into play: if the straightforward approaches to executing your DAG don't give good performance, then you may need to do something more complicated.

Cheers,

Jim
