Question about steal-continuation semantics in Cilk Plus, Global counter slowing down computation, return value of functions

Robert_M_ · ‎03-08-2014

1)
What I understood about steal-continuation is, that every idle thread does not actually steal work, but the continuation which generates a new working item.
Does that mean, that inter-spawn execution time is crucial? If 2 threads are idle at the same time, from what I understand only one can steal the continuation and create its working unit, the other thread stays idle during that time?!

2)
As a debugging artefact, I had a global counter incremented on every function call of a function used within every working item.

I expect this value to be wrong (e.g. lost update), as it is not protected by a lock. what I didn't expect was execution time being 50% longer. Can somone tell me, why this is the case?

3)
Du I assume correctly, that a cilk-spwaned function can never (directly) return a result, as the continuation might continue in the mean time and one would never know when the return value is actually written?

TimP · ‎03-08-2014

2)

A race condition incurs the same cache issues as false sharing. Each time a worker reads or updates a cache line which has been modified by another worker, the cache line must first be updated across all levels of cache and memory.

Jim_S_Intel · ‎03-08-2014

For (1) and (3), it is useful to consider a short code example.

int f() {
    int x, y;
    x = cilk_spawn g();
    y = h();
    cilk_sync;
    return x+y;
}

In this example, the worker thread that starts executing f() will also start executing g(). An idle worker might steal the continuation of the cilk_spawn of g(), meaning that it may start executing the code "y = h()". Assuming that f() is the only thing that happens in the program, and that g() and h() are not themselves parallel, then there is only work for 2 workers in this example. Using three workers won't help unless there is either nested parallelism inside g() or h(), or something else is allowed to executing in parallel with f() itself. Fortunately, Cilk Plus is designed to efficiently handle recursive / nested parallelism, so exposing more parallelism is often not a problem.

As far as (3), it is correct to say that you shouldn't look at (or change) the value of "x" in the continuation, because that would be a race. After the cilk_sync, however, it is safe to use "x" again.

I'm not quite sure what you mean by "inter-spawn execution time", but hopefully that helps answer some of your questions.
Cheers,

Jim

Barry_T_Intel · ‎03-09-2014

To really understand this, we need to go a little deeper.

When the compiler sees the statement

x = cilk_spawn g();

it generates a "spawn helper" function. The spawn helper performs a number of tasks.

It encapsulates and constructs any temporary variables needed by the spawned function. This means that they won't go out of scope until g() returns.
It "detaches" the spawned function. Each worker maintains a deque of spawned functions. When a function detaches, it pushes an entry onto the tail of the worker's deque, indicating that the continuation of the spawning function is available to be stolen.
It calls the spawned function.
It executes the assignment into the return value, again guaranteeing that the value doesn't go out of scope until g() returns.
Any temporaries are destructed.
The deque is checked. If the parent's continuation has *not* been stolen, the deque entry is popped off the tail of the worker's deque, and the spawn helper returns normally. If the parent's continuation *has* been stolen the Cilk runtime decrements the "join counter" it maintains for the spawning function. If the join counter is now 0, it means that this worker is the last one to the cilk_sync. Execution resumes after the cilk_sync in the spawning function. If the join counter is not 0, then there are other workers which have not yet reached the cilk_sync. This worker falls out to the "steal loop" and searches for work to do. Obviously the manipulations of the deque and join counter are done *very* carefully to prevent races.

Let's consider a system with 3 workers, called w0, w1 and w2. At the start of Jim's example, w0 is executing f(), and w1 and w2 are "idle." "Idle" means that they are both executing the "steal loop" in the Cilk runtime, randomly selecting a worker and seeing if there's any work for them to steal. When w0 executes the cilk_spawn statement an entry is pushed onto the tail of w0's as described above. Since both w1 and w2 are looking for work, the first one that finds the entry will pop it off the head of w0's deque, increment the join counter, and execute the continuation. The "losing" worker will keep looking for work until it finds some, or you exit the parallel region and the system worker threads (the workers created by the Cilk runtime) are all stopped.

When the continuation reaches the cilk_sync statement, it calls into the Cilk runtime which will decrement the join counter. If the result is not zero, the worker will fall out to the steal loop and search for additional work to do. If the join counter is zero, execution will continue on that worker.

This means that there is no guarantee which worker will continue from a cilk_sync statement, and Cilk applications should not use Thread-Local Storage (TLS). The only exception to this is that the top-level spawning function (the function that does the first cilk_spawn) is guaranteed to exit on the same thread it was entered.

I hope this makes things a little clearer.

- Barry

Robert_M_ · ‎03-10-2014

Thank you for you answers.

(1)
Rethinking my question 1, I already figured out my mistake in thinking. There is not THE continuation, but A Continuation which is stolen, as long as the spawned functions fork out further functions as they should.

by "inter-spawn execution time" I mean whatever happens in a function in between two spawns. If there is really only one continuation available, which forks out non-forking children, then this would actually be a bottleneck. As this is usually not the case due to recursive forks within the children, this is not a problem

(2)
This tremendous effect is caused only by cache-misses due to invalidation? wow!

(3)
@Barry: How exactly is TLS used in spawned functions? I want to make sure that I don't accidentally use it.

Barry_T_Intel · ‎03-10-2014

RE: How exactly is TLS used in spawned functions? I want to make sure that I don't accidentally use it.

That's going to depend on your program and compiler.

On Windows, a variable can have the __declspec(thread) attribute, indicating that it's allocated in Thread-Local storage.
Using a GCC-based compiler, a variable can have the __thread attribute.

- Barry