Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.
2466 Discussions

lengthy postponed shared data initialization & thread locking question

vanswaaij
Beginner
733 Views

I understand from the docs that if, e.g., TBB has 2 physical threads but many tasks, and two tasks reach the same lengthy postponed initialization code for shared data, then one task must wait for the other to finish (through a mutex), and the thread running that waiting task will wait as well, even though many other tasks are available to be executed.

If I'm right about this assumption, what would be a possible solution to this problem? Note that the reason for postponement is that not all of this shared data is eventually initialized, and thus time is saved (in the sequential code, at least).
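The blocking scenario described above can be sketched in portable C++11 (the names `SurfaceData`, `get_data`, and the counter are hypothetical, and `std::call_once` stands in for the mutex-guarded initialization): the second task's thread blocks inside `call_once` until the first finishes, even though other runnable tasks exist.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared datum with a lengthy, lazy initialization.
struct SurfaceData {
    int value = 42;
};

std::once_flag g_once;
SurfaceData* g_data = nullptr;
std::atomic<int> g_init_calls{0};

SurfaceData& get_data() {
    // A second caller blocks here until the first finishes initializing,
    // even though its thread could be executing other pending tasks.
    std::call_once(g_once, [] {
        g_init_calls.fetch_add(1);   // count initializations (should be 1)
        g_data = new SurfaceData;    // stands in for the lengthy init
    });
    assert(g_data != nullptr);
    return *g_data;
}
```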

Thanks for any insights.

0 Kudos
22 Replies
Dmitry_Vyukov
Valued Contributor I
674 Views
Quoting - vanswaaij

I understand from the docs that if, e.g., TBB has 2 physical threads but many tasks, and two tasks reach the same lengthy postponed initialization code for shared data, then one task must wait for the other to finish (through a mutex), and the thread running that waiting task will wait as well, even though many other tasks are available to be executed.

If I'm right about this assumption, what would be a possible solution to this problem? Note that the reason for postponement is that not all of this shared data is eventually initialized, and thus time is saved (in the sequential code, at least).

Thanks for any insights.

There are many solutions to the problem. Some are easier to implement, some are more kosher.

Here is one variant. It defers tasks when the needed resource is not yet initialized. The intended usage is that every task that needs some resource is passed through the resource's spawn() method, so that when the task is executed it can access the resource w/o blocking.

I.e. initialization of a resource does not block other tasks; they are just deferred until the resource is fully initialized.

[cpp]template<class resource_t>
class resource_wrapper : nocopy
{
public:
    resource_wrapper()
        : state_(state_uninitialized)
        , resource_()
    {
        deferred_tasks_.reserve(8);
    }

    ~resource_wrapper()
    {
        delete resource_;
    }

    void spawn(tbb::task* parent, tbb::task* child)
    {
        state_t prev_state;
        {
            lock l (mtx_);
            prev_state = state_;
            if (state_ == state_uninitialized)
            {
                state_ = state_initializing;
            }
            else if (state_ == state_initializing)
            {
                deferred_tasks_.push_back(std::make_pair(parent, child));
            }
            else if (state_ == state_initialized)
            {
                // no-op
            }
        }
        if (prev_state == state_uninitialized)
        {
            // lengthy initialization, doesn't block other threads
            resource_ = new resource_t;
            {
                lock l (mtx_);
                assert(state_ == state_initializing);
                state_ = state_initialized;
            }
            for (size_t i = 0; i != deferred_tasks_.size(); ++i)
            {
                deferred_tasks_[i].first->spawn(*deferred_tasks_[i].second);
            }
        }
        else if (prev_state == state_initializing)
        {
            // no-op
        }
        else if (prev_state == state_initialized)
        {
            parent->spawn(child);
        }
    }

    resource_t& resource()
    {
        assert(resource_);
        return *resource_;
    }

private:
    enum state_t {state_uninitialized, state_initializing, state_initialized};
    mutex       mtx_;
    state_t     state_;
    resource_t* resource_;
    std::vector<std::pair<tbb::task*, tbb::task*> > deferred_tasks_;
};
[/cpp]

It should do the trick. What do you think?

0 Kudos
Dmitry_Vyukov
Valued Contributor I
674 Views
Quoting - Dmitriy Vyukov

There are many solutions to the problem. Some are easier to implement, some are more kosher.

Here is one variant. It defers tasks when the needed resource is not yet initialized. The intended usage is that every task that needs some resource is passed through the resource's spawn() method, so that when the task is executed it can access the resource w/o blocking.

I.e. initialization of a resource does not block other tasks; they are just deferred until the resource is fully initialized.

This solution can be generalized for the case when task requires several resources as well.

Btw, Arch, if your proposal wrt better DAG support is incorporated into TBB, then, I think, it will be possible to provide a much better experience for the end user. I.e. there will be no need for continuations (i.e. breaking a task into several tasks); initialization of the resource will be synchronous, but will still not block progress:

[cpp]class my_task
{
    void execute()
    {
        ...;
        // possibly lengthy initialization
        // but while one thread does initialization,
        // other threads in stealing mode
        resource1& r1 = resource_wrapper1.resource();
        resource2& r2 = resource_wrapper2.resource();
        // work with r1 and r2
        ...;
    }
};
[/cpp]

Here is implementation sketch:

[cpp]template<class resource_t>
class resource_wrapper : nocopy
{
public:
    resource_wrapper()
        : state_(state_uninitialized)
        , resource_()
    {
        deferred_tasks_.reserve(8);
    }

    ~resource_wrapper()
    {
        delete resource_;
    }

    resource_t& resource()
    {
        state_t prev_state;
        {
            lock l (mtx_);
            prev_state = state_;
            if (state_ == state_uninitialized)
            {
                state_ = state_initializing;
            }
            else if (state_ == state_initializing)
            {
                tbb::task* t = &tbb::task::self();
                assert(t->ref_count() == 0);
                t->set_ref_count(1);
                deferred_tasks_.push_back(t);
            }
            else if (state_ == state_initialized)
            {
                // no-op
            }
        }
        if (prev_state == state_uninitialized)
        {
            // lengthy initialization, doesn't block other threads
            resource_ = new resource_t;
            {
                lock l (mtx_);
                assert(state_ == state_initializing);
                state_ = state_initialized;
            }
            for (size_t i = 0; i != deferred_tasks_.size(); ++i)
            {
                deferred_tasks_[i]->decrement_ref_count();
            }
        }
        else if (prev_state == state_initializing)
        {
            // go to stealing mode while waiting for initialization
            tbb::task::self().wait_for_all();
        }
        else if (prev_state == state_initialized)
        {
            // no-op
        }
        return *resource_;
    }

    resource_t& resource()
    {
        assert(resource_);
        return *resource_;
    }

private:
    enum state_t {state_uninitialized, state_initializing, state_initialized};
    mutex       mtx_;
    state_t     state_;
    resource_t* resource_;
    std::vector<tbb::task*> deferred_tasks_;
};
[/cpp]

0 Kudos
Dmitry_Vyukov
Valued Contributor I
674 Views
Quoting - Dmitriy Vyukov

This solution can be generalized for the case when task requires several resources as well.

Btw, Arch, if your proposal wrt better DAG support will be incorporated into TBB, then, I think, it will be possible to provide much better experience for the end user. I.e. there will be no need for continuations (i.e. breaking task into several tasks), initialization of the resource will be synchronous, but will still not block progress:

Damn! It's Cilk++'s HyperObjects:

http://www.cilk.com/multicore-products/cilk-hyperobjects/

It's a kind of non-blocking generalization of mutual exclusion backed by the scheduler. A task tries to get exclusive access to some resource, and if the attempt fails the task is suspended and the thread goes into stealing mode. When a task finishes working with a resource, it wakes up the suspended tasks.

0 Kudos
vanswaaij
Beginner
674 Views
Quoting - Dmitriy Vyukov
Quoting - Dmitriy Vyukov

There are many solutions to the problem. Some are easier to implement, some are more kosher.

Here is one variant. It defers tasks when the needed resource is not yet initialized. The intended usage is that every task that needs some resource is passed through the resource's spawn() method, so that when the task is executed it can access the resource w/o blocking.

I.e. initialization of a resource does not block other tasks; they are just deferred until the resource is fully initialized.

This solution can be generalized for the case when task requires several resources as well.

Btw, Arch, if your proposal wrt better DAG support is incorporated into TBB, then, I think, it will be possible to provide a much better experience for the end user. I.e. there will be no need for continuations (i.e. breaking a task into several tasks); initialization of the resource will be synchronous, but will still not block progress:

[cpp]class my_task
{
    void execute()
    {
        ...;
        // possibly lengthy initialization
        // but while one thread does initialization,
        // other threads in stealing mode
        resource1& r1 = resource_wrapper1.resource();
        resource2& r2 = resource_wrapper2.resource();
        // work with r1 and r2
        ...;
    }
};
[/cpp]

Here is implementation sketch:

[cpp]template<class resource_t>
class resource_wrapper : nocopy
{
public:
    resource_wrapper()
        : state_(state_uninitialized)
        , resource_()
    {
        deferred_tasks_.reserve(8);
    }

    ~resource_wrapper()
    {
        delete resource_;
    }

    resource_t& resource()
    {
        state_t prev_state;
        {
            lock l (mtx_);
            prev_state = state_;
            if (state_ == state_uninitialized)
            {
                state_ = state_initializing;
            }
            else if (state_ == state_initializing)
            {
                tbb::task* t = &tbb::task::self();
                assert(t->ref_count() == 0);
                t->set_ref_count(1);
                deferred_tasks_.push_back(t);
            }
            else if (state_ == state_initialized)
            {
                // no-op
            }
        }
        if (prev_state == state_uninitialized)
        {
            // lengthy initialization, doesn't block other threads
            resource_ = new resource_t;
            {
                lock l (mtx_);
                assert(state_ == state_initializing);
                state_ = state_initialized;
            }
            for (size_t i = 0; i != deferred_tasks_.size(); ++i)
            {
                deferred_tasks_[i]->decrement_ref_count();
            }
        }
        else if (prev_state == state_initializing)
        {
            // go to stealing mode while waiting for initialization
            tbb::task::self().wait_for_all();
        }
        else if (prev_state == state_initialized)
        {
            // no-op
        }
        return *resource_;
    }

    resource_t& resource()
    {
        assert(resource_);
        return *resource_;
    }

private:
    enum state_t {state_uninitialized, state_initializing, state_initialized};
    mutex       mtx_;
    state_t     state_;
    resource_t* resource_;
    std::vector<tbb::task*> deferred_tasks_;
};
[/cpp]

Thanks, I get the gist of the solution, not the details, but that's because I have had no need to go beyond the different looping templates. Would it be possible to implement this in the context of, say, a parallel_for?

I guess if there is enough work to do, locking would never force others to wait if implemented in the above way.

Thanks again.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
674 Views
Quoting - vanswaaij

Thanks, I get the gist of the solution, not the details, but that's because I have had no need to go beyond the different looping templates. Would it be possible to implement this in the context of, say, a parallel_for?

I guess if there is enough work to do, locking would never force others to wait if implemented in the above way.

Yes, it's possible to use this with parallel_for. You just have to replace your resources with wrapped resources. However, it was a very crude and quick sketch; it probably doesn't work at all. Maybe someone from the TBB team will validate the idea.

Yes, there will be no [long] blocking (short blocking is still possible, but one can make lock-free fast paths in order to eliminate all blocking from the fast path).

0 Kudos
Dmitry_Vyukov
Valued Contributor I
674 Views
Quoting - Dmitriy Vyukov

Yes, it's possible to use this with parallel_for. You just have to replace your resources with wrapped resources. However, it was a very crude and quick sketch; it probably doesn't work at all. Maybe someone from the TBB team will validate the idea.

Yes, there will be no [long] blocking (short blocking is still possible, but one can make lock-free fast paths in order to eliminate all blocking from the fast path).

Oh, forgot to mention that naive usage of my proposal can probably lead to deadlocks if there are cyclic dependencies between tasks.

0 Kudos
Alexey-Kukanov
Employee
674 Views
Quoting - Dmitriy Vyukov
Yes, it's possible to use that with parallel_for. You just have to replace your resources with wrapped resources. However it was very crude and fast sketch, probably it doesn't work at all, maybe someone from TBB team will validate the idea.


The idea seems viable, though the sketch won't work. First, somehow you ended up with two methods having the same signature. Also, avoiding taking a lock when accessing an already initialized resource seems important enough; of course, that requires memory fences when reading & writing the state. Then, in order to use wait_for_all, the reference count should be set to 2 initially (one for the dependence on the resource, and one for the wait_for_all call). Last but not least, wait_for_all won't exit until all tasks in the local pool are dispatched; Dmitriy of course knows that, but it might be a surprise for the users of this class (a sort of Promise).

0 Kudos
Dmitry_Vyukov
Valued Contributor I
674 Views


The idea seems viable, though the sketch won't work. First, somehow you ended up with two methods having the same signature. Also, avoiding taking a lock when accessing an already initialized resource seems important enough; of course, that requires memory fences when reading & writing the state. Then, in order to use wait_for_all, the reference count should be set to 2 initially (one for the dependence on the resource, and one for the wait_for_all call). Last but not least, wait_for_all won't exit until all tasks in the local pool are dispatched; Dmitriy of course knows that, but it might be a surprise for the users of this class (a sort of Promise).

Damn! The second resource() method must be removed.

Yes, the double-checked initialization idiom can (must) be applied here. With a true or induced data dependency it will have basically no cost (no fences) once the resource is initialized. I omitted it just for clarity.

As long as the user algorithm doesn't have infinite task chains and doesn't require concurrency, I think the fact that wait_for_all won't exit until all tasks in the local pool are dispatched won't do harm: the threads are just doing useful work, and whether it is this work or that work doesn't really matter.
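As a minimal illustration of the double-checked initialization idiom mentioned above (a portable C++11 sketch, not TBB code; `Resource` and `LazyResource` are hypothetical names): the fast path is a single acquire load with no lock, and the mutex is taken only while the resource is still uninitialized.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

struct Resource { int ready = 1; };  // stands in for the real shared data

class LazyResource {
public:
    Resource& get() {
        // Fast path: one acquire load, no lock, once initialized.
        Resource* r = ptr_.load(std::memory_order_acquire);
        if (!r) {
            std::lock_guard<std::mutex> l(mtx_);
            // Re-check under the lock: another thread may have won the race.
            r = ptr_.load(std::memory_order_relaxed);
            if (!r) {
                r = new Resource;  // lengthy initialization, done once
                ptr_.store(r, std::memory_order_release);
            }
        }
        assert(r != nullptr);
        return *r;
    }

    ~LazyResource() { delete ptr_.load(); }

private:
    std::atomic<Resource*> ptr_{nullptr};
    std::mutex mtx_;
};
```

Note that this removes only the lock from the fast path; a thread arriving during the lengthy initialization still blocks on the mutex, which is exactly the deferred-task machinery in the sketches above is meant to avoid.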

0 Kudos
jimdempseyatthecove
Honored Contributor III
674 Views


I don't know if I can frame my thoughts properly in words.

Can you rework the code such that the "lengthy postponed shared data initialization" is not surrounded by a lock and terminates by setting an initialization-done condition?

Rather, have the "lengthy postponed shared data initialization" upon completion spawn the task cascade that becomes the part of the application dependent on the initialization data.

IOW, up until initialization is complete, you will never have tasks enqueued that are dependent on the data. And therefore you will never incur a blocking situation for the initialization data.

Jim Dempsey

0 Kudos
vanswaaij
Beginner
674 Views


I don't know if I can frame my thoughts properly in words.

Can you rework the code such that the "lengthy postponed shared data initialization" is not surrounded by a lock and terminates by setting an initialization-done condition?

Rather, have the "lengthy postponed shared data initialization" upon completion spawn the task cascade that becomes the part of the application dependent on the initialization data.

IOW, up until initialization is complete, you will never have tasks enqueued that are dependent on the data. And therefore you will never incur a blocking situation for the initialization data.

Jim Dempsey


I see your point, but no, not really. It is unknown ahead of time which uninitialized shared data will be encountered, and the data can't know which tasks will need it. The application is a ray tracer and the shared data is the surface description of an object hit by a ray. The object can be hit by many rays coming from any direction, or not get hit at all :)

0 Kudos
RafSchietekat
Valued Contributor III
674 Views

"I see your point, but no not really." Did you? At least in principle it seems a fairly straightforward program transformation: keep the mutex, but, instead of waiting for notification from a condition variable, split off a continuation task and register it with the data, which, instead of calling notify_all() on a condition variable, just spawns any registered continuations. The registration list can be a singly linked list, where you just add to the front; I'm not sure whether the order of spawning matters. I admit that I am reluctant about this whole continuations business myself (it feels too much like doing extra low-level plumbing), but that's what's required at this time, and it should get you what you want. Really.
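The transformation described above can be sketched generically, outside TBB, with a hypothetical InitGate helper (the class name and methods are illustrative, not any library's API): waiters register a continuation under the mutex instead of blocking, and whoever completes the initialization runs (in TBB, would spawn) every registered continuation.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <vector>

// Hypothetical deferral helper: a condition variable's wait/notify_all
// is replaced by registering continuations and running them on completion.
class InitGate {
public:
    // Returns true if initialization is already complete (caller may
    // proceed directly); otherwise queues the continuation for complete().
    bool run_or_defer(std::function<void()> cont) {
        std::lock_guard<std::mutex> l(mtx_);
        if (done_) return true;
        deferred_.push_back(std::move(cont));
        return false;
    }

    // Called once by the initializing thread when the data is ready.
    void complete() {
        std::vector<std::function<void()>> pending;
        {
            std::lock_guard<std::mutex> l(mtx_);
            done_ = true;
            pending.swap(deferred_);  // take the list outside the lock
        }
        for (auto& c : pending) c();  // "spawn" the registered continuations
    }

private:
    std::mutex mtx_;
    bool done_ = false;
    std::vector<std::function<void()>> deferred_;
};
```

In a real TBB version the stored continuations would be task pointers and complete() would spawn them, as in the sketches earlier in the thread.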

0 Kudos
RafSchietekat
Valued Contributor III
674 Views

Or maybe not (sorry)... there might be a similar problem as with futures (dependency inversion), depending on what a ray tracer has to do exactly (I don't know enough about that). Is that what you meant? But wouldn't you have similar problems with condition variables, and if so, how is it solved there?

0 Kudos
jimdempseyatthecove
Honored Contributor III
674 Views
Quoting - vanswaaij

I see your point, but no, not really. It is unknown ahead of time which uninitialized shared data will be encountered, and the data can't know which tasks will need it. The application is a ray tracer and the shared data is the surface description of an object hit by a ray. The object can be hit by many rays coming from any direction, or not get hit at all :)


Initialization:

Assume you have a block of uninitialized data. Assume further that this uninitialized data is not always required but may at times be required by more than one task. Assume further that this uninitialized data need be initialized only once.

The coding practice I would favor would be to protect the uninitialized data with an atomic state variable. e.g.

volatile long uiState = 0;
// 0 == Uninitialized
// 1 == Initialization in progress
// 2 == Initialized

// arbitrary thread
if(uiState != 2)
{
    if(InterlockedCompareExchange(&uiState, 1, 0) == 0)
    {
        InitializeData(); // may be multi-threaded
        uiState = 2; // could be set inside of, and at the end of, InitializeData
    }
    else
    {
        while(uiState != 2)
            WorkSteal();
    }
}
... go about your business

Encapsulate the above into a class if you wish, so that the same can be applied to different collections of uninitialized data.
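The same tri-state pattern can be rendered in portable C++11 (a sketch with hypothetical names; std::atomic replaces InterlockedCompareExchange, and std::this_thread::yield() stands in for the WorkSteal() call):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

enum State { kUninitialized = 0, kInitializing = 1, kInitialized = 2 };

std::atomic<int> g_state{kUninitialized};
int g_shared = 0;                // the lazily initialized datum
std::atomic<int> g_inits{0};     // how many times init actually ran

void ensure_initialized() {
    int expected = kUninitialized;
    // Exactly one thread wins the CAS and performs the initialization.
    if (g_state.compare_exchange_strong(expected, kInitializing)) {
        g_shared = 42;           // lengthy initialization, done once
        g_inits.fetch_add(1);
        g_state.store(kInitialized, std::memory_order_release);
    } else {
        // Losers wait; a TBB-style scheduler would steal work here
        // instead of merely yielding the CPU.
        while (g_state.load(std::memory_order_acquire) != kInitialized)
            std::this_thread::yield();
    }
}
```

The release store paired with the acquire loads guarantees that any thread observing kInitialized also sees the fully written g_shared.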

Ray tracing:

There are multiple ways to perform ray tracing, some are simple, some are complex, some are optimized, some are not. I will not present an argument for the best way to perform ray tracing but instead present a generalization.

Each object in a system of objects may or may not have a surface. For each object with a surface you may or may not elect to dissect the surface into patches. Each patch may or may not catch light from potential light sources. The caught light's reaction with the surface patch may or may not be dependent upon incident angle, intensity, color of light, texture of surface, albedo, temperature, etc... The reaction generally results in the incoming light becoming a light source.

For each light source, it may be a point source or a surface source; the color and intensity may vary, and may vary with the angular vector from the normal of the surface or from the incident angle of incoming light in the case of reflection (described above). Once emitted, the light may pass through a medium or vacuum. Further, some of the objects may exhibit transparency of varying extent, with an index of refraction, as well as exhibit a pro-ration of reflection, refraction, absorption, polarization, etc...

IOW Ray tracing is not simple, I am sure I missed a few things.

For ray tracing you have a number of sources and a number of sinks (sinks becoming sources as explained above).

The general concept for ray tracing is to run a permutation of all sources and sinks (with optimizations to eliminate dead zones such as occultations).

For each patch, then, the number of permutations to light sources is unknown and varies with time (except for a static display).

Or conversely, for each light source, the number of permutations to patches is unknown and varies with time (except for a static display).

You could run the major loop either on patches or light sources.

The general solution might consist of an iterative process. The first pass assumes each patch is absent of light, computes the all-sources-to-patch contribution, and in the process computes reflected and refracted light. The second pass considers the reflected and refracted light as secondary light sources and runs the same process as the first pass, except that now this adds to the contribution of the light striking the patch. As an optimization, and as a signal for termination, statistics are maintained on the change in contribution of light to each patch (and from what source), as well as on the maximum change to any one patch. This second process is repeated, excluding patch/light-source combinations that fall below a minimum threshold of contribution, until all are below the minimum threshold, at which point you have attained your acceptable stasis.

The above may be truncated if you want speed over quality.

Since the permutation path for each patch is large and indeterminate, the coding technique I would recommend is to process the list of patches in parallel off of an iterator, picking one patch at a time, or a reasonably sized but diminishing set at each pick. Presumably you would have many more patches than you have cores available. I would not anticipate the need for spawning tasks in the loops at lower (recursive) levels. As an optimization, though, when the threads finish up at a large enough time difference, you could set a flag to enable timing of the patch calculation for each patch processed in the outer loop. Then after the next pass, re-order the patch list (longest time first) and reset the sample flag. At some point you will stop re-ordering the patch list, and then periodically a re-order may be required.

At completion of the outer loop (with the possibility of early termination), the physics-section results data would be used to affect the object motions.

The ray tracing process, as described above, is not an example of indeterminate waiting on uninitialized data.

Assume, though, that while you are running the above you have a display update task (which may be multi-threaded), whereby you start a snapshot (performed as a mosaic by several threads), where the snapshot data is your uninitialized data and may come in out of sequence. As the sequences complete, you may wish to shove them into the display adapter (out-of-order processing).

Because the update task can take advantage of multi-threading, it would be desirable for the relatively long patch-to-light-source processing to be suspendable when a thread of the display update task demands service.

For this, I would recommend NOT scheduling more threads of higher priority, as is often done. Instead, I would recommend an acquiescent process model whereby each thread in the ray tracing section periodically checks a display-refresh-request-pending flag and, if it is set, simply performs a task stealing call (as opposed to a higher-overhead task switch, or an even higher-overhead system thread context switch):

// sprinkled in ray tracing code
while(DisplayRefreshTaskStealingRequested)
DisplayRefreshTaskStealing();

This is particularly easy to do and has Spartan overhead expense.

Jim Dempsey
0 Kudos
RafSchietekat
Valued Contributor III
674 Views
Jim, thanks for the notes on ray tracing, although I have a feeling that Maurice (presumably) already knows a thing or two about this subject.

It seems that I somehow overlooked some earlier messages that made mine largely redundant (sorry again), but I'm still curious about the questions in #12.
0 Kudos
jimdempseyatthecove
Honored Contributor III
674 Views
Quoting - Raf Schietekat
Jim, thanks for the notes on ray tracing, although I have a feeling that Maurice (presumedly) already knows a thing or two about this subject.

It seems that I somehow overlooked some earlier messages that made mine largely redundant (sorry again), but I'm still curious about the questions in #12.

Ray tracing is probably a bad choice for examining dependency inversion since the work distribution would likely be best handled by tasking only at the outer level.

Dependency inversion detection is difficult due to different programming styles. In the threading library I produce (QuickThread) there is a special test for this. Actually, it is a test for self-dependency, whereby as you task-steal and nest deeper you might end up task stealing at a lower level and take yourself from a higher level when your past dependency had completed, but prior to unwinding the task-stealing nesting-level stack. The consequence of not performing the detection is an infinitely growing stack or stalled task-stealing threads, neither of which is productive (although I would rather have a stall than a crash).

When I converted the tryHavok Smoke demo for use with QuickThread it quickly became evident that the defensive code was overly protective, since Havok uses the same functions to dispatch nested tasks. This resulted in the self-dependency detection logic triggering a false positive (poor performance resulted due to stalling threads). After fixing this, performance was restored.

I am trying to get Dmitriy to run the Smoke demo for me. My system is a Q6600 running Windows XP Pro x64, and the tryHavok Smoke demo comes with 32-bit libraries. The demo runs in 32-bit mode on the 64-bit O/S, but it crawls along (or should I say thunks along). Dmitriy has a 32-bit version of Windows XP and should have no problems running the test (other than the lengthy downloads).

Jim Dempsey
0 Kudos
Blue_Sky_Studios
Beginner
674 Views
I want to implement the solution recommended by Vyukov in #4. I'm new to this, so pardon me if my questions have trivial answers. The keys to understanding the implementation seem to be the methods for suspending a task and then reawakening suspended tasks after the resource is initialized.

(1) It seems to me that the code would suspend a task that must wait for pending initialization by setting its ref count to 1 and calling wait_for_all(). Kukanov's comment in #7 says that the ref count should be 2. If that is true, the wake-up code would decrement the count to 1; then what code decrements it to 0, and when does that happen? What is happening with the ref count?

(2) The wake-up code calls decrement_ref_count(), but there is no documented function like that. Is there an implied function that sets the ref count to ref_count() - 1? The docs say that ref_count() is only intended for debugging. Would I have to lock the code that calls set_ref_count(ref_count() - 1), or is there some other method for atomically decrementing the ref count?
0 Kudos
RafSchietekat
Valued Contributor III
674 Views
(1) I invite you to read the documentation: 1 for the child, plus 1 for wait_for_all() itself.

(2) You may allocate a tbb::empty_task child to represent the reference, and spawn it to decrement the parent's reference count.
0 Kudos
Blue_Sky_Studios
Beginner
674 Views
Quoting - Raf Schietekat
(2) You may allocate a tbb::empty_task child to represent the reference, and spawn it to decrement the parent's reference count.
I have not been able to find a way that works reliably and passes various assertions in debug mode.

When I used
tbb::empty_task * c = new (task.allocate_child ()) tbb::empty_task;
to create the child of the waiting task, I would sometimes get the complaint that task was not owned by the thread. Then I tried
tbb::empty_task * c = new (myself.allocate_additional_child_of (task)) tbb::empty_task;
However, this increments ref_count, so task now has a ref count 1 too high. If I try to use set_ref_count to lower it by 1, either before the allocation or after the allocation but before the call to spawn or after the spawn, I sometimes see ref_count being altered by another thread. The result is either a complaint that task already has ref_count = 0 at the time of spawning the child, or the task doesn't wake up (probably my set_ref_count raised it after an asynchronous decrement).

So I am still looking for a reliable way to let a task go to sleep until another task wakes it. Is there a way to set a lock that will allow me to read the current ref_count and then set a new value without risking an intervening modification by another thread?

By the way, I am using a lock to guarantee that only one thread can access my list of waiting tasks. The code is similar to the pattern suggested earlier in this thread.
0 Kudos
RafSchietekat
Valued Contributor III
674 Views
"I would sometimes get the complaint that task was not owned by the thread"
Is there any reason why the parent couldn't call allocate_child() itself?

"However, this increments ref_count, so task now has a ref count 1 too high."
Then don't include this reference in the initial set_ref_count(), just the one for wait_for_all(), and note that it's probably too late to call allocate_additional_child_of() after the parent has called wait_for_all() in this context: you would have to know for certain that at least some other child still has not finished, and that does not seem to be the case here.

"If I try to use set_ref_count to lower it by 1, either before the allocation or after the allocation but before the call to spawn or after the spawn, I sometimes see ref_count being altered by another thread."
I'm not sure I understand the scenario (I'm a bit distracted now), but consider set_ref_count() not thread safe: it must never be used from different threads, or after a child has been spawned, or in competition with allocate_additional_child_of() or wait_for_all() or any other problematic combination I may have omitted here. If you are using set_ref_count() to modify a value that you are reading from the task, you're probably on the wrong track: normally you should know exactly what it should be so that you can tell the task without asking it first, and you should probably set it before doing anything else related to child tasks. Take the "probablies" to be understatements. :-)

"So I am still looking for a reliable way to let a task go to sleep until another task wakes it."
Let the task allocate its own empty_task and make a pointer available to whatever thread wants to wake the parent.
0 Kudos
Alexey-Kukanov
Employee
638 Views
The following is the key:

Btw, Arch, if your proposal wrt better DAG support is incorporated into TBB, then, I think, it will be possible to provide a much better experience for the end user. I.e. there will be no need for continuations (i.e. breaking a task into several tasks); initialization of the resource will be synchronous, but will still not block progress.

The decrement_ref_count function is from that proposal. It is not there yet, but it will be added.
0 Kudos
Reply