Patterns of when and where to use this_task_arena::isolate

Alexander_F_2 · 11-22-2016

Hi,

my question is about when and where to use the TBB 2017 preview feature this_task_arena::isolate.

We ran into the problem described in https://software.intel.com/en-us/node/684814:

- We have a parallel_for in high-level code.
- One worker thread processes a task from this outer loop.
- The thread triggers lazy initialization of a data structure. This is an internal optimization. The thread
  - takes a lock,
  - then creates the data structure using an inner parallel_for, while still working on the task from the outer loop.
  - The worker thread processes a task from this inner loop.
  - Then the thread becomes available again, before the inner loop has finished and while it is still holding the lock.
    - It processes a task from the outer loop.
    - This again triggers the lazy initialization of the same data structure, for which the thread already holds the lock.
    - Deadlock.

Now, based on https://software.intel.com/en-us/node/684814 and other threads (e.g. https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/401006, https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/611256, https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/285550) the best options seem to be:

- Do not nest parallel_for. In this particular case we moved the lazy initialization before the outer loop, so the problem is solved. But it surprised us, because the two loops occur at very different levels of the application: high-level logic and low-level library code. And we are not sure whether we have other cases where such a potentially fatal nesting could happen.
- Use a task_arena for starting the inner loop. This supposedly has some non-trivial overhead.
- Or use the preview feature this_task_arena::isolate. Supposedly this has very little overhead. And I think in the case of our inner loop it is sufficient that only the thread that starts the inner loop is prevented from taking tasks from the outer loop.

Now my question is:

Shouldn't parallel_for loops in libraries always be within a task_arena or this_task_arena::isolate? At least for code like ours, where you don't know whether your code might be called from within another parallel_for.
If so, where should the task_arena or this_task_arena::isolate scope start? Should it be close to the critical code and contain just the parallel_for call? Or, in our case with the lazy initialization guarded by a lock, should it be where it conceptually makes sense, right after the lock is taken?

Thanks,

Alexander

Alexei_K_Intel · 12-04-2016

Hi Alexander,

The task isolation functionality does not completely cover the lazy initialization pattern. The main issue is that when a thread hits an uninitialized data structure, it tries to acquire the lock to perform the initialization. What if the thread is not the first one and another thread is already performing the initialization? The thread that cannot acquire the mutex will just wait for the initialization to complete, so it cannot join the nested parallel loop and make the initialization faster. Currently, to cover the lazy initialization scenario (if you want to perform the initialization in parallel), it is better to use task_arena because you can explicitly join threads to it (by calling the execute method). However, this can introduce some overhead because the worker threads will migrate from the main/default arena to the initialization arena. In theory, you can try to create an arena without workers (the second constructor parameter should be equal to the first one). Then only the threads that hit the uninitialized data structure will join this arena explicitly. Does this solution work for you? Or would you prefer to be able to join threads to isolated regions without using additional arenas?

In my opinion, the idea of not nesting parallel loops is not a good one, because one of the main Intel TBB features is that parallelism is composable across different levels of an application. Composability makes it possible to extract as much parallelism as possible, which seems to be a good design pattern.

The main idea of task_arena is to limit the number of threads that can participate in a particular piece of work. It also provides isolation. If you need only isolation, it is better to use the task isolation functionality because it is lightweight. However, if there is contention on the mutex, many threads cannot join the isolated region and cannot participate in useful work. (Keep in mind that Intel TBB has a thread pool with a limited number of threads. If some of the threads are blocked on the mutex, the CPU can be underutilized.)

As for applying isolation everywhere: yes, it is possible. However, it can limit the available parallelism, so overall performance can be lower than it could be. In the general case, it is therefore not recommended to apply it everywhere.

If a mutex must be held across a parallel part of the code, consider whether the code can be called from a parallel context. Even if this is only a possibility (e.g. it is a library entry point), use isolation. It also makes sense to keep the isolated region as small as possible so as not to reduce potential parallelism.

Interestingly, it makes no difference whether the mutex is acquired before the isolated region or after entering it. However, if a scoped lock is used, it can make sense to acquire the mutex inside the isolated region because the lambda provides a scope, i.e.

tbb::this_task_arena::isolate(
    [] {
        guard_lock lock(mutex);
        // Parallel work
    } // guard_lock is released here automatically, which is also safe if an exception is thrown.
);

I hope this answers your questions. If something is unclear or you have further thoughts, feel free to ask.

Regards, Alex

Alexander_F_2 · 12-05-2016

Hi Alex,

thanks a lot for your answer. A few details are still not clear to me.

Alexei K. (Intel) wrote:

The task isolation functionality does not completely cover the lazy initialization pattern. The main issue is that when a thread hits an uninitialized data structure, it tries to acquire the lock to perform the initialization. What if the thread is not the first one and another thread is already performing the initialization? The thread that cannot acquire the mutex will just wait for the initialization to complete, so it cannot join the nested parallel loop and make the initialization faster. Currently, to cover the lazy initialization scenario (if you want to perform the initialization in parallel), it is better to use task_arena because you can explicitly join threads to it (by calling the execute method).

I do not understand how to make worker threads that are waiting on the lazy initialization task join the lazy initialization task_arena.

If the lazy initialization task is guarded by a lock, all tbb worker threads that hit the lock will be blocked until the initialization task is finished. So do I have to prevent them from blocking on the lock? And then manually make them join the task_arena? Which task_arena method provides this functionality? The execute method expects a function to execute, so this doesn't really help in joining the work of an existing lazy initialization task, right?

Or is there some automatic TBB mechanism here that I am missing?

A code example would be really helpful.

Best,

Alex

Alexei_K_Intel · 12-05-2016

I'll try to answer with an example:

void Initialize() {
    static std::atomic<bool> initialized{false};
    static std::mutex mutex;
    static tbb::task_arena arena;
    static tbb::task_group tg;

    // Check if initialization is already done.
    if ( !initialized.load(std::memory_order_acquire) ) {
        // Check if we are the first.
        if ( mutex.try_lock() ) {
            // Double check that initialization is not done yet.
            if ( !initialized ) {
                arena.execute( [] {
                    tg.run_and_wait( [] {
                        // Do initialization.
                    } );
                } );
                initialized = true;
            }
            mutex.unlock();
        } else {
            // Join the initialization process.
            arena.execute( [] {
                while ( !initialized )
                    tg.wait();
            } );
        }
    }
}

Note that this solution is not exception-safe, so take care if the initialization process can throw an exception.
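One part of the exception-safety concern, the manual mutex.unlock(), can be sketched with the standard library alone. The following is a minimal illustration using std::unique_lock with std::try_to_lock, so that the lock is released even if the initialization throws. The names g_initialized, g_init_mutex and try_initialize_once are hypothetical, and the sketch deliberately leaves out the task_arena and task_group parts.

```cpp
#include <atomic>
#include <mutex>
#include <stdexcept>
#include <utility>

static std::atomic<bool> g_initialized{false};
static std::mutex g_init_mutex;

// Runs do_init at most once. Returns true if this call performed the
// initialization, false if it was already done or another thread holds the lock.
template <typename F>
bool try_initialize_once(F&& do_init) {
    if (g_initialized.load(std::memory_order_acquire)) return false;
    // try_to_lock mirrors mutex.try_lock(); the destructor unlocks on every
    // exit path, including stack unwinding after an exception.
    std::unique_lock<std::mutex> lock(g_init_mutex, std::try_to_lock);
    if (!lock.owns_lock()) return false;
    if (g_initialized.load(std::memory_order_relaxed)) return false;
    std::forward<F>(do_init)();  // may throw; g_initialized then stays false
    g_initialized.store(true, std::memory_order_release);
    return true;
}
```

If the initialization throws, the flag stays false and a later caller can retry. Threads that are already waiting in a loop on the flag would keep waiting until such a retry succeeds, so a full solution would still need a way to report the failure to them.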

Alexander_F_2 · 12-06-2016

Oh, I wasn't aware of task_group and how you could use it in this context with task_arena.

Thanks a lot, that answers my questions.

Best,

Alex

Alexei_K_Intel · 12-14-2016

Hi Alexander,

I am writing a new comment to make sure you receive a notification. I was advised that my lazy initialization approach based on tbb::task_group and tbb::task_arena has a race in the part where threads join the initialization process:

...
        } else {
            // Join the initialization process.
            arena.execute( [] {
                tg.wait();
            } );
        }
...

When the first thread acquires the mutex, other threads can try to join the initialization process while the task group is still empty (because the first thread has not started the initialization yet). Those threads can then skip waiting and leave the initialization process before the initialization has even begun. To fix this race, we need to wrap the wait in a while loop:

...
        } else {
            // Join the initialization process.
            arena.execute( [] {
                while ( !initialized ) // Guards against the race between acquiring the mutex and starting the initialization
                    tg.wait();
            } );
        }
...

I have also updated the initial version.

Regards, Alex

Alexander_F_2 · 12-14-2016

Hi Alex,

thanks for the fix. I didn't spot the race condition in your original example.

Best,

Alex

Chan__Eric · 12-02-2017

I have a question about this proposed solution (Alex's post). Isn't it possible for the run_and_wait call (line 14) to try to "steal" a task from another thread that is doing lines 23--25? That is, could it happen that the first thread that gets to the run_and_wait call ends up executing a task from another thread that is trying to join the initialization? (If so, wouldn't that cause a deadlock?)

Alexei_K_Intel · 12-04-2017

Hi Eric,

If there are more threads than the task_arena capacity (max_concurrency), the task_arena::execute method may enqueue the functor instead of executing it on the calling thread. Because of the loop on line #24, this can deadlock if that task is executed by the first thread (the one that called the run_and_wait method). To avoid this behavior, the task_arena should have enough capacity (max_concurrency) to allow all threads to join.

In this example everything uses the defaults, so the arena concurrency is the same as the number of TBB worker threads (plus the main thread). Do you observe issues with this default behavior?

Regards,
Alex

Alexei_K_Intel · 08-21-2018

Additional information can be found in the blog article about work isolation.