Derek_G_
Beginner
190 Views

wait_for_some() ?

Hello all!

I'm currently using the pattern with the "dummy" (empty) task to be able to spawn() many (millions) of worker tasks from the main thread.  This is working excellently except for one small issue:

After I finish spawning all of those (and actually during the spawning) I have quite a bit of other work to do on the main thread... but not enough to keep the main thread 100% occupied.  Because of this I'm losing some efficiency.

What I would like to be able to do is "lend" the main thread to the thread pool for a short while.  Either based on time - or based on a number of tasks I would allow the main thread to do... or something.  Like the title says: I would like to do "wait_for_some()" :-)

Currently, I just live with the inefficiency... letting the main thread do the "other" work it can until all of the tasks are complete (which I know through task-related counters) and then I exit.

Anyone have some ideas?

BTW: The work is _highly_ asynchronous.  Every task is _completely_ independent.  Some tasks actually spawn more tasks... but even _those_ tasks are completely independent from the tasks that spawned them!  At any time the main thread can also "find" more work to do... and create even more tasks (that's the primary thing the main thread is doing).  I have no idea up front how much work there is to do.

Thanks for any help you can give me!

15 Replies
RafSchietekat
Black Belt

You probably shouldn't be spawning millions of tasks at once if you care about performance. Parallel overhead is real. Use recursive parallelism. Use algorithms that use recursive parallelism.

You probably won't be able to do that with the scheduler, unless you do something with arenas (otherwise you might not know when TBB will let you go again). An idea: don't just drop the work into the scheduler all at once as tasks, but rather keep it in a container as objects describing the work, and then access that either from worker threads (a single processor task could use a TBB algorithm to do that), or from the main thread (whenever it wants to chip in).
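A rough sketch of that container idea (all names here are hypothetical, and I'm using a plain mutex-protected deque just so the sketch is self-contained — in real TBB code a tbb::concurrent_queue would avoid the explicit lock):

```cpp
#include <deque>
#include <mutex>

// Hypothetical description of one piece of work.
struct WorkItem { int payload; };

// Shared container of work descriptions; tbb::concurrent_queue would
// remove the need for the explicit mutex.
std::deque<WorkItem> pending;
std::mutex pendingMutex;

void push_work(WorkItem w) {
    std::lock_guard<std::mutex> lock(pendingMutex);
    pending.push_back(w);
}

// Process up to maxItems items; callable from worker tasks *or* from the
// main thread whenever it wants to chip in. Returns the number processed.
template <typename Fn>
int drain_some(int maxItems, Fn process) {
    int done = 0;
    while (done < maxItems) {
        WorkItem w;
        {
            std::lock_guard<std::mutex> lock(pendingMutex);
            if (pending.empty()) break;
            w = pending.front();
            pending.pop_front();
        }
        process(w);
        ++done;
    }
    return done;
}
```

The main thread would call drain_some(N, f) with a small N whenever it has spare cycles — which is effectively the "wait_for_some()" behavior: the amount of work it takes on is bounded by N, and control always comes back.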

 

Derek_G_
Beginner

It is coping with the large number of tasks now... and the speedup is good.  Just being greedy and wanting a little more :-)

To give you an idea... I actually have billions of tasks to run... but I currently have them spread out across about 10,000 processors (384 nodes at 24 cores each).  All processors are generating work and filling buffers that spill over onto the other nodes dynamically (that part is MPI).  It's all fully unstructured, asynchronous communication patterns.

If everything is working properly then any one node should probably have around 100,000 to 1M tasks alive on it at a time.  It's a lot... but it's working.

Too bad about getting the master processor in on the game.  I realize that I could write my own task queuing system... but that really kind of defeats the purpose of using TBB :-)

Thanks for the reply!

Alexei_K_Intel
Employee

I concur with Raf that it is better to use recursive parallelism. The simplest algorithm for this is parallel_for; however, you need to know the number of iterations (tasks) in advance. Perhaps you want to use parallel_do, which allows adding tasks "on the fly". But both of these algorithms are blocking, i.e., the calling thread will wait for their completion. To overcome this limitation you may want to use task_group, which is a simple interface for task spawning and waiting. So the algorithm can be:

tbb::task_group tg;

// Spawn a task that spawns millions of other tasks
tg.run( [] {
    tbb::parallel_for( 0, NumberOfTasks, []( int i ) {...} );
} );

// Do other work on the main thread

// Let the main thread join the task group
tg.wait();

It should be noted that if your machine has only one hardware thread, the body passed to the run method will not be processed until the main thread joins the task group with the wait method (because Intel TBB will have no workers and only the main thread will process the tasks).

Perhaps you need to use tbb::simple_partitioner with tbb::parallel_for to guarantee that each iteration of the loop is processed as a separate task (but why do you really need this guarantee?).

Derek_G_
Beginner

Alex, thanks for the reply!

It sounds like task groups might be what I need.  I can fill them up for a while using the main thread... and then have the main thread "join in" on the fun until that group is taken care of... then go back to filling with more tasks again.  That sounds doable.

Unfortunately I can't use parallel_for.  I have nothing to loop over... the work is all generated on the fly.  I tried a previous incarnation where I generated the work into vectors (I actually tried multiple different containers) so that I could parallel_for over it... and it was MUCH slower than just generating tasks on the fly along with the work.  With this many pieces of work... just moving them around takes time... so any time you're not doing things in parallel you're losing ground.

Oh: I don't "need" to assign one task to one piece of work... but like I said... if I try to agglomerate the tasks into storage containers I end up "wasting time" just moving them around.

I definitely think I can "batch process" these tasks a bit using task_groups.  I'll give that a try soon!
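Concretely, the batch-and-join loop I have in mind would be something like the sketch below.  tbb::task_group is the natural TBB vehicle (tg.run per piece of work, tg.wait to join in); purely so the sketch is self-contained, std::async stands in for tg.run here, and process/run_in_batches are made-up names:

```cpp
#include <atomic>
#include <future>
#include <vector>

std::atomic<int> processed{0};

void process(int item) { processed += item; }  // hypothetical per-item work

// Fill a batch of `batchSize` tasks, then have the calling thread join in
// until the batch completes; repeat `numBatches` times. With TBB, the vector
// of futures would be a tbb::task_group and the f.get() loop a single
// tg.wait() (during which the main thread also executes tasks).
void run_in_batches(int numBatches, int batchSize) {
    for (int b = 0; b < numBatches; ++b) {
        std::vector<std::future<void>> group;      // stand-in for tbb::task_group
        for (int i = 0; i < batchSize; ++i)        // stand-in for tg.run(...)
            group.push_back(std::async(std::launch::async, process, 1));
        for (auto& f : group) f.get();             // stand-in for tg.wait()
        // ...the main thread can generate more work here before the next batch...
    }
}
```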

Thanks again!

 

Alexei_K_Intel
Employee

I've drafted a simple example of how a thread may join task processing for a limited time. In the example, the main thread is allowed to process only 50 tasks before returning to its other work.

#include <iostream>

#include "tbb/task.h"
#include "tbb/task_group.h"
#include "tbb/spin_mutex.h"
#include "tbb/atomic.h"
#include "tbb/enumerable_thread_specific.h"

int main() {
    // Counter of processed tasks by the main thread
    int processedMain;
    // Create an empty task for waiting
    tbb::task &waitTask = *new (tbb::task::allocate_root()) tbb::empty_task();
    // Set the reference count to more than 1 to force the waiting thread to search for tasks.
    waitTask.set_ref_count( 2 );

    // Create a callback that will "release" the waiting thread when a condition is fulfilled.
    processedMain = 0;
    auto callback = [&processedMain, &waitTask] {
        // "Release" the main thread when 50 tasks are processed.
        if ( ++processedMain == 50 ) waitTask.set_ref_count(1);
    };
    tbb::enumerable_thread_specific<decltype(&callback)> ets;
    // Store the callback's address in ets. Only the main thread will "see" this address; other threads will see NULL.
    ets.local() = &callback;

    tbb::task_group tg;
    tbb::spin_mutex mutex;
    tbb::atomic<int> processedTotal;
    processedTotal = 0;
    // Run 1000 tasks.
    for ( int i = 0; i < 1000; ++i ) {
        tg.run([&processedTotal, &processedMain, &mutex, &waitTask, &ets] {
            // Call a callback from the main thread.
            auto f = ets.local();
            if ( f ) (*f)();

            int n = processedTotal++;
            // "Release" the main thread when all work is done.
            if ( n+1 == 1000 ) waitTask.set_ref_count(1);
            // Do some useful work.
            for ( volatile int i = 0; i < 10000000; ++i );
            tbb::spin_mutex::scoped_lock l( mutex );
            std::cout << "\rProcessedTotal: " << n << " ProcessedMain: " << processedMain;
        } );
    }

    // Let the main thread process some number of tasks.
    waitTask.wait_for_all();
    {
        tbb::spin_mutex::scoped_lock l( mutex );
        std::cout << "\t\t\tMain thread finished";
    }

    // Do useful work on the main thread.
    while ( processedTotal != 1000 ) ;

    // We need to call 'wait' to prevent a race with the last task.
    tg.wait();

    // The waiting task must be destroyed explicitly.
    tbb::task::destroy( waitTask );

    std::cout << std::endl << "Done" << std::endl;

    return 0;
}

The approach should be used with caution if nested parallelism is present. The waiting thread cannot finish task processing while it is on a nested level. Moreover, it may end up processing outer-level tasks at a nested level.

RafSchietekat
Black Belt

This is about #4 and #5: Hey, what is being discussed here is letting the main thread join in until everything is processed. That's not the same as lending it out with "wait_for_some()" (and returning to some higher-responsibility business before all pending work is finished)! If any of those tasks generate more work, the main thread will try to steal it, and there's nothing you can do about that!

This is about #6: I see that with the code in the previous posting the main thread is only going to process a fixed number of the original pieces of work. But now you have to be careful if any additional work is being generated. Let's say that the main thread uses a parallel algorithm. Let's say that a worker thread steals some of the algorithm's work. The main thread is now blocked, and will try to steal work from a worker thread. And then you don't know where it ends. Is there other work on the scheduler? You might be stuck in there until the whole original batch of 1000 tasks is processed, or even longer. Is there no other work? You could theoretically still end up processing more than 50 original pieces' worth of work before you're through, especially if the task_group allows stealing more than one piece of work at a time, which I would assume to be the case, but I have not looked at the code yet. I would say there are no guarantees, and you either have to try, or reason very carefully about what is likely to happen, and maybe you only care about averages anyway. Does that sound plausible, or did I overlook something?

If you reserve room in your containers there's no reason for that solution to be slower. In fact, if you put each bit of work in a task, that's a call to the scalable allocator each time, and each task has to be dispatched separately. Not that TBB's allocator is slow, but certainly slower than adding an element to a vector that has room to grow. Maybe you should try something hierarchical to avoid reallocation, e.g., a vector of vectors (I'm assuming C++11 move semantics, otherwise you may have to work around that).
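For example, something hierarchical like this (a sketch only; the names and the chunk size are made up):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical work description.
struct WorkItem { int payload; };

// Hierarchical storage: a vector of fixed-capacity chunks. Because each chunk
// is reserved up front and never grows, appending never reallocates, so
// already-stored items are never copied or moved; the outer vector only grows
// by moving cheap chunk handles (C++11 move semantics).
class ChunkedWork {
    static const std::size_t kChunk = 1024;     // arbitrary chunk capacity
    std::vector<std::vector<WorkItem>> chunks_;
public:
    void push(WorkItem w) {
        if (chunks_.empty() || chunks_.back().size() == kChunk) {
            chunks_.emplace_back();
            chunks_.back().reserve(kChunk);     // reserve so this chunk never reallocates
        }
        chunks_.back().push_back(w);
    }
    std::size_t size() const {
        std::size_t n = 0;
        for (const auto& c : chunks_) n += c.size();
        return n;
    }
};
```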

Derek_G_
Beginner

Thank you both for the interesting discussion... and especially for the VERY interesting piece of code.

Like Raf I think there are some interesting edge cases in that piece of code.  But it still could be useful.  With so much work to do there is plenty of time for the master thread to get "taken" by some of the work for a while... as long as it doesn't go away forever!  Maybe it would be possible to also take some sort of time / counter into account for that callback as well?  Some sort of "safety net" that would allow it to bail out if it's been doing work too long?

 

RafSchietekat
Black Belt

You could only (theoretically) have a time limit on stealing, but it could still leave the main thread blocked indefinitely if any of its dependencies were blocked somewhere on another thread's stack. It's the same problem, whether your criterion is number of original tasks started/finished or time spent.

You can certainly try Alex' proposal, and it might very well work for you if you can live without guarantees, but I don't think it's the best solution.

Alexei_K_Intel
Employee

Raf, you are right that using any blocking algorithm (nested parallelism) can lead to blocking the main thread until all work is done. It is better to say that the main thread will process at least 50 tasks, but we cannot guarantee exactly 50.

Derek, to tell the truth, I am not sure that all of this is worth implementing. Even Raf's simplest suggestion, with a container of tasks, requires significant reworking of the code. However, in the best case the maximum speedup is 24/23 (less than 5%). Have you tried to find hotspots with Intel VTune Amplifier (or similar tools)? Perhaps there exist some other places that can be easily improved. How long is one task? If the tasks are really long, perhaps oversubscription can extract the last drop of performance. You can create NUM_CORES+1 threads and send the main thread to sleep (with a condition variable, timer, or something else) when there is nothing to do.
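The sleeping part could be as simple as the following sketch with a std::condition_variable (all names here are hypothetical):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool workAvailable = false;   // set by whatever produces work for the main thread

// Called by a producer (e.g., an MPI progress callback) to wake the main thread.
void signal_work() {
    { std::lock_guard<std::mutex> lock(m); workAvailable = true; }
    cv.notify_one();
}

// Main thread: sleep until work arrives or `timeout` elapses, leaving the
// extra hardware thread to the workers meanwhile. Returns true if woken for
// work, false on timeout (so the caller can poll anyway).
bool sleep_until_work(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(m);
    bool woken = cv.wait_for(lock, timeout, [] { return workAvailable; });
    workAvailable = false;
    return woken;
}
```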

RafSchietekat
Black Belt

I don't think that "the tasks [can be] really long" if Derek "actually [has] billions of tasks to run". :-)

If you can allow oversubscription (not the TBB notion, but at the OS level, and that is not a given in the HPC world), it does seem like a plausible solution to just treat the main thread as "extra". It could do communication, enqueue work (maybe in a separate task_arena), and go to sleep again. Running it at a higher priority seems appropriate. A minimalist approach would enqueue asynchronous "talk to MPI" tasks.

 

Derek_G_
Beginner

I completely agree that I probably have bigger problems than this in the code!  I definitely plan to attend to those first... I just wanted to throw this question out there to generate some discussion on this topic.  I really appreciate the input!

As for how long the tasks are... they are pretty short (most of the time).  We measure them in nanoseconds... I can tell you that for some runs I was doing over the last couple of days the individual tasks were averaging ~3000-5000 nanoseconds each.  Note: the tasks are _very_ unstructured in their memory accesses.  There is absolutely zero cache locality to exploit (think about traversing a (very) large graph data-structure from random starting positions and going in random directions).

I may try oversubscription... I was thinking that I could just call sched_yield on that master thread periodically (or sleep like you say).  Would that play well with TBB?  If it helps: my OS is Linux and I'm on a pretty standard Intel-based HPC architecture (nothing exotic).

Thanks again for the replies!

jimdempseyatthecove
Black Belt

If you oversubscribe, I've found that sched_yield often results in some threads never getting an opportunity to run, whereas sleep(0) works. Therefore, if you experience non-progress on some tasks, consider using sleep(0).

Jim Dempsey

RafSchietekat
Black Belt

I get the impression that you might gain more performance by having each task do more work than by worrying about that main thread (on this hardware anyway). I think that the advice is to aim for about 100k clock cycles. Less means more parallel overhead, and more is only a problem if it doesn't give you enough parallel slack to keep everything busy right until the end of a computation. It's a very shallow curve, though, so we're talking orders of magnitude rather than a precise number. Still, 10k or so clocks seems to be on the low side.

Derek_G_
Beginner

Thanks Raf - I'll keep that in mind.  I knew there was a lower bound for the size of work you wanted to parcel out as a task... but I had never seen any guidance on it.  Good to know it's in the 100k clock cycles range... I'll squirrel that number away in the back of my brain :-)

There are a few ways I can get more work per task... so I'll investigate that a bit...

RafSchietekat
Black Belt

The guidance is in Developer Guide > Parallelizing Simple Loops > parallel_for > Controlling Chunking. Note that the graph is in terms of numbers of iterations in a task, not numbers of clock cycles, and it's not even clear that "For example, if a single iteration takes 100 clocks" applies to that particular graph, but the text does say: "A rule of thumb is that grainsize iterations of operator() should take at least 100,000 clock cycles to execute." Note also that grainsize is the maximum chunk size before splitting, so average chunk size is less than that with simple_partitioner, but I'm not currently sure whether a single task wouldn't also be executing several of those chunks.

Of course it's a different context, and observed with different hardware, and an earlier release. Apparently the curve is also really very shallow, and maybe you won't see big problems even with those very small tasks. But you may still be able to gain back (far?) more than the kind of performance that's lost by one thread not pulling its weight on a 24-core node (let's say 48 hyperthreads, then that's somewhat less than 1/48, so with 2% less parallel overhead you would be there already).
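The rule of thumb reduces to simple arithmetic: if one iteration costs roughly C clock cycles, a grainsize of about 100,000/C iterations puts each chunk at the recommended ~100k cycles. A tiny helper (hypothetical; the 100k target is the rule of thumb quoted above, and precision doesn't matter since the curve is shallow):

```cpp
// Rough grainsize estimate from the ~100,000-cycle rule of thumb.
// cyclesPerIteration is only an estimate; order of magnitude is what counts.
long grainsize_for(long cyclesPerIteration) {
    const long kTargetCycles = 100000;     // rule-of-thumb cost per chunk
    long g = kTargetCycles / cyclesPerIteration;
    return g > 1 ? g : 1;                  // never below one iteration
}
```

For tasks of ~3000-5000 ns on, say, a 3 GHz core (roughly 9k-15k cycles each), this suggests agglomerating on the order of ten pieces of work per task.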
