Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Scheduler per process?

ndepalma
Beginner
432 Views
Hi,
I know that task_scheduler_init has virtually no overhead, so this seems like a non-issue, but for a particular problem where the parallel section takes only about 1 ms to execute, task_scheduler_init noticeably slows down the execution. I'm using the commercial version of TBB, and after looking into the OSS version I noticed a TlsAlloc/TlsGetValue on the scheduler pointer, so I have very little hope that the task_scheduler_init overhead will easily go away.

I've looked throughout the reference/tutorial and they all state that I should initialize at the beginning of the thread. Any ideas? hacks?

~Nick
8 Replies
AJ13
New Contributor I
So if the time taken to execute a parallel section is 1ms, why are you parallelizing it? Do you have a large number of these items to run? In that case you can group the jobs into a container, and use a parallel_for to run over all jobs and perform them in parallel.

The TBB documentation makes clear that parallelism has a cost; there is no free lunch. However, improving the scheduler is always a good idea if you find performance improvements that could be incorporated.
ndepalma
Beginner
Yes, I completely agree. I think this is a borderline non-issue. For my particular case, as a plugin, I don't know how many times this function will be called, so I can't use a container. Sometimes it is called 1,000,000 times, and sometimes just once :-/

I will probably just not make it parallel (serial will be faster than parallel plus task_scheduler_init), but this kills future scalability.


AJ13
New Contributor I
Can you share what you are attempting to do, at least at a high level? Then I can probably help you more; I don't have enough information to provide advice of any value.
Alexey-Kukanov
Employee

If you can't ensure the application that uses your plugin also uses TBB on its own, you might try to initialize TBB worker threads when your plugin is initialized, and destroy them when the plug-in is destroyed/unloaded. For example, you might make task_scheduler_init a DLL-wide object with deferred initialization, and call task_scheduler_init::initialize() at the first call to your method.

Another consideration is that the approach to parallelize a short method called thousands of times is too fine-grained and in general won't give good speedup and scalability. Parallelizing the caller (if possible) would probably give much more benefit.

AJ13
New Contributor I
I might have misunderstood the docs... if you do a parallel_for over a very large number of small calls, wouldn't the auto_partitioner() or affinity_partitioner() automatically start dividing the work to use available processors?

So if I have 1,000,000,000 small calls... and 8 processors... I should be able to just do a parallel_for over a blocked_range<> of that type, and let the partitioner do its job to parallelize it?
ndepalma
Beginner
Yes, actually every call into my plugin comes from a separate thread, so I can't initialize the scheduler on init. That was my first idea, but I was quickly disabused of it when I found out the program calls my plugin from a separate thread every time. Maybe for very small entry points the benefits of multicore just don't pay off over time.
ndepalma
Beginner
To finish off this thread, I will explain my final decision. It turns out the application I am writing a plugin for calls into my plugin from a different thread every time. At least that was what I thought, but it is in fact using a thread pool, which means the ~1.3 ms penalty only happens on the first run of any "new thread" from my plugin's perspective. I came up with this benchmark:
#include <cstdio>
#include <tchar.h>
#include <windows.h>
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
using namespace tbb;

// These constants were not shown in the original post; NUM_RUNS = 10000
// matches the measurement below, the other two values are illustrative.
#define NUM_RUNS 10000
#define NUMELEMENTSTOADD 1000000
#define USECORES 8

class somefor {
public:
    int* const sum;
    somefor(somefor& irule, split whoadude) : sum(irule.sum) {}
    somefor(int* _sum) : sum(_sum) {}
    ~somefor() {}
    void operator()(const blocked_range<int>& range) const {
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    int sum;
    LARGE_INTEGER ticksPerSecond;
    QueryPerformanceFrequency(&ticksPerSecond);

    LONGLONG createSum = 0;
    LONGLONG destroySum = 0;
    LARGE_INTEGER startTimer, endTimer;

    for (int i = 0; i < NUM_RUNS; i++) {
        QueryPerformanceCounter(&startTimer);
        task_scheduler_init task(task_scheduler_init::automatic);
        QueryPerformanceCounter(&endTimer);
        createSum += endTimer.QuadPart - startTimer.QuadPart;

        // do something (note: this parallel_for time is also folded into createSum)
        somefor something(&sum);
        blocked_range<int> range(0, NUMELEMENTSTOADD, NUMELEMENTSTOADD / USECORES);
        QueryPerformanceCounter(&startTimer);
        parallel_for(range, something);
        QueryPerformanceCounter(&endTimer);
        createSum += endTimer.QuadPart - startTimer.QuadPart;

        QueryPerformanceCounter(&startTimer);
        task.terminate();
        QueryPerformanceCounter(&endTimer);
        destroySum += endTimer.QuadPart - startTimer.QuadPart;
    }

    double perRun;
    perRun = (double)createSum / (double)ticksPerSecond.QuadPart; // seconds per run
    perRun /= (double)NUM_RUNS;
    perRun *= 1000.0;
    printf("%f ms ", perRun);

    perRun = (double)destroySum / (double)ticksPerSecond.QuadPart; // seconds per run
    perRun /= (double)NUM_RUNS;
    perRun *= 1000.0;
    printf("%f ms ", perRun);
    return 0;
}

So this gives me these results for NUM_RUNS = 10000:

Creation: 0.054686 ms
Destruction: 0.062370 ms
and for a single run:
Creation: 1.157677 ms
Destruction: 0.070252 ms

So it looks like TBB caches what it needs per thread.

One question: Does TBB destroy the resources it caches properly?...
Alexey-Kukanov
Employee

Some part of initialization is done only once for TBB. You can see this part in the DoOneTimeInitializations function. It also includes initialization of thread-local storage.

Another part, however, is done every time the first task_scheduler_init object is created, and undone every time the last such object is destroyed. So in your test, part of the initialization is performed at every iteration, and this part includes creation of the pool of worker threads. The good thing is that the worker threads are created mostly asynchronously: the first thread is created during task_scheduler_init construction, then it creates two more threads, those create more, and so on.

An even smaller part is done each time a task_scheduler_init object is created by a thread that does not have such an object alive. Basically, this part just initializes the objects necessary for the calling thread.

And in the case where a task_scheduler_init object is created by a thread that already has such an object alive, it is just a matter of incrementing a reference counter.

For your case, I would still recommend having a global task_scheduler_init object that is initialized at plug-in initialization or at the first call, and destroyed at plug-in destruction or unload. This way you will create a pool of TBB worker threads and keep it for all the time you might need it. Then, in every new call, create a local task_scheduler_init object to ensure the calling thread can use TBB.
