I know that task_scheduler_init has very little overhead, so this seems like a non-issue, but for a particular problem where the parallel section itself takes only about 1 ms, task_scheduler_init noticeably slows down execution. I'm using the commercial version of TBB; I've also looked into the OSS version and noticed a TlsAlloc/TlsGetValue on the scheduler pointer, so I have very little hope that the task_scheduler_init overhead will easily go away.
I've looked through the reference and tutorial, and they both state that I should initialize at the beginning of the thread. Any ideas? Hacks?
~Nick
The TBB documentation is clear that parallelism has a cost; there is no free lunch. However, improving the scheduler is always a good idea, so if you find performance improvements that could be incorporated, please share them.
I will probably just not make it parallel (serial will be faster than parallel plus task_scheduler_init), but this kills future scalability.
If you can't ensure that the application using your plug-in also uses TBB on its own, you might try to initialize the TBB worker threads when your plug-in is initialized, and destroy them when the plug-in is destroyed or unloaded. For example, you could make task_scheduler_init a DLL-wide object with deferred initialization, and call task_scheduler_init::initialize() on the first call into your method.
Another consideration: parallelizing a short method that is called thousands of times is too fine-grained, and in general won't give good speedup or scalability. Parallelizing the caller (if possible) would probably give much more benefit.
So if I have 1,000,000,000 small calls and 8 processors, I should be able to just do a parallel_for over a blocked_range<> of that type and let the partitioner do its job?
#include <windows.h>
#include <cstdio>
#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

const int NUM_RUNS = 10000;           // values assumed; defined elsewhere in the original
const int NUMELEMENTSTOADD = 1000000; // assumed
const int USECORES = 8;               // assumed

class somefor {
public:
    int * const sum;
    somefor(somefor& irule, split) : sum(irule.sum) {} // splitting constructor (not needed by parallel_for, but harmless)
    somefor(int *_sum) : sum(_sum) {}
    ~somefor() {}
    void operator()(const blocked_range<int>& range) const {
        // empty body: we only measure scheduler overhead here
    }
};

int _tmain(int argc, char* argv[])
{
    int sum = 0;
    LARGE_INTEGER ticksPerSecond;
    QueryPerformanceFrequency(&ticksPerSecond);
    LONGLONG createSum = 0;
    LONGLONG destroySum = 0;
    LARGE_INTEGER startTimer, endTimer;
    for (int i = 0; i < NUM_RUNS; i++) {
        QueryPerformanceCounter(&startTimer);
        task_scheduler_init task(task_scheduler_init::automatic);
        QueryPerformanceCounter(&endTimer);
        createSum += endTimer.QuadPart - startTimer.QuadPart;
        // do something
        somefor something(&sum);
        blocked_range<int> range(0, NUMELEMENTSTOADD, NUMELEMENTSTOADD / USECORES);
        QueryPerformanceCounter(&startTimer);
        parallel_for(range, something);
        QueryPerformanceCounter(&endTimer);
        createSum += endTimer.QuadPart - startTimer.QuadPart;
        QueryPerformanceCounter(&startTimer);
        task.terminate();
        QueryPerformanceCounter(&endTimer);
        destroySum += endTimer.QuadPart - startTimer.QuadPart;
    }
    double perRun;
    perRun = (double)createSum / (double)ticksPerSecond.QuadPart;  // seconds per run
    perRun /= (double)NUM_RUNS;
    perRun *= 1000.0;
    printf("Creation: %f ms\n", perRun);
    perRun = (double)destroySum / (double)ticksPerSecond.QuadPart; // seconds per run
    perRun /= (double)NUM_RUNS;
    perRun *= 1000.0;
    printf("Destruction: %f ms\n", perRun);
    return 0;
}
So this gives me these results for NUM_RUNS = 10000:
Destruction: 0.062370 ms
and for a single run:
Creation: 1.157677 ms
Destruction: 0.070252 ms
So it looks like TBB caches what it needs per thread.
One question: Does TBB destroy the resources it caches properly?...
Some part of the initialization is done only once for TBB; you can see this part in the DoOneTimeInitializations function. It includes initialization of thread-local storage.
Another part, however, is done every time the first task_scheduler_init object is created, and undone every time the last such object is destroyed. So in your test, this part of the initialization is performed at every iteration, and it includes creation of the pool of worker threads. The good news is that the workers are created mostly asynchronously: the first thread is created during task_scheduler_init construction, then it creates two more threads, those create more, and so on.
An even smaller part is done each time a task_scheduler_init object is created by a thread that does not already have such an object alive; basically, it just initializes the objects needed by the calling thread.
And when task_scheduler_init is created by a thread that already has such an object alive, it's just a matter of incrementing a reference counter.
For your case, I would still recommend having a global task_scheduler_init object that is initialized at plug-in initialization (or on the first call) and destroyed at plug-in destruction or unload. This way you create the pool of TBB worker threads once and keep it for as long as you might need it. Then, in each call, create a local task_scheduler_init object to ensure the calling thread can use TBB.
