1) What does task_scheduler_init really do? Does it start/restart X threads that wait for incoming tasks in a task queue, or does it do anything else?
2) How many task_scheduler_init objects can an application have? Is it per thread or per application?
3) What is the relationship between task_scheduler_init and task_group? I would assume one task_scheduler_init can be associated with many task_groups, but a task_group can relate to only one task_scheduler_init. Is that right? If so, what is a good way to balance them in practice?
4) If there is more than one task_scheduler_init, will they compete for underlying resources such as CPUs?
5) What is the overhead of creating a task?
Thanks in advance!
Even easier. As of (I think) Intel TBB 2.2 or so, you don't even need to create your own task_scheduler_init objects. Current releases create the needed task_scheduler_init object automatically in the background. The only reason a developer using a current Intel TBB version would need to create task_scheduler_init objects explicitly is to play some sort of thread pool size magic: for example, forcing a thread pool to be larger or smaller than the nominal one software thread per hardware thread. You might manage your own object experimentally to test the resource demands of your application, or to provide a means of adjusting thread pool size for such purposes.
The whole point of this architecture is to reduce overhead: TBB task creation is a much less expensive operation than OS thread creation, so Intel TBB amortizes the cost of thread creation by creating one pool that gets reused over the lifetime of the process. It's definitely more efficient to schedule work as a set of Intel TBB tasks than it is to create and destroy threads (à la pthreads, etc.) as demand dictates.
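To make the amortization argument concrete, here is a rough sketch in plain standard C++ (no TBB involved). Everything in it (`SimplePool`, `sum_with_pool`, `sum_with_thread_per_task`) is a made-up name for illustration, and the single locked queue is only an assumption for brevity; TBB's real scheduler is considerably more sophisticated. It just contrasts paying for an OS thread per unit of work with reusing a fixed set of workers.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical minimal worker pool: threads are created once and reused.
class SimplePool {
public:
    explicit SimplePool(unsigned workers) {
        assert(workers > 0);
        for (unsigned i = 0; i < workers; ++i)
            threads_.emplace_back([this] { worker_loop(); });
    }

    // Lets the workers drain the remaining jobs, then joins them.
    ~SimplePool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }

    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(m_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void worker_loop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to run
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};

// One OS thread per unit of work: thread creation cost is paid n times.
long sum_with_thread_per_task(int n) {
    std::atomic<long> total{0};
    for (int i = 1; i <= n; ++i) {
        std::thread t([&total, i] { total += i; });
        t.join();
    }
    return total.load();
}

// Same work as lightweight jobs on a pool: thread creation cost is paid
// `workers` times, regardless of n.
long sum_with_pool(int n, unsigned workers) {
    std::atomic<long> total{0};
    {
        SimplePool pool(workers);
        for (int i = 1; i <= n; ++i)
            pool.submit([&total, i] { total += i; });
    }  // pool destructor waits for all submitted jobs
    return total.load();
}
```

Both functions compute the same sum; timing them against each other for a large `n` should show the per-thread version falling behind, though the exact ratio depends on the OS and hardware.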
Though, as Robert noted, RTFM never hurts, the effects of task_scheduler_init are indeed not as obvious as they might seem, especially if you are concerned about what actually happens under the hood. I think I'll have to write a blog about the current state of TBB initialization and deinitialization to cover all the nuances and pitfalls. Meanwhile, here are the basic details of what task_scheduler_init does (and you could also have a look at this blog).
When the first task_scheduler_init object is instantiated in the process, it initializes the internal thread pool manager, but no threads are created.
When the same thread repeatedly creates more task_scheduler_init objects, this is essentially a no-op.
When another thread instantiates its first task_scheduler_init object, it attaches to the existing thread pool manager, but can specify a different concurrency level (see the aforementioned blog for more details).
Thus task_scheduler_init creation has both global and local effects.
If a thread executes an operation that requires the scheduler to be initialized (creates a task, invokes a parallel algorithm, etc.), but no task_scheduler_init exists in this thread, the scheduler will be initialized automatically. In this case its concurrency level will be the default (that is, the maximal parallelism supported by the hardware), and (what may be extremely important in some cases) such a scheduler instance will exist until its thread terminates.
TBB worker threads are created when the first task is spawned.
Class task_group is just a convenient abstraction built on top of TBB tasking API. In this regard it is close to TBB parallel algorithms. Thus there is no relation between the number of task groups and task_scheduler_init instances.
As a general remark, if your problem maps well to one of the TBB parallel algorithms, you'll always be better off using them than using task groups or tasks directly.
The overhead of task creation is amortized and is normally a few dozen clock ticks. But you may need to add the cost of task spawning (and retrieval from the task pool), which may vary from a few dozen to a few hundred clock ticks depending on many factors (uniformity of your data range, concurrency level, etc.). There are a number of techniques described in the Tutorial (e.g. various kinds of task recycling) that can be used to decrease the cost of task manipulations. In particular, TBB parallel algorithms rely on them heavily, which is what makes them so efficient in most cases.