Though, as Robert noticed, RTFM never hurts, the effects of task_scheduler_init are indeed not as obvious as they might seem, especially if you are concerned about the actual actions that take place under the hood. I think I'll have to write a blog post about the current state of TBB initialization and deinitialization to cover all the nuances and pitfalls. Meanwhile, here are the basic details of what task_scheduler_init does (and you could also have a look at this blog).
When the first task_scheduler_init object is instantiated in the process, it initializes the internal thread pool manager, but no threads are created yet.
When the same thread repeatedly creates more task_scheduler_init objects, the extra instantiations are essentially no-ops.
When another thread instantiates its first task_scheduler_init object, it attaches to the existing thread pool manager, but it can specify a different concurrency level (see the aforementioned blog for more details).
Thus task_scheduler_init creation has both global and local effects.
If a thread executes an operation that requires the scheduler to be initialized (creates a task, invokes a parallel algorithm, etc.), but no task_scheduler_init object exists in this thread, the scheduler will be initialized automatically. In this case its concurrency level will be the default (that is, the maximal parallelism supported by the hardware), and (what may be extremely important in some cases) such a scheduler instance will exist until its thread terminates.
TBB worker threads are created when the first task is spawned.
Class task_group is just a convenient abstraction built on top of the TBB tasking API. In this regard it is close to the TBB parallel algorithms. Thus there is no relation between the number of task groups and the number of task_scheduler_init instances.
As a general remark, if your problem maps well onto one of the TBB parallel algorithms, you'll always be better off using them than task groups or raw tasks directly.
The overhead of task creation is amortized and is normally a few dozen clock ticks. But you may need to add the cost of task spawning (and retrieval from the task pool), which may vary from a few dozen to a few hundred clock ticks depending on many factors (uniformity of your data range, concurrency level, etc.). There are a number of techniques described in the Tutorial (e.g. various kinds of task recycling) that can be used to decrease the cost of task manipulations. In particular, the TBB parallel algorithms rely on them heavily, which is what makes them so efficient in most cases.