Intel® oneAPI Threading Building Blocks

More threads created than expected with task_scheduler_init

e4lam
Beginner
It appears that in TBB 3.0, giving tbb::task_scheduler_init() a thread count less than the default number of threads still results in it creating the default number of threads. Debugging the code somewhat, I narrowed it down to market::global_market(), where we have this line:
max_num_workers = max( governor::default_num_threads() - 1, max_num_workers );

Is this line necessary? If so, what do I do if I want to bypass this behaviour? Is recompiling with __TBB_ARENA_PER_MASTER set to 0 a viable alternative?

Thanks,
-Edward

Terry_W_Intel
Employee
Hi Edward,
In what scope are you declaring the task_scheduler_init object? There are a couple of possibilities. One is that some TBB code was reached before that point, so the default number of threads was used and the subsequent task_scheduler_init was ignored. Alternatively, the task_scheduler_init is in some scope that you have exited, in which case TBB will go back to using the default number of threads.

So it would help to see the context in which you are using it.

Cheers,
Terry

Alexey-Kukanov
Employee
Yes, more threads are created now, but the extra threads are not used and should just sleep. So my recommendation for you is to do nothing. If you see any serious problem with the new behavior, please tell us. Thanks!

e4lam
Beginner
Hi Alexey,

The extra threads use up extra stack space so I would like to NOT have them created. What is the best modification of TBB 3.0 to do this? I tried recompiling with __TBB_USE_ARENA_PER_MASTER set to 0 and it seems to do what I expect. Or, should I just comment out that line which I pointed out above? I'm not sure what side effects that might introduce though.

Thanks!

ajclinto
Beginner
Aside from the memory issue, as a software developer I'm used to having direct control over the number of threads that exist for debugging purposes. Seeing 8 threads in a debugger when I've instructed tbb to use only 2 is a little disconcerting and I believe it makes it more difficult to track down threading problems (for example, I recently tried debugging a threaded process involving 2 work threads while 6 threads were idle - making it difficult on a first glance to see which are actually being used). I'm also concerned that users inspecting the process might incorrectly assume that it is actually using all the threads that exist in memory.

Andrew

ARCH_R_Intel
Employee

Assuming an 8 core 2-way hyperthreaded machine, at 2 MByte/stack (our current default for 32-bit machines), that's 32 MByte of virtual address space out of a total of 2048 MByte, about 1.56%. That seems tiny. Though I concur that for machines with larger core counts, perhaps we should be allocating threads lazily.
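
The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the helper names `stack_reservation_mb` and `share_percent` are made up for illustration, and the inputs are the figures quoted in the post (8 cores, 2 hardware threads per core, 2 MByte per stack, 2048 MByte of address space).

```cpp
// Hypothetical helper: virtual address space reserved for thread stacks, in MByte,
// assuming one stack per hardware thread.
constexpr unsigned stack_reservation_mb(unsigned cores,
                                        unsigned hw_threads_per_core,
                                        unsigned stack_mb) {
    return cores * hw_threads_per_core * stack_mb;
}

// Hypothetical helper: the reservation as a percentage of the address space.
constexpr double share_percent(unsigned reserved_mb, unsigned address_space_mb) {
    return 100.0 * reserved_mb / address_space_mb;
}
```

With the post's numbers, `stack_reservation_mb(8, 2, 2)` gives 32 and `share_percent(32, 2048)` gives 1.5625, matching the "about 1.56%" figure. The share grows linearly with the hardware thread count, which is the concern raised for machines with larger core counts.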

Setting __TBB_USE_ARENA_PER_MASTER=0 instantiates the TBB 2.2 scheduler behavior. So you lose some of the protection from deadlock (where multiple master threads get entangled doing each other's work) in exchange for the reduced number of stacks. That might be the best workaround for now, until we figure out how we should address the issue.

ARCH_R_Intel
Employee
ajclinto has an excellent point about debugging. In fact, come to think of it, I was bothered recently by all the extra threads when I was debugging an example and switching between threads in the debugger.

I'm prototyping changes to src/tbb/private_server.cpp to make thread creation lazy. It seems straightforward so far, though I have not gotten to the "fun" part of the tricky shutdown logic. I'll report how it goes when I know more.

e4lam
Beginner
Hi Arch,

Do you know why we have this statement?
max_num_workers = max( governor::default_num_threads() - 1, max_num_workers );

I would have assumed that in a loosely coupled system, there would be no need for this. When re-initializing, I would assume that we just need to cancel all the current worker threads and then start up with a new set of workers. Where/why are we relying on having at least default_num_threads()-1 workers?

ARCH_R_Intel
Employee

The RML interface for acquiring threads requires stating the maximum number of threads up front. This requirement greatly simplifies implementation of the RML on top of other thread managers (e.g. Microsoft ConcRT) that have similar requirements.

TBB 3.0 introduced the notion that each master thread (user-created thread) could have a different upper bound on the number of workers. E.g., consider a machine with 8 hardware threads. If master thread X executes "task_scheduler_init init(5)", we allocate 4 workers (5 minus the master). If a master thread Y executes "task_scheduler_init init(8)" later, we allocate 7 workers, some of which are the same as the workers for thread X.

Because of the "up front" requirement of the RML interface, we request at least "governor::default_num_threads() - 1" up front, since that is the normal worst case. The max accounts for deliberate oversubscription by the first master thread. Oversubscription requests by later threads are ignored.

max_num_workers is a maximum. The RML does not have to deliver this number of threads, since its purpose is to regulate thread usage. So making the RML lazy about delivering threads seems like a good approach. (I almost have my changes to private_server.cpp working.)
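
The sizing rule described above can be modeled in a few lines. This is a simplified sketch, not TBB's actual code: `MarketModel` and `register_master` are hypothetical names, and capping later masters at the global limit is an assumption based on the statement that their oversubscription requests are ignored.

```cpp
#include <algorithm>

// Hypothetical model of the rule described in the post; not the real tbb::internal::market.
class MarketModel {
public:
    // default_num_threads stands in for governor::default_num_threads().
    explicit MarketModel(int default_num_threads)
        : default_num_threads_(default_num_threads) {}

    // requested_threads is the task_scheduler_init argument of a master thread.
    // Returns the number of workers that master may use.
    int register_master(int requested_threads) {
        const int requested_workers = requested_threads - 1;  // minus the master itself
        if (max_num_workers_ < 0) {
            // First master: the global limit is fixed "up front" and may oversubscribe.
            max_num_workers_ = std::max(default_num_threads_ - 1, requested_workers);
        }
        // Assumption: later masters cannot raise the limit fixed by the first one.
        return std::min(requested_workers, max_num_workers_);
    }

    int max_num_workers() const { return max_num_workers_; }

private:
    int default_num_threads_;
    int max_num_workers_ = -1;  // -1 means the limit has not been fixed yet
};
```

On the 8-hardware-thread machine from the example, a first master asking for 5 threads gets 4 workers while the global limit becomes 7; a later master asking for 8 gets 7; a later master asking for 16 is still capped at 7.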

ARCH_R_Intel
Employee
Attached is a patch that addresses the issue. It causes src/tbb/private_server.cpp to lazily allocate threads. Please let me know if it works for you. It passed our unit tests on a few machines that I tried, but has not yet been subjected to nightly testing across all our test platforms.
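
The attachment is not reproduced in this thread. As a rough illustration of the general idea only (this is not the patch and not how private_server.cpp is written), a pool can record its maximum up front but launch a thread only when one is first requested. `LazyWorkerPool` and `ensure_workers` are hypothetical names, the workers here do no real work, and the class is meant to be driven from a single thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sketch of lazy thread creation: the limit is known up front,
// but no thread (and no stack) exists until it is asked for.
class LazyWorkerPool {
public:
    explicit LazyWorkerPool(std::size_t max_workers) : max_workers_(max_workers) {}
    LazyWorkerPool(const LazyWorkerPool&) = delete;
    LazyWorkerPool& operator=(const LazyWorkerPool&) = delete;

    // Join every thread that was actually launched.
    ~LazyWorkerPool() {
        for (auto& worker : workers_) worker.join();
    }

    // Make sure at least `wanted` workers exist, never exceeding the maximum.
    // Returns the number of workers launched so far.
    std::size_t ensure_workers(std::size_t wanted) {
        const std::size_t target = std::min(wanted, max_workers_);
        while (workers_.size() < target) {
            workers_.emplace_back([] { /* a real worker would wait for tasks here */ });
        }
        return workers_.size();
    }

    std::size_t launched() const { return workers_.size(); }

private:
    std::size_t max_workers_;
    std::vector<std::thread> workers_;
};
```

A pool constructed with a maximum of 7 starts with zero threads; asking for 1 worker launches exactly one, so a debugger would show only the threads that were really needed.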

e4lam
Beginner
Hi Arch,

I did some simple tests with the patch and it seems to be working as expected! Thanks very much!

-Edward

e4lam
Beginner

Hi Arch,

I just tried upgrading to the latest stable TBB version, 4.1 update 3. However, it seems that your patch was never incorporated. Is there some reason why we do not want this in the regular release?

$ cat tbb-41-patch-max_workers
diff -urN --strip-trailing-cr tbb40.orig/src/tbb/market.cpp tbb40/src/tbb/market.cpp
--- tbb40.orig/src/tbb/market.cpp       2011-12-15 07:05:00.000000000 -0500
+++ tbb40/src/tbb/market.cpp    2012-03-16 15:43:24.953426500 -0400
@@ -102,9 +102,9 @@
             runtime_warning( "Newer master request for larger stack cannot be satisfied\n" );
     }
     else {
-        max_num_workers = max( governor::default_num_threads() - 1, max_num_workers );
+        max_num_workers = min( governor::default_num_threads() - 1, max_num_workers );
         // at least 1 worker is required to support starvation resistant tasks
-        if( max_num_workers==0 ) max_num_workers = 1;
+        if( max_num_workers<=0 ) max_num_workers = 1;
         // Create the global market instance
         size_t size = sizeof(market);
 #if __TBB_TASK_GROUP_CONTEXT
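
To make the effect of the diff concrete, the two variants can be compared side by side. This is a sketch with hypothetical function names; `default_threads` stands in for governor::default_num_threads(), and `requested_workers` for the incoming value of max_num_workers.

```cpp
#include <algorithm>

// Rule on the '-' lines of the diff: never request fewer than the default.
inline int worker_limit_released(int default_threads, int requested_workers) {
    int limit = std::max(default_threads - 1, requested_workers);
    if (limit == 0) limit = 1;  // at least 1 worker, as in the quoted source comment
    return limit;
}

// Rule on the '+' lines of the diff: never request more than the default.
inline int worker_limit_patched(int default_threads, int requested_workers) {
    int limit = std::min(default_threads - 1, requested_workers);
    if (limit <= 0) limit = 1;  // at least 1 worker, as in the quoted source comment
    return limit;
}
```

With 8 default threads and 1 requested worker, the released rule yields 7 and the patched rule yields 1. With 15 requested workers the released rule yields 15, while the patched rule caps the result at 7, so the patch also removes deliberate oversubscription.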


Wooyoung_K_Intel
Employee

Hi, Edward

Arch's patch to create worker threads lazily has indeed been incorporated and is available in TBB update releases, including TBB 4.1 U3.

I don't know if the changes to market.cpp you quoted were in the patch Arch uploaded here on 5/20/2010. Those changes were not incorporated because we felt they were not needed for the lazy worker thread creation, and we found no compelling reasons to make the changes.

Have you experienced any unexpected behavior with TBB 4.1 U3?

Thanks.

e4lam
Beginner

Sorry, I was mistaken in thinking that the quoted patch was part of Arch's patch, although it is related.

The quoted patch is necessary because otherwise there is no way to force the number of worker threads to be less than the number of cores on the system. This is extremely important in server farms, where a central authority schedules processes to run and needs to impose a resource limit on the scheduled jobs. For debugging purposes, it is also useful to have an easy way to make an application run single threaded, both for comparison purposes and for ease of stepping through in a debugger.

Since this patch seems never to have been submitted (I thought I had submitted all our patches), would you consider it? If so, I can officially submit the patch through the web submission form.

Thanks!

e4lam
Beginner

Sorry again, it's been a long time since I last read Arch's post above regarding the use of max() instead of min(). However, if you don't do this, then I do not think there's an easy switch to use less concurrency than the number of cores, right?

Alexey-Kukanov
Employee

Hi Edward,

While I understand the problem and the motivation for the patch, as the architect* I will not accept it into TBB, because it would preclude some useful scenarios (e.g. resource partitioning between multiple application threads), and would result in backward-incompatible behavior impacting existing applications.

Instead, we will add a mechanism for better control over global TBB settings, including the default number of threads. Most likely, it will be a special "policy" class whose instance affects TBB behavior for its lifetime. So you will be able to specify the desired global concurrency limit for TBB, which will be treated in the same way as HW concurrency currently is. The work is in progress, and I expect this to be released before the end of the year (though it's not a commitment).

Update: one possible way to limit the desired concurrency that works with TBB now is to set a certain process affinity mask that limits the application to only a subset of available HW threads/cores. TBB respects process affinity when it defines the default number of workers, so if you specify which cores/threads should be used, just as many workers will be created. Not sure if this approach is suitable for your use cases, but I think it's worth mentioning anyway.
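
As a rough sketch of the affinity approach, the following shows one way to shrink the mask on Linux before any TBB code runs. It assumes Linux/glibc (`sched_getaffinity` and `sched_setaffinity` with the `CPU_*` macros); the helper names `allowed_cpu_count` and `restrict_to_cpus` are hypothetical, and that TBB then sizes its default worker count from the reduced mask is taken from the post above rather than verified here.

```cpp
#include <sched.h>  // Linux/glibc: sched_getaffinity, sched_setaffinity, CPU_* macros

// Hypothetical helper: number of CPUs the calling thread may currently run on,
// or -1 if the mask cannot be read.
inline int allowed_cpu_count() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return -1;
    return CPU_COUNT(&mask);
}

// Hypothetical helper: keep only the first `limit` CPUs of the current mask.
// Pid 0 means the calling thread; threads created afterwards inherit the mask,
// so call this from the main thread before anything spawns threads.
inline bool restrict_to_cpus(int limit) {
    cpu_set_t current, reduced;
    CPU_ZERO(&current);
    CPU_ZERO(&reduced);
    if (limit < 1 || sched_getaffinity(0, sizeof(current), &current) != 0) return false;
    int kept = 0;
    for (int cpu = 0; cpu < CPU_SETSIZE && kept < limit; ++cpu) {
        if (CPU_ISSET(cpu, &current)) {
            CPU_SET(cpu, &reduced);
            ++kept;
        }
    }
    return sched_setaffinity(0, sizeof(reduced), &reduced) == 0;
}
```

Calling `restrict_to_cpus(2)` at the very start of `main` would leave two CPUs in the mask. The same effect can be had without code changes by launching the process under an affinity tool such as `taskset`.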

* I took over the architect role from Arch some time ago.

e4lam
Beginner

Alexey Kukanov (Intel) wrote:

Instead, we will add a mechanism for better control over global TBB settings, including the default number of threads. Most likely, it will be a special "policy" class whose instance affects TBB behavior for its lifetime. So you will be able to specify the desired global concurrency limit for TBB, which will be treated in the same way as HW concurrency currently is. The work is in progress, and I expect this to be released before the end of the year (though it's not a commitment).

What you describe sounds like a good approach to achieving the same end goal. I'm looking forward to when I don't need this patch anymore! I'll keep the patch in the meantime, since it has been working for us for over 3 years now.

Thanks, Alexey!

jimdempseyatthecove
Honored Contributor III

A suggested modification to Alexey's suggestion is to create a class/struct with a ctor that sets the process affinity mask to the desired number of hardware threads. Create a static object that, when created, calls the ctor and thus sets the desired process affinity mask, then ensure that this static object is loaded first by the linker so that it runs prior to any other static object ctor that might instantiate the TBB thread pool. Once in main, after a TBB init, call an additional member function of the static object to restore the process affinity mask to what it was at the start of the application (or some other value if you so desire).
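
A minimal sketch of such a class might look like the following. It assumes Linux/glibc affinity calls; `AffinityLimiter`, `restore`, and `current_cpu_count` are hypothetical names, and making the static object initialize before every other static object is left to the build setup, since C++ does not fix initialization order across translation units.

```cpp
#include <sched.h>  // Linux/glibc: sched_getaffinity, sched_setaffinity, CPU_* macros

// Hypothetical class: the ctor narrows the affinity mask, restore() puts it back.
class AffinityLimiter {
public:
    explicit AffinityLimiter(int limit) : active_(false) {
        CPU_ZERO(&original_);
        if (limit < 1 || sched_getaffinity(0, sizeof(original_), &original_) != 0) return;
        cpu_set_t reduced;
        CPU_ZERO(&reduced);
        int kept = 0;
        // Keep the first `limit` CPUs that were allowed at startup.
        for (int cpu = 0; cpu < CPU_SETSIZE && kept < limit; ++cpu) {
            if (CPU_ISSET(cpu, &original_)) {
                CPU_SET(cpu, &reduced);
                ++kept;
            }
        }
        active_ = sched_setaffinity(0, sizeof(reduced), &reduced) == 0;
    }

    // Restore the mask saved by the ctor; intended to be called from main
    // once the TBB scheduler has been initialized.
    bool restore() {
        if (!active_) return false;
        active_ = false;
        return sched_setaffinity(0, sizeof(original_), &original_) == 0;
    }

    bool active() const { return active_; }

private:
    cpu_set_t original_;  // mask observed at construction time
    bool active_;         // true while the reduced mask is in effect
};

// Hypothetical helper: CPUs the calling thread may currently run on, or -1 on error.
inline int current_cpu_count() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return -1;
    return CPU_COUNT(&mask);
}
```

Intended use would be a definition such as `static AffinityLimiter limiter(2);` in the translation unit that is initialized first, followed by `limiter.restore();` in `main` after the scheduler has been created. Note that `restore()` only changes the calling thread's mask; threads created while the reduced mask was in effect keep it.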

Jim Dempsey

Alexey-Kukanov
Employee

That's a good idea, Jim; thanks for suggesting it. I think it may already be possible to do this with task_scheduler_observer; I need to check.

e4lam
Beginner

Sorry to bring this up again. There *still* seems to be no way to enforce sequential execution? I was excited to find the global_control preview class, but it enforces a minimum value of 2 (at least that's what is documented). Is there a good way to do this using TBB without hacks yet? It's extremely important to be able to globally enforce sequential execution, if for no reason other than debugging.

RafSchietekat
Valued Contributor III

That's normally done for enqueued tasks, although those would also be delayed until the very end if any number of available workers were always busy with plain old spawned tasks from their own queue, so I'm "not sure" I entirely understand the rationale.

Have you tried removing this restriction yourself? The Note in the documentation says "In the current implementation", though, so it's probably not that trivial.
