Right. But if there is only a single call to the tbb::task_scheduler_init constructor, only a single market is created, with the max_num_threads specified by that single call. In that case a sole arena is created with the same num_threads value, so that min() is really a no-op. The situation is different with multiple calls to tbb::task_scheduler_init, but that still doesn't change the fact that setting the number of worker threads to fewer than num_cores-1 is not possible as-is.
We traced through the code and confirmed exactly the behaviour I reported. We have also patched it locally, and now it works as advertised (see Table 41 in the Reference Manual), but we'd rather not have to maintain that patch for the indefinite future.
As for the max, here's a quote from the blog I referred to above, so it at least seems to be intentional:
"The second limit is established when the market is created, and sets the ceiling on the total number of workers available to all master threads. It is determined as one less than the greater of the following two values: amount of threads specified by the argument of task_scheduler_init constructor, and current hardware concurrency (i.e. amount of logical CPUs visible to OS)."
Note also that threads are (supposed to be) created lazily, so not all slots in the market will automatically be occupied by a real live thread.
If that doesn't help, please describe exactly what goes wrong in a real program.
Table numbers vary across versions of the Reference Manual, so which version are you referring to?
In a real program, the same number of worker threads is created as is specified in the max_num_threads argument to task_scheduler_init, less one. As noted in my case, if I specify 1000 threads, 999 worker threads get created immediately, the moment the first task is issued. This can easily be verified on Linux (either in gdb or by running top), and on Windows by monitoring the number of threads in Task Manager.
I've never actually tested it, but the article goes into this at some length, including race conditions that result when task_scheduler_init is called inside multiple threads with different affinities in Linux.
Have you observed this other than by (mis?)interpreting the source code? Please also see my comments below.
"In a real program, the same number of worker threads is created as is specified in the max_num_threads argument to task_scheduler_init, less one. As noted in my case, if I specify 1000 threads, 999 worker threads get immediately created the moment the first task is issued."
That is intentional: sometimes you want oversubscription to counteract underuse (as an admittedly non-ideal workaround), and you are supposed to know what you are doing when asking for a specific number of threads. Also, I highly doubt that performance and latency wouldn't suffer if threads were only created once sufficient tasks had been added to an arena, because that would probably interfere with the benefits of recursive parallelism by reintroducing central coordination. The programmer's goal in creating tasks should be to provide sufficient "parallel slack" (within the limits imposed by parallel overhead), not to create mere "subthreads"; in such a situation all available workers would be put to use, making anything that limits their active number mere overhead.
"However, I'm assuming that thread limiting is not really a feature of TBB per se, since it doesn't seem viable to set the number of worker threads to ever be less than the current hardware concurrency, based on your quote."
Are you perhaps using multiple master threads (application threads that use TBB), possibly without specifying task_scheduler_init in each and every one of them? Any master thread without an explicit task_scheduler_init before its first use of a TBB feature implicitly uses the default number of threads, which is the number of available hardware threads. That is a scenario in which you can inadvertently reach the system's hardware concurrency.
"Since this is a core feature of our software, and we never allow oversubscription, it looks like we'll have to maintain the patch for the indefinite future."
Is this an HPC system that prevents oversubscription (check the documentation again), or really your own choice? I would agree that oversubscription can be a bad thing, but it's not nearly as bad as undersubscription.
If you can motivate the need for an additional setting that changes TBB's view of the available parallelism, on top of how it currently handles different task_scheduler_init instances, such a modification could perhaps be considered (unless I missed something), but at this time that does not yet seem clear.
(Added after #7) Using affinity masks could be a workaround, if available. But let's first see if we can agree what the current problem really is.
I have read your posts here, but unfortunately I do not quite understand what you need or don't need, so let me ask clarifying questions. In what situation do you specify a certain argument to task_scheduler_init (instead of relying on the default), and is the number you specify smaller or bigger than the machine size (or can it be either)?
Overall, as others mentioned, this behavior is deliberate, not a bug. For whatever reason the application wants to oversubscribe the machine, TBB should not prevent it. And if you want to use *fewer* threads than available in hardware, that's fine: specify the number you want, and it is guaranteed that there will be at most that many threads *working* on behalf of the application thread that called task_scheduler_init (see the Note right above Table 41: extra threads, though possibly created, will remain asleep).
Also note that when you work with TBB from multiple application threads, each one will get no more workers than it requested, but the total number of active workers can be as big as the number of available HW threads.
Which environments non-lazily create extra sleeping threads, and how many? This could be crucial in environments where the number of threads is limited by the OS.
In the current implementation, threads are always created lazily, but there can be situations when extra threads are created. As far as I recall, the total number of threads cannot be bigger than what the market can maintain, i.e. the maximum of the hardware concurrency and the very first arena size.
I've completed three tests on 32-bit Windows XP and I couldn't reproduce the issue. Could you provide a simple test case?
I tried to create 256, 512 and 1024 threads and results are as follows:
- Attempt to create 256 threads: 258 threads created (1 main process thread + 256 TBB threads + 1 unknown thread)
- Attempt to create 512 threads: 514 threads created (1 main process thread + 512 TBB threads + 1 unknown thread)
- Attempt to create 1024 threads: 986 threads created (1 main process thread + 984 TBB threads + 1 unknown thread). Note: failed to create 1024 TBB threads.
Here is a screenshot for the 512-thread test case:
Two more screenshots are attached.