Right. But if there is only a single call to the tbb::task_scheduler_init constructor, only a single market is created, with the max_num_threads specified by that single call. In that case a sole arena is created with the same num_threads value, so that min() is really a no-op. The situation is different with multiple calls to tbb::task_scheduler_init, but that still doesn't change the fact that setting the number of worker threads to fewer than num_cores-1 is not possible as-is.
We traced through the code and confirmed exactly the behaviour I reported. We have also patched it locally, and now it works as advertised (see Table 41 in the Reference Manual), but we'd rather not have to maintain that patch for the indefinite future.
As for the max, here's a quote from the blog I referred to above, so it at least seems to be intentional:
"The second limit is established when the market is created, and sets the ceiling on the total number of workers available to all master threads. It is determined as one less than the greater of the following two values: amount of threads specified by the argument of task_scheduler_init constructor, and current hardware concurrency (i.e. amount of logical CPUs visible to OS)."
Note also that threads are (supposed to be) created lazily, so not all slots in the market will automatically be occupied by a real live thread.
If that doesn't help, please describe exactly what goes wrong in a real program.
Table numbers vary across versions of the Reference Manual, so which version are you referring to?
In a real program, the same number of worker threads is created as is specified in the max_num_threads argument to task_scheduler_init, less one. As noted in my case, if I specify 1000 threads, 999 worker threads get created immediately, the moment the first task is issued. This can easily be verified on Linux (either in gdb or by running top), and on Windows by monitoring the number of threads in Task Manager.
I've never actually tested it, but the article goes into this at some length, including race conditions that result when task_scheduler_init is called inside multiple threads with different affinities in Linux.
Have you observed this other than by (mis?)interpreting the source code? Please also see my comments below.
"In a real program, the same number of worker threads is created as is specified in the max_num_threads argument to task_scheduler_init, less one. As noted in my case, if I specify 1000 threads, 999 worker threads get immediately created the moment the first task is issued."
That is intentional: sometimes you want oversubscription to counteract underuse (as an admittedly non-ideal workaround), and you are supposed to know what you are doing when asking for a specific number of threads. Also, I highly doubt that performance and latency wouldn't suffer if threads were only created once sufficient tasks had been added to an arena, because that would probably interfere with the benefits of recursive parallelism by reintroducing central coordination. The programmer's goal in creating tasks should be to provide sufficient "parallel slack" (within the limits imposed by parallel overhead), not to create mere "subthreads"; in such a situation all available workers would be put to use, making anything that limits their active number mere overhead.
"However, I'm assuming that thread limiting is not really a feature of TBB per se, since it doesn't seem viable to set the number of worker threads to ever be less than the current hardware concurrency, based on your quote."
Are you perhaps using multiple master threads (application threads that use TBB), possibly without specifying task_scheduler_init in each and every one of them? Any master thread without an explicit task_scheduler_init before its first use of a TBB feature implicitly uses the default number of threads, which is the number of available hardware threads. That is a scenario in which you can inadvertently reach the system's hardware concurrency.
"Since this is a core feature of our software, and we never allow oversubscription, it looks like we'll have to maintain the patch for the indefinite future."
Is this an HPC system that prevents oversubscription (check the documentation again), or really your own choice? I would agree that oversubscription can be a bad thing, but it's not nearly as bad as undersubscription.
If you can motivate the need for an additional setting that changes TBB's view of the available parallelism, on top of how it currently handles different task_scheduler_init instances, such a modification could perhaps be considered (unless I missed something), but at this time that does not yet seem clear.
(Added after #7) Using affinity masks could be a workaround, if available. But let's first see if we can agree what the current problem really is.
I have read your posts here, but unfortunately I do not quite understand what you need or don't need, so let me ask clarifying questions. In what situation do you specify a certain argument to task_scheduler_init (instead of relying on the default), and is the number you specify smaller or bigger than the machine size (or can it be either)?
Overall, as others mentioned, this behavior is deliberate, not a bug. For whatever reason the application wants to oversubscribe the machine, TBB should not prevent it. And if you want to use *fewer* threads than available in hardware, that's fine: specify the number you want, and it is guaranteed that there will be at most that many threads *working* on behalf of the application thread that called task_scheduler_init (see the Note right above Table 41: extra threads, though possibly created, will remain asleep).
Also note that when you work with TBB from multiple application threads, each one will get no more workers than it requested, but the total number of active workers can be as big as the number of available HW threads.
Which environments non-lazily create extra sleeping threads, and how many? This could be crucial in environments where the number of threads is limited by the OS.
In the current implementation, threads are always created lazily, but there can be situations when extra threads are created. As far as I recall, the total number of threads cannot be bigger than what the market can maintain, i.e. the maximum of the hardware concurrency and the very first arena size.
I've completed three tests on 32-bit Windows XP and I couldn't reproduce the issue. Could you provide a simple test case?
I tried to create 256, 512 and 1024 threads and results are as follows:
- Attempt to create 256 threads: 258 threads created (1 main process thread + 256 TBB threads + 1 unknown thread)
- Attempt to create 512 threads: 514 threads created (1 main process thread + 512 TBB threads + 1 unknown thread)
- Attempt to create 1024 threads: 986 threads created (1 main process thread + 984 TBB threads + 1 unknown thread). Note: failed to create 1024 TBB threads.
Here is a screenshot for the 512-thread test case:
Two more screenshots are attached.