There may be a way to officially do this with TBB (like ippSetAffinity or KMP_AFFINITY=)
If not, then you can (by other means) set the process affinity mask to the logical processors representing one per core. Do this prior to starting the TBB thread pool. Then start the TBB thread pool with half the threads. Look at the TBB init for controls affecting affinity and/or related settings.
In QuickThread you can use
parallel_for(OnEach_L1$, fn, from, to, arg, arg, ...);
Which will establish a thread team using one thread per core even with HT enabled.
Another approach is to affinitize worker threads using tbb::task_scheduler_observer. But as was already dicussed in other threads, using hard affinity ruins composability of your application (resulting in drastic performance drop in case there are other workloads running concurrently).
Besides you'll need to know the mapping between bits in OS affinity mask and phisical CPU layout, and there is not straighforward way to this info from OS. Though it seems that using successive bits in the affinity mask often does what you want, there is not guarantees that any partcular machine follows this rule.
When you have one TBB app running on your system, then you can affinitize all threads of the TBB app to float amongst one thread of each core. The mapping can be derrived from an environment variable that you set (if not present you can cross fingers and let the O/S manage this). Depending on O/S and configuration options, the mask value may vary. A seperate conditioner program can tell you the bit mapping (CPUID works well for this). I place the CPUID functions in my code but the environment variable would be more portable platform wise.
When you have two TBB apps running on your system with HT (or even without HT), then you will have to decide (through testing)if it is better to have them run on different HT siblings or on the same HT siblings. There are issues of L1cache evictions caused between HT siblings between the apps. L2 isn't as severe. In the two TBB app scenario running on same HT thread set you can incorporate SwitchToThread() to provide run bursts following the SwitchToThread() which will maintain/extend your L1 cache population by those softwarethreads sharing the core. This will add overhead so it has to be weighed against the benefit. You may want a counter that determines the number of additional TBB apps running on your system. Then enable/disable or modify this feature.
More than two TBB apps running will complicate the matter and at which point you may want to give up control to the O/S or happenchance.
Personally I prefer to use all HT siblings and control which through the adaptive scheduling I have available to my QuickThread programs. In this case an example would be parallel_for( OneEach_L1$, ... which says: use all cores but only one thread from eachHT siblings (if HT on system) without regard to which HT sibling grabs slice. In this way you do not need to coordinate with other parallel apps are running on the system. If one of the siblings is busy running a different app, the other will automatically take the slice. If both are running different apps they wait their turn (O/S scheduling). If they wait longer than time for other threads (one per core) take to complet slice, then one of those threads takes slice (you can control if slice grabbing is permitted, and by whom, or not).
The TBB app is the only program running on my system (except OS processes). I want to start 1 task per core and let OS to distribute tasks between cores.
What is the environment variable which tells the number of physical cores? If there is one, my program could use it. In there is none, I would need to instruct users to define one.
I think that there is a better way. You might be looking at the problem from the wrong direction. For example, in a 3d rendering system, a common mistake is to put 3d rendering on one thread, physics on another, network on another, and so on. Now, there are specific reasons why a thread can be dedicated to one task only, generally, there are ways around it, which tend to lead to better performance --this is what TBB is designed for.
There are ALWAYS areas of code that can be coded for parallel processing. If you limit yourself to setting a job per physical code, then what happens when a 24 core version comes out? Parallelize your code at a deeper level... Find the paralelable (is this a word?1?) areas. If you cannot find any, then you arent looking hard enough :P
I am unaware of any OS environment variable that tells the number of physical cores.
You should take the advice of those on this forum, and the opportunity with your current requirements to re-think your problem in terms of tasks instead of threads. Once you do this you will find more opportunities for parallelization.
When thinking in threads (pthread approach) you generally take an application and decompose it by major function and assign one thread per major function. Then examine each major function to determine if they can be broken into sub-functions run in parallel (new set of threads), then examine the sub-functions to see if they can be further sub-divided, etc... The effectiveness of this method greatly varies with the number of hardware threads available and a lot of work is involved in analysis for decomposition. This work has to be redone when the platform changes.
Using tasks and thread pools (TBB and QuickThread method) leads to easire decision points as to how to partition the major functions, sub-functions and sub-sub functions. And this effort need not be redone using tasks to the extent that it may require when explicitly using threads.
My code is arithmetic heavy and doesn't benefit from hyper-threading. It is not easily parallelizable, but I want to improve its scalability as much as possible. Due to the nature of the problem, data flow between threads is fairly intensive and threads operate on shared data. Blocking and false sharing must be avoided as much as possible. Probably my code is not scalable beyond 8 cores in typical situations.
Hope it clarified my situation.
Before you jump into a general solution, have you actually validated this hypothesis of better performance by using task_scheduler_init to set the number of threads with a hard-coded number specific to your test machine and running comparative benchmarks? I couldn't make that out from what's been written.
"I have a situation where the number of grains available for parallel processing is limited."
I don't think that's how "grains" is supposed to be used: you can set grainsize, but you don't have a number of grains available, nor do you split work into grains. Those are tasks, even if you can of course assemble several such tasks in one tbb::task, sometimes depending on grainsize. If I'm mistaken, please educate me, otherwise, please cease and desist, because it's confusing. :-) Although, having task and tbb::task mean different things is confusing, too. Anyone from the TBB team to the rescue?
Now, what are problems?
The range is too small. Work balancing is an issue. The good news is that each element of the range represents more or less equal amount of work. On Xeon with 8 cores and 16 it makes no sense to use all 16 threads, but on Core i5 it makes sense to use all available threads.
Hope it clarifies my question.
I'm still waiting for feedback from a native speaker in the TBB team about this vocabulary problem. :-)
"The range is too small. Work balancing is an issue. The good news is that each element of the range represents more or less equal amount of work. On Xeon with 8 cores and 16 it makes no sense to use all 16 threads, but on Core i5 it makes sense to use all available threads."
My question is about your findings so far with different hard-coded values, to make this a practical matter instead of a theoretical one with just opinions?
(Replaced previous version in second paragraph a few minutes after posting it, sorry for any confusion.)
TBB actually works better when there is ALOT of work to do for each task because there is some overhead assosiated with creating and running the tasks.
16 threads: four threads run 2 iterations, twelve run 1 iteration
8 threads: four threads run 3 iterations, four threads run 2 iterations
2 threads: 2 threads run 10 iterations
Assume first two situations are your 8 cores with HT system
Assume third situation is processor with 2 cores no HT
Assume all cores same frequency same memory rating same cache, ... (identical)
Assume running HT sibling runs adds +30% aggregate improvement
Note, integer intensive (non-memory intensive) HT environment impact on HT is little to none (read 100% aggregate improvement)
For HT assume average is 1.30/2 = .65 performance per threadof non-HT thread or 1.54 slower
Assume an iteration on non-HT takes 1 unit of time
16 thread:1 *1.54 (first iteration by 16 threads) + (1 * 1) or (1 * 1.54) (depending on thread mix)
more importantly 12 threads are available to do additional work
latency = 2.54 to 3.08 units (random average 2.81 units)
available computational resources (12 * .65 * 1unit) = 7.8 units
8 thread: 2 * 1 (first two iterations by 8 threads) + 1 * 1 (third iteration by 4 threads)
with 4 threads available to do additional work (8 HT thread siblings idle)
latency = 3 units
available computational resources (4 * 1 unit) = 4 units
2 thread: 2 * 10
0 threads available to do additional work
latency 20 units
available computational resources 0 units
While the estimated HT configuration is marginally faster (6.76%) when looking at latencies you are forgetting to look at the available computational resources 7.8 units with HT versus 4 units without. Nearly twice as much residual (waiting) computational capacity. For simple test programs, this is invariably idle time, for real world applications, this is processing time available to do other work in your application.
TBB and other task model threading paradigms puts this available computational resourceback into your program.
Note, the HT effect rule of thumb (+30% overhead) applies to a random mix of programming. In some of the few favorable circumstances, when you can coordinate the HT siblings to work together (one makes use of L1/L2 loads by the other HT sibling) then you can attain at times a~500% boost in performance. Take some time to learn how to best use HT.
"I thought about hard wiring the number of threads. In my case the problem's size is not know at compile time. I will probably shift the burden to the users and let them to set the number of threads via config file or environment variable."
That's not what I meant. If it's a real issue, you should be able to demonstrate it with hard-coded values, otherwise it's imaginary and probably not worth any further discussion or exploration. So which is it?
My case is FPU heavy SSE. It is moderately memory intensive. My tests showed no benefit from HT. I want to optimize single program execution and don't mind HT siblings being wasted, as long as my program runs faster. I think ~500% would not apply to to the main program (100% is theoretical limit).
Btw, how do you control HT siblings? I can imagine how to it only implicitly, by structuring the program's flow accordingly.
It is empirical, not theoretical. For example, running 8 tasks on 8 physical cores is better that running 9 tasks on 16 HT threads. Theoretically, 2 case can be 2*8/9 = 1.78 times slower. But running 16 tasks on 16 HT threads should not be much slower.
For me this is relatively trivial - because it is a design feature of the threading toolkit QuickThread.
You can use Lambda functions format or not
Using Lambda function call syntax:
OneEach_L1$, // one thread per L1 on system
0, n, // half open range (size of array dimention to slice)
[&](intptr_t iBegin, intptr_t iEnd)
// make 2nd slice by HT (SMT) siblings
[&](intptr_t iSMTthread, intptr_t nSMTthreads)
// iBegin, iEnd = our SMT team's half open range
// iSMTthread = team member number of our SMT team (0/1)
// nSMTthreads = number of team members of outSMT team (1/2)
... your code to process array1 and array2
Using naked functions
OneEach_L1$, // one thread per L1 on system
firstSlice, // function address
0, n, // half open range
array1, array2); // function arguments
// each of one thread per L1calls here as task
void firstSlice(intptr_t iBegin, intptr_t iEnd, double* a1, double* a2)
// make 2nd slice by HT (SMT) siblings
parallel_distribute( L1$, secondSlice, iBegin, iEnd, a1, a2);
// all threads call
intptr_t iSMTthread, // 0 or 1 (larger for future SMT designs)
int_ptr_t nSMTthreads, //1 = no HT, 2 = HT,(larger for future SMT designs)
intptr_t iBegin, int iEnd, //half open range from priortask
double* a1, double* a2) // args from prior task
In the above examples, parallel_distribute was used in place of parallel_for to make the second slice due to HT slices generally work better when sliced differently than taking half of iteration space. An example might be both siblings using the same row on one array and different rows on the second array (or same row on one array and alternate cache lines on the same row of the second array). As to what works best, this will depend on the problem at hand.
Most of the QuickThread parallel_xxx constructs accept an optionalpalcement (or thread association argument).
Thread teams can be established
a) by association (e.g. within current thread socket, or say L1/L2/L3 cache)
b) by dissassociation (Exclude current thread cache level n)
c) by sparse distribution (e.g. OneEach_L3$ for one thread per L3)
d) any of the above withing a proximity of cache levels
e) availability (not currently scheduled)
f) hot in cache or not in cache
g) specific thread nubmer(s)
h) FIFO order or non-FIFO
i) as completion task
j) as I/O task
and a few other abstruse uses.
The team establishment is performed using a single optional argument to the parallel_xxx construct. It is relatively easy to use, and becomes second nature after you get the hang of it.
Also, it has good portability. The same program you write customized for a 4 socket Nehalem 7650 will run on a single core system with or without HT.
L1$, // form team using HT siblings of my core
[&](intptr_t, iThread, intptr_t nThreads)
// system without HT iThread = 0, nThreads = 1
Similar portable capabilities between one socket and n socket systems.