Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

TBB and hyper-threading

renorm
Beginner
1,227 Views
Is there a way to launch exactly one thread per core on a Xeon with hyper-threading on? I have a situation where the number of grains available for parallel processing is limited. It is a "do a step, then do another step" type of algorithm. Steps are sequential, but each step has enough work to justify parallelization. At the same time, no step has enough work to be split into many grains. It is a borderline situation where one has to balance the number of grains against parallelization overhead. What happens if exactly 8 grains are available and exactly 8 tasks are launched on an 8-core, 16-thread Xeon? Will each task run on a separate core?

Any thoughts?
0 Kudos
17 Replies
jimdempseyatthecove
Honored Contributor III
1,227 Views

There may be a way to do this officially with TBB (something like ippSetAffinity or KMP_AFFINITY=).
If not, then you can (by other means) set the process affinity mask to the logical processors representing one per core. Do this prior to starting the TBB thread pool, then start the TBB thread pool with half the threads. Also look at the TBB init for controls affecting affinity and/or related settings.
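For example, a minimal Windows-only sketch of that idea (it assumes even-numbered affinity-mask bits address distinct cores, which, as discussed below, the OS does not guarantee):

    #include <windows.h>
    #include <tbb/task_scheduler_init.h>

    int main()
    {
        DWORD_PTR processMask, systemMask;
        GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask);
        // keep bits 0, 2, 4, ... : one logical processor per core (assumed mapping)
        SetProcessAffinityMask(GetCurrentProcess(),
                               processMask & (DWORD_PTR)0x5555555555555555ULL);
        // then start the TBB pool with half the hardware threads
        tbb::task_scheduler_init init(
            tbb::task_scheduler_init::default_num_threads() / 2);
        // ... parallel work ...
        return 0;
    }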

In QuickThread you can use

parallel_for(OneEach_L1$, fn, from, to, arg, arg, ...);

which will establish a thread team using one thread per core even with HT enabled.

Jim Dempsey

0 Kudos
Andrey_Marochko
New Contributor III
1,227 Views
Normally you can just request the number of threads corresponding to the number of physical cores from tbb::task_scheduler_init, and trust the OS to spread the threads one per core. From my observations it does exactly this if there are no other threads running concurrently (in your or other processes).
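For example (a minimal sketch with the task_scheduler_init API, assuming HT exactly doubles the logical thread count):

    #include <tbb/task_scheduler_init.h>

    // use half of the logical threads, i.e. one thread per physical core
    tbb::task_scheduler_init init(
        tbb::task_scheduler_init::default_num_threads() / 2);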

Another approach is to affinitize worker threads using tbb::task_scheduler_observer. But as was already discussed in other threads, using hard affinity ruins the composability of your application (resulting in a drastic performance drop when other workloads run concurrently).

Besides, you'll need to know the mapping between bits in the OS affinity mask and the physical CPU layout, and there is no straightforward way to get this info from the OS. Though using successive bits in the affinity mask often does what you want, there are no guarantees that any particular machine follows this rule.
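For illustration only, a Windows-only sketch of the observer approach (it relies on exactly the unguaranteed bits-to-cores assumption above):

    #include <windows.h>
    #include <tbb/task_scheduler_observer.h>

    class pinning_observer : public tbb::task_scheduler_observer {
    public:
        pinning_observer() { observe(true); }   // start receiving callbacks
        /*override*/ void on_scheduler_entry(bool /*is_worker*/) {
            static volatile LONG next_slot = -1;
            LONG slot = InterlockedIncrement(&next_slot);
            // pin this thread to bit 2*slot, assuming bits 0, 2, 4, ... are distinct cores
            SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << (2 * slot));
        }
    };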
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,227 Views
>>Another approach is to affinitize worker threads using tbb::task_scheduler_observer. But as was already discussed in other threads, using hard affinity ruins the composability of your application (resulting in a drastic performance drop when other workloads run concurrently).

When you have one TBB app running on your system, you can affinitize all threads of the TBB app to float amongst one thread of each core. The mapping can be derived from an environment variable that you set (if it is not present, you can cross your fingers and let the O/S manage this). Depending on the O/S and configuration options, the mask value may vary. A separate helper program can tell you the bit mapping (CPUID works well for this). I place the CPUID functions in my code, but the environment variable would be more portable platform-wise.

When you have two TBB apps running on your system with HT (or even without HT), you will have to decide (through testing) whether it is better to have them run on different HT siblings or on the same HT siblings. There are issues of L1 cache evictions between the apps' HT siblings; L2 isn't hit as severely. In the two-TBB-app scenario running on the same HT thread set, you can incorporate SwitchToThread() to provide run bursts, which will maintain/extend the L1 cache population by the software threads sharing the core, as sketched below. This adds overhead, so it has to be weighed against the benefit. You may want a counter that tracks the number of additional TBB apps running on your system, and then enable/disable or modify this feature accordingly.
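A rough sketch of that idea in plain Win32 terms (do_work_item and kBurst are hypothetical):

    #include <windows.h>
    #include <stdint.h>

    void do_work_item(intptr_t i);   // hypothetical unit of work

    void process_slice(intptr_t iBegin, intptr_t iEnd)
    {
        const intptr_t kBurst = 1024;           // tune: iterations per run burst
        for (intptr_t i = iBegin; i < iEnd; ++i) {
            do_work_item(i);                    // work whose L1 lines the sibling shares
            if ((i - iBegin + 1) % kBurst == 0)
                SwitchToThread();               // yield so the HT sibling gets its burst
        }
    }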

More than two TBB apps running complicates the matter, at which point you may want to give up control to the O/S or to happenstance.

Personally I prefer to use all HT siblings and control which one runs through the adaptive scheduling available to my QuickThread programs. In this case an example would be parallel_for(OneEach_L1$, ...), which says: use all cores but only one thread from each pair of HT siblings (if HT is on the system), without regard to which HT sibling grabs the slice. In this way you do not need to coordinate with other parallel apps running on the system. If one of the siblings is busy running a different app, the other will automatically take the slice. If both are running different apps, they wait their turn (O/S scheduling). If they wait longer than the time the other threads (one per core) take to complete a slice, then one of those threads takes the slice (you can control whether slice grabbing is permitted, and by whom, or not).

Jim Dempsey
0 Kudos
renorm
Beginner
1,227 Views
Cache eviction is another issue.

The TBB app is the only program running on my system (except OS processes). I want to start 1 task per core and let the OS distribute tasks between cores.

What is the environment variable that tells the number of physical cores? If there is one, my program could use it. If there is none, I would need to instruct users to define one.
0 Kudos
smasherprog
Beginner
1,227 Views
TBB is not really designed to do what you want (I am not saying it cannot). However, it seems that you would be best served by not using any threading package and creating your own thread pool. The alternative is to use the task scheduler to spawn a task for each of your jobs and run them in parallel.

I think there is a better way; you might be looking at the problem from the wrong direction. For example, in a 3D rendering system, a common mistake is to put 3D rendering on one thread, physics on another, networking on another, and so on. Now, there are sometimes specific reasons why a thread must be dedicated to one task only, but generally there are ways around it, and they tend to lead to better performance; this is what TBB is designed for.

There are ALWAYS areas of code that can be coded for parallel processing. If you limit yourself to one job per physical core, then what happens when a 24-core version comes out? Parallelize your code at a deeper level... Find the parallelizable (is that a word?) areas. If you cannot find any, then you aren't looking hard enough :P
0 Kudos
Alexey-Kukanov
Employee
1,227 Views
One task per core is certainly not the approach TBB was designed for. You can do that if you wish, but it could be done more easily and naturally with OpenMP, and maybe even with manual threading.

I am unaware of any OS environment variable that tells the number of physical cores.
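(For completeness: on Windows the physical core count can be queried programmatically instead; a minimal sketch using GetLogicalProcessorInformation, error handling omitted:)

    #include <windows.h>
    #include <vector>

    int count_physical_cores()
    {
        DWORD len = 0;
        GetLogicalProcessorInformation(NULL, &len);     // ask for the required buffer size
        std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
            len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
        GetLogicalProcessorInformation(&info[0], &len);
        int cores = 0;
        for (size_t i = 0; i < info.size(); ++i)
            if (info[i].Relationship == RelationProcessorCore)
                ++cores;                                // one record per physical core
        return cores;
    }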
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,227 Views
renorm,

You should take the advice of those on this forum and use the opportunity, given your current requirements, to re-think your problem in terms of tasks instead of threads. Once you do, you will find more opportunities for parallelization.

When thinking in threads (the pthread approach), you generally take an application, decompose it by major function, and assign one thread per major function. Then you examine each major function to determine whether it can be broken into sub-functions run in parallel (a new set of threads), then examine the sub-functions to see if they can be further subdivided, etc. The effectiveness of this method varies greatly with the number of hardware threads available, and a lot of work is involved in the decomposition analysis. This work has to be redone when the platform changes.

Using tasks and thread pools (the TBB and QuickThread method) leads to easier decision points on how to partition the major functions, sub-functions, and sub-sub-functions. And with tasks this effort need not be redone to the extent it may have to be when explicitly using threads.

Jim Dempsey
0 Kudos
renorm
Beginner
1,227 Views
I want to initialize the task_scheduler_init class with fewer than the maximum number of threads and let the TBB task scheduler do its work. For example, on an 8-core, 16-thread Xeon I would initialize task_scheduler_init with 8 threads.

My code is arithmetic-heavy and doesn't benefit from hyper-threading. It is not easily parallelizable, but I want to improve its scalability as much as possible. Due to the nature of the problem, data flow between threads is fairly intensive and threads operate on shared data. Blocking and false sharing must be avoided as much as possible. My code probably doesn't scale beyond 8 cores in typical situations.

Hope that clarifies my situation.
0 Kudos
RafSchietekat
Valued Contributor III
1,227 Views
"Is there a way to launch exactly one thread per core on Xeon with hyper-threading on?"
Before you jump into a general solution, have you actually validated this hypothesis of better performance by using task_scheduler_init to set the number of threads to a hard-coded number specific to your test machine and running comparative benchmarks? I couldn't make that out from what's been written.

"I have a situation where the number of grains available for parallel processing is limited."
I don't think that's how "grains" is supposed to be used: you can set a grainsize, but you don't have a number of grains available, nor do you split work into grains. Those are tasks, even if you can of course assemble several such tasks into one tbb::task, sometimes depending on grainsize. If I'm mistaken, please educate me; otherwise, please cease and desist, because it's confusing. :-) Although having task and tbb::task mean different things is confusing, too. Anyone from the TBB team to the rescue?
0 Kudos
renorm
Beginner
1,227 Views
Consider a blocked range [0, 20). parallel_for splits it into subranges [r(n-1), r(n)). Each subrange has r(n) - r(n-1) elements (or grains) and is processed by a single thread. You can think of each subrange as a task, though it is not derived from tbb::task.
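For concreteness, a minimal sketch of that range (do_element is a hypothetical per-element function; grainsize 1 plus a simple_partitioner splits all the way down, so up to 20 one-element subranges can run on different threads):

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <tbb/partitioner.h>

    void do_element(size_t i);   // hypothetical per-element work

    void run_step()
    {
        tbb::parallel_for(
            tbb::blocked_range<size_t>(0, 20, /*grainsize*/ 1),
            [](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    do_element(i);
            },
            tbb::simple_partitioner());
    }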

Now, what are the problems?

The range is too small, so work balancing is an issue. The good news is that each element of the range represents a more or less equal amount of work. On a Xeon with 8 cores and 16 threads it makes no sense to use all 16 threads, but on a Core i5 it makes sense to use all available threads.

Hope it clarifies my question.

0 Kudos
RafSchietekat
Valued Contributor III
1,227 Views
"Consider a blocked range [0, 20). Parallel_for splits it into subranges [r(n-1), r(n)). Each subrange has r(n) - r(n-1) elements (or grains) and processed by a single thread. You can think of each subrange as a task. It is not derived from the tbb::task, though."
I'm still waiting for feedback from a native speaker in the TBB team about this vocabulary problem. :-)

"The range is too small. Work balancing is an issue. The good news is that each element of the range represents more or less equal amount of work. On Xeon with 8 cores and 16 it makes no sense to use all 16 threads, but on Core i5 it makes sense to use all available threads."
My question is about your findings so far with different hard-coded values, to make this a practical matter instead of a theoretical one with just opinions.

(Replaced previous version in second paragraph a few minutes after posting it, sorry for any confusion.)
0 Kudos
smasherprog
Beginner
1,227 Views
If you can break it into 20 pieces like that, is it possible for parallel_for to use a grainsize that will break your work up into 20 pieces so that they can all be run on different threads in parallel? If so, this should be faster than running on a single thread. Even if the tasks are large, there should still be some speedup over single-threaded execution.

TBB actually works better when there is a lot of work to do for each task, because there is some overhead associated with creating and running the tasks.
0 Kudos
renorm
Beginner
1,227 Views
I thought about hard-wiring the number of threads, but in my case the problem's size is not known at compile time. I will probably shift the burden to the users and let them set the number of threads via a config file or environment variable.
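Something like this minimal sketch (MYAPP_NUM_THREADS is a hypothetical variable name):

    #include <cstdlib>
    #include <tbb/task_scheduler_init.h>

    int main()
    {
        int nthreads = tbb::task_scheduler_init::default_num_threads();
        if (const char* s = std::getenv("MYAPP_NUM_THREADS"))   // user override
            nthreads = std::atoi(s);
        tbb::task_scheduler_init init(nthreads);
        // ... run the steps ...
        return 0;
    }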
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,227 Views
Your example of blocked range [0, 20)

16 threads: four threads run 2 iterations, twelve run 1 iteration
8 threads: four threads run 3 iterations, four threads run 2 iterations
2 threads: two threads run 10 iterations each

Assume the first two situations are your 8-core HT system.
Assume the third situation is a processor with 2 cores, no HT.
Assume all cores have the same frequency, the same memory rating, the same cache, ... (identical).
Assume a running HT sibling adds +30% aggregate improvement.
Note: for integer-intensive (non-memory-intensive) code, the impact of the HT environment on an HT sibling is little to none (read: ~100% aggregate improvement).

For HT, assume the average is 1.30/2 = 0.65 of the per-thread performance of a non-HT thread, i.e. 1/0.65 ≈ 1.54 times slower.

Assume an iteration on a non-HT thread takes 1 unit of time.

16 threads: 1 * 1.54 (first round of 16 iterations by 16 threads) + (1 * 1) or (1 * 1.54) for the remaining 4 iterations (depending on thread mix);
more importantly, 12 threads are available to do additional work
latency = 2.54 to 3.08 units (random average 2.81 units)
available computational resources = 12 * 0.65 * 1 unit = 7.8 units

8 threads: 2 * 1 (first two rounds of 8 iterations each by 8 threads) + 1 * 1 (third round of 4 iterations by 4 threads),
with 4 threads available to do additional work (8 HT siblings idle)
latency = 3 units
available computational resources = 4 * 1 unit = 4 units

2 threads: 2 threads * 10 iterations * 1 unit
0 threads available to do additional work
latency = 10 units
available computational resources = 0 units

While the estimated HT configuration is marginally faster (6.76%) in latency, you are forgetting to look at the available computational resources: 7.8 units with HT versus 4 units without, nearly twice as much residual (waiting) computational capacity. For simple test programs this is invariably idle time; for real-world applications it is processing time available to do other work in your application.

TBB and other task-model threading paradigms put this available computational resource back into your program.

Note: the HT rule of thumb (+30% aggregate improvement) applies to a random mix of programming. In some of the few favorable circumstances, when you can coordinate the HT siblings to work together (one makes use of the L1/L2 loads by the other HT sibling), you can at times attain a ~500% boost in performance. Take some time to learn how to best use HT.

Jim Dempsey
0 Kudos
RafSchietekat
Valued Contributor III
1,227 Views

"I thought about hard wiring the number of threads. In my case the problem's size is not know at compile time. I will probably shift the burden to the users and let them to set the number of threads via config file or environment variable."
That's not what I meant. If it's a real issue, you should be able to demonstrate it with hard-coded values, otherwise it's imaginary and probably not worth any further discussion or exploration. So which is it?

0 Kudos
renorm
Beginner
1,227 Views
@Jim,
My case is FPU-heavy SSE. It is moderately memory-intensive. My tests showed no benefit from HT. I want to optimize single-program execution and don't mind HT siblings being wasted, as long as my program runs faster. I think ~500% would not apply to the main program (100% is the theoretical limit).

Btw, how do you control HT siblings? I can imagine how to do it only implicitly, by structuring the program's flow accordingly.

@Raf,
It is empirical, not theoretical. For example, running 8 tasks on 8 physical cores is better than running 9 tasks on 16 HT threads. Theoretically, the second case can be 2*8/9 ≈ 1.78 times slower. But running 16 tasks on 16 HT threads should not be much slower.
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,227 Views
>>Btw, how do you control HT siblings? I can imagine how to do it only implicitly, by structuring the program's flow accordingly.

For me this is relatively trivial, because it is a design feature of the QuickThread threading toolkit.
You can use the Lambda-function format or not.

Using Lambda function call syntax:

...
double** array1;
double** array2;
...
parallel_for(
    OneEach_L1$,    // one thread per L1 cache (core) on the system
    0, n,           // half-open range (size of the array dimension to slice)
    [&](intptr_t iBegin, intptr_t iEnd)
    {
        // make the 2nd slice by HT (SMT) siblings
        parallel_distribute(
            L1$,
            [&](intptr_t iSMTthread, intptr_t nSMTthreads)
            {
                // here with:
                //   iBegin, iEnd = our SMT team's half-open range
                //   iSMTthread   = member number within our SMT team (0/1)
                //   nSMTthreads  = number of members of our SMT team (1/2)
                ... your code to process array1 and array2
            });     // closes parallel_distribute
    });             // closes parallel_for


Using naked functions

...
double** array1;
double** array2;
...
parallel_for(
    OneEach_L1$,        // one thread per L1 cache (core) on the system
    firstSlice,         // function address
    0, n,               // half-open range
    array1, array2);    // function arguments
...

// one thread per L1 calls here as a task
void firstSlice(intptr_t iBegin, intptr_t iEnd, double** a1, double** a2)
{
    // make the 2nd slice by HT (SMT) siblings
    parallel_distribute(L1$, secondSlice, iBegin, iEnd, a1, a2);
}

// all threads call here
void secondSlice(
    intptr_t iSMTthread,            // 0 or 1 (larger for future SMT designs)
    intptr_t nSMTthreads,           // 1 = no HT, 2 = HT (larger for future SMT designs)
    intptr_t iBegin, intptr_t iEnd, // half-open range from the prior task
    double** a1, double** a2)       // args from the prior task
{
    ...
}

In the above examples, parallel_distribute was used in place of parallel_for to make the second slice because HT slices generally work better when sliced differently than by taking half of the iteration space. An example might be both siblings using the same row of one array and different rows of the second array (or the same row of one array and alternate cache lines of the same row of the second array). What works best will depend on the problem at hand.

Most of the QuickThread parallel_xxx constructs accept an optional placement (or thread-association) argument.

Thread teams can be established

a) by association (e.g. within the current thread's socket, or, say, L1/L2/L3 cache)
b) by disassociation (exclude the current thread's cache level n)
c) by sparse distribution (e.g. OneEach_L3$ for one thread per L3)
d) any of the above within a proximity of cache levels
e) by availability (not currently scheduled)
f) hot in cache or not in cache
g) by specific thread number(s)
h) in FIFO or non-FIFO order
i) as a completion task
j) as an I/O task

and a few other abstruse uses.

Team establishment is performed with a single optional argument to the parallel_xxx construct. It is relatively easy to use and becomes second nature after you get the hang of it.

Also, it has good portability: the same program you write customized for a 4-socket Nehalem 7650 will run on a single-core system with or without HT.

e.g.

parallel_distribute(
    L1$,    // form team using the HT siblings of my core
    [&](intptr_t iThread, intptr_t nThreads)
    {
        // on a system without HT: iThread = 0, nThreads = 1
    });

Similar portable capabilities exist between one-socket and n-socket systems.

Jim Dempsey
0 Kudos
Reply