Hi all, I'm new to the Intel TBB world and have been studying the code these days. I was surprised to see that the code does not seem to use hardware affinity to ensure load balancing across multiple cores; instead, it seems to use a mechanism that maps tasks to logical threads. I don't know whether my understanding is right. Can someone here explain this in more detail? Thanks very much in advance.
Why would hardware affinity assure load balance? All that would do is bind particular threads to particular hardware without ensuring that any particular thread doesn't get bound up with some work while the others sit idle.
As a TBB newbie, have you read the tutorial document available from the website? Chapter 11, on the task scheduler, talks about dividing tasks into subtasks and then dynamically assigning those tasks to threads as the threads become idle. The way to get load balance is to do this task subdivision so that no particular thread gets bogged down with one task while the others sit idle.
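To make the point concrete, here is a minimal single-threaded model (not TBB code, and not TBB's actual work-stealing algorithm) that compares two policies on a list of task costs: fixing each task to a worker in advance, versus handing each task to whichever worker is idle first. The function names and the cost numbers are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Finish time when task i is fixed in advance to worker (i % workers),
// regardless of how busy that worker already is.
long static_makespan(const std::vector<long>& costs, std::size_t workers) {
    assert(workers > 0);
    std::vector<long> busy(workers, 0);
    for (std::size_t i = 0; i < costs.size(); ++i)
        busy[i % workers] += costs[i];
    return *std::max_element(busy.begin(), busy.end());
}

// Finish time when each task, in order, goes to the worker that
// becomes idle first (the one with the least accumulated work).
long dynamic_makespan(const std::vector<long>& costs, std::size_t workers) {
    assert(workers > 0);
    std::vector<long> busy(workers, 0);
    for (long c : costs)
        *std::min_element(busy.begin(), busy.end()) += c;
    return *std::max_element(busy.begin(), busy.end());
}
```

With costs {8, 1, 1, 1, 8, 1, 1, 1} on 4 workers, the fixed assignment finishes at time 16 because both expensive tasks land on the same worker, while the take-when-idle policy finishes at time 9. Neither policy says anything about which core a worker runs on.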
Thanks for your reply. I do understand that TBB implements load balancing by dispatching tasks to worker threads that run on the hardware threads provided by the cores (I assumed the traditional way was to bind raw threads to different cores using hardware affinity). Of course, the approach used by TBB can keep all logical threads busy while taking cache hits into account. Let me restate my question: how does TBB ensure that the logical threads (including the master thread, which initializes the task scheduler, and the worker threads spawned by RML) run on different cores? (I assume that having each logical thread run on its own core makes the best use of the computing resources.) I failed to find any affinity binding between logical threads and cores.
I know of no "traditional" way of ensuring load balance via binding threads to particular cores; threads are sometimes pinned to particular cores to minimize latency in some real-time ISRs, but that's a very different beastie.
For many algorithms, affinity is a bad thing. It only restricts which threads can run on which cores: imagine the scheduler skipping the assignment of a thread to a free core because that core wasn't in the thread's affinity hint. That assures more idle time and less performance.
But there are algorithms where repeated processing of the same dataset can gain performance by running the same segments on the same cores each time. There are a couple of ways that affinity can be introduced within TBB. One is through the affinity partitioner. This changes the scheduling of constructs such as parallel_for and parallel_reduce, employing affinity hints to maximize per-region thread reuse. TBB also has the note_affinity() and set_affinity() functions to apply affinity hints per task.
If you are asking how software threads (pthreads, Windows threads, and so on) map onto hardware threads (cores, HT), that is not a question for TBB; it is the job of the OS to map these logical threads onto the available hardware resources. In the general case, the best you can do to help the OS is to stay out of its way :) This is why TBB does not bother with hardware affinity (pinning threads), although it provides the means to do it at your own risk.
Thanks to both of you for the clarification. From your statements I understand that TBB never touches the affinity binding between logical threads and cores (even though some interfaces are exported to let the user change it). If all hardware binding is the OS's job, consider a slightly more complex application on a 4-core Intel i3. The application has 3 internal threads (A, B and C), all of which can run in parallel, and one of them (B) is accelerated with TBB parallel_for. What is the best hardware concurrency setting in this case? It seems that preemption will occur if we use the default value of 4, because 6 threads (4+2) exceeds the number of cores (4). Do we then need to set the concurrency to 2 manually? Am I right?
It depends on what A and C are doing. If there is a chance that they are sleeping or blocked on a syscall, such a setup will lead to undersubscription, which is much worse than a little oversubscription. I would decrease the default concurrency level for TBB by the number of threads that are known to be doing heavy computation at the same time as TBB is executing parallel algorithms.
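As a rough sketch of that rule of thumb, the helper below subtracts the number of known compute-heavy threads from the hardware thread count and never returns less than 1. The function name is hypothetical and not part of TBB; it only does the arithmetic.

```cpp
#include <cassert>

// Hypothetical helper (not a TBB API): the number of threads to let TBB use
// when `busy_compute_threads` other threads are known to be computing heavily
// at the same time. The result counts the thread that calls into TBB.
unsigned suggested_tbb_threads(unsigned hardware_threads,
                               unsigned busy_compute_threads) {
    // std::thread::hardware_concurrency() may report 0 when the count is
    // unknown, so treat 0 as a single hardware thread.
    if (hardware_threads == 0)
        hardware_threads = 1;
    // Always leave at least one thread for TBB.
    if (busy_compute_threads >= hardware_threads)
        return 1;
    return hardware_threads - busy_compute_threads;
}
```

For the 4-core example with A and C both computing, suggested_tbb_threads(4, 2) gives 2. In TBB of this era the value would be passed to the tbb::task_scheduler_init constructor; newer releases limit parallelism through tbb::global_control instead. If A and C mostly sleep or block, passing 0 for the second argument keeps the default of 4.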
>>I know of no "traditional" way of ensuring load balance via binding threads to particular cores; threads are sometimes pinned to particular cores to minimize latency in some real-time ISRs, but that's a very different beastie
In QuickThread (www.quickthreadprogramming.com) a compute class pool of threads is typically created with a size equal to the number of logical processors permitted to the process, and these threads are affinity pinned. Any number of I/O class threads can be instantiated at init time as well. Each of the affinity pinned compute class threads has its own task queues (multiple queues per compute class thread). When a thread completes a task (or is in its idle loop), the task search follows a pecking order established at init time and modified at run time. The first queues to check are its own, next are those of its HT siblings, next its L2 siblings, next its L3 (socket) siblings, next its NUMA node siblings, next one-hop NUMA nodes, next two-hop NUMA nodes, next three-hop NUMA nodes. Later revisions are intended to account for MIC distance too. Load balancing comes from a combination of thread availability and proximity to the enqueuing thread. There are additional qualifiers to limit the distance from the enqueuing thread, as well as exclusion and distribution by distance from the enqueuing thread. Load balancing is under the control of the programmer should they need or desire such functionality. Also, a well balanced server app will run unchanged on your dual/quad core desktop or notebook.
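The pecking order described above can be pictured with a small sketch. This is not QuickThread code; the enum, the function name, and the idea of passing a per-queue distance table are assumptions made purely for illustration. Given the distance from the searching thread to the owner of each queue, it returns the queue indices in the order they would be checked, nearest first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical distance classes, ordered from nearest to farthest.
enum Distance {
    Self = 0, HtSibling, L2Sibling, L3Sibling,
    NumaNode, NumaOneHop, NumaTwoHops, NumaThreeHops
};

// distance_to_queue[i] is how far queue i's owner is from the searching
// thread. Returns queue indices sorted nearest first; queues at the same
// distance keep their original relative order.
std::vector<std::size_t> search_order(
        const std::vector<Distance>& distance_to_queue) {
    std::vector<std::size_t> order(distance_to_queue.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return distance_to_queue[a] < distance_to_queue[b];
                     });
    return order;
}
```

For example, on a machine with two hyperthreaded cores sharing an L3, thread 2 might see distances {L3Sibling, L3Sibling, Self, HtSibling} and would then check queues 2, 3, 0, 1 in that order.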
Though I've never had time to play with QuickThread, I am aware of the work you put into hierarchies of adjacency. With such strict controls over thread movement, it's certainly an advantage to be able to pin threads to particular cache hierarchies, but I would argue that the load balance has little to do with those pinnings to processors, and much more to do with the ability of your scheduler to share work equitably across the fabric. Again, it's the ability to subdivide and equitably share the work that guarantees load balance, not the affinity of threads to parts of that fabric; the scheduling is the contributing feature, not the affinity.
"I would argue that the load balance has little to do with those pinnings to processors, but with the ability of your scheduler to share work equitably across the fabric." I do see it as an advantage if the scheduler knows about proximity when dividing work into tasks that should either run closely together (share data hot in cache, the default for recursive parallelism) or far apart (avoid thrashing, if explicitly specified). Jim should be able to back this intuition up with hard figures... or correct me if needed.
While such awareness may provide some icing on the cake, it is certainly not essential, as already indicated above. TBB does a good job without being NUMA-aware, and has provisions for affinity across loops (in the provided algorithms) or as explicitly specified (for those using bare tasks). In most cases, movement of TBB-related threads across cores (mostly in the presence of oversubscription) should not significantly upset affinity (it is not necessarily catastrophic if data has to move from time to time, as long as it is not too often), although that needs to be backed up with experiments (anyone?). In other words, you'll get a lot of low-hanging fruit by not worrying about the issue.
If you do decide to invest the development time to squeeze the last bit of performance from the machine, you could experiment with setting thread affinity in a task_scheduler_observer, which binds tasks to cores by way of threads (tasks to threads c/o TBB, threads to cores c/o your operating system's functions). As a self-professed beginner, you should not venture here, because you may very well make matters worse.
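For the "threads to cores" half, here is a minimal Linux-only sketch of the operating system call involved. It is not TBB code: the helper names are made up, and it assumes glibc's sched_setaffinity()/sched_getaffinity() (g++ defines _GNU_SOURCE by default, which exposes the CPU_* macros). In TBB, a function like pin_current_thread_to() would be called from an override of task_scheduler_observer::on_scheduler_entry(), which runs on each thread as it joins the scheduler.

```cpp
#include <cassert>
#include <sched.h>

// Pins the calling thread to a single CPU. Returns true on success.
// A pid of 0 means "the calling thread" for sched_setaffinity().
bool pin_current_thread_to(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Returns the lowest-numbered CPU the calling thread is currently allowed
// to run on, or -1 if the affinity mask cannot be read.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

Querying the allowed set first matters because a process may be restricted to a subset of CPUs (containers, taskset), in which case pinning to an arbitrary CPU number fails. Windows and other systems use different calls.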
The HPC forum may be a good place to start looking for research papers on the pros and cons of affinity-aware and NUMA-aware programming, and in particular on which apps are suitable and which are not. Although these papers will likely address solutions using pthreads as opposed to, say, TBB or QuickThread, the performance improvement (or lack thereof) from a well tuned, cache-cognizant pthread application might be indicative of what can be expected from similar cache/proximity tuning in other threading models.
As a side note, QuickThread on the Linux platform is built on top of pthreads. A major advantage is that you program with a task model instead of a (p)thread model.