Intel® oneAPI Threading Building Blocks

Thread-pool and affinity

Prasun_Gera__Intel_
3,015 Views

I started reading the TBB documentation a couple of days back to assess whether I can use TBB for my current project or not. I have a few questions. How does TBB maintain its internal thread pool? Is it completely transparent to the user? Is there any way of binding the TBB threads to the cores? I figured that we can set the affinity of the tasks relative to the threads in TBB. If I can set the affinity of the threads relative to the cores, I would be able to set the affinity of the tasks relative to the cores, which seems to be an important requirement for the project.

Regards,
Prasun
0 Kudos
1 Solution
11 Replies
Bartlomiej
New Contributor I
3,015 Views

Is it completely transparent to the user?

The very idea of TBB is to be as transparent as possible.

Is there any way of binding the TBB threads to the cores?

Certainly there is no direct API for this, but IMHO there is a simple workaround if only you have:
- a function to set the affinity of kernel threads - on Linux there is the pthread_setaffinity_np() function (pthread_attr_setaffinity_np() only applies to threads you create yourself); on Windows there must be an equivalent,
- some API for per-thread data - and even if you don't know the API on your system, TBB gives you a great (in my opinion) enumerable_thread_specific class.

So now you can have a per-thread data of the form:

struct tls {
    int thread_id;
    bool affinity_assigned;
    //... possibly sth more...

    // default argument so that enumerable_thread_specific can default-construct it
    tls (int id = -1) : thread_id(id), affinity_assigned(false) {}
};

Now set:
tbb::enumerable_thread_specific<tls> per_thread;

And in each task do:
tls& my = per_thread.local();
if (!my.affinity_assigned) {
    pthread_setaffinity_np (...);   // bind the calling thread, once
    my.affinity_assigned = true;
}

IMHO this should work fine.
Best regards
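
A minimal sketch of the "pin each thread once" idea above, assuming Linux with glibc. To keep it free of a TBB dependency it uses a plain C++ thread_local record as a stand-in for enumerable_thread_specific's local(); the names local_tls, pin_current_thread_once and first_allowed_cpu are hypothetical helpers, not part of TBB or pthreads.

```cpp
#include <pthread.h>
#include <sched.h>
#include <cassert>

// Per-thread record, mirroring the `tls` struct from the post.
struct tls {
    int cpu = -1;                    // CPU this thread was pinned to, -1 if none
    bool affinity_assigned = false;  // set after the first successful pin
};

// Stand-in for per_thread.local(): one record per OS thread.
inline tls& local_tls() {
    thread_local tls t;
    return t;
}

// Pin the calling thread to `cpu` the first time this is called on that thread.
// Returns 0 on success (or if already pinned), otherwise the pthread error code.
inline int pin_current_thread_once(int cpu) {
    tls& my = local_tls();
    if (my.affinity_assigned) return 0;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc == 0) {
        my.cpu = cpu;
        my.affinity_assigned = true;
    }
    return rc;
}

// First CPU the calling thread is currently allowed to run on, or -1 on error.
inline int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0) return -1;
    for (int c = 0; c < CPU_SETSIZE; ++c)
        if (CPU_ISSET(c, &set)) return c;
    return -1;
}
```

In a TBB program the call to pin_current_thread_once() would sit at the top of the task body, so each worker pays for the system call only once. Which CPU each worker should pick is left open here, as it is in the post.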
0 Kudos
Alexey-Kukanov
Employee
3,015 Views
The other possibility to bind TBB threads to cores with a system-specific API is via the task_scheduler_observer class.
I believe it was discussed at the forum before, and there is the TBB Reference for the starting point. If you decide to try it and have specific questions, ask here, or write me a note.
0 Kudos
Prasun_Gera__Intel_
3,015 Views
@ Bartlomiej: Thanks for your suggestion. Will try it out and come back to the forum for further queries.
@Alexey: Thanks to you too. Will look into the task_scheduler_observer class and will get back to you for further queries.
0 Kudos
pvonkaenel
New Contributor III
3,015 Views
The other possibility to bind TBB threads to cores with a system-specific API is via the task_scheduler_observer class.
I believe it was discussed at the forum before, and there is the TBB Reference for the starting point. If you decide to try it and have specific questions, ask here, or write me a note.

I'm coming up to something similar and from reading the reference manual it looks like the task_scheduler_observer is a great way to modify internal behavior at the application level. One question, though. It sounds like you need to create your observer instance before the threads are created so that on_scheduler_entry can be called. Is that correct?
0 Kudos
pvonkaenel
New Contributor III
3,015 Views
Quoting - pvonkaenel

I'm coming up to something similar and from reading the reference manual it looks like the task_scheduler_observer is a great way to modify internal behavior at the application level. One question, though. It sounds like you need to create your observer instance before the threads are created so that on_scheduler_entry can be called. Is that correct?

Never mind, I should have read the documentation more carefully. You can create the observer either before or after the tasks have been created.
0 Kudos
Prasun_Gera__Intel_
3,015 Views
@Bartlomiej:

Hi,

I tried what you suggested and it seems to be working. I need to be clearer about certain things to make progress. This is what I have gathered so far. enumerable_thread_specific maintains thread-local data. Using enumerable_thread_specific I can bind pthreads to desired cores, and know which threads are bound to which cores. My original problem, if you might recall, was to be able to run certain tasks on certain cores. After reading the documentation, I figured that the affinity of tasks to threads is just a hint, and not enforced strictly, in the sense that a task can be stolen by other threads. Instead, if I use pthread_setaffinity_np directly in the task, would that ensure that the task would run on the desired core? It doesn't seem likely, because if the thread is already bound to a core and is executing some code, it wouldn't be able to change the binding in the middle of execution, right? Unless, of course, it has a mechanism to facilitate this sort of context switch: save the state of the thread that is already running on the core to which you want to bind the current thread, resume it later, etc.

I suppose I should describe the entire problem to convey what I want to achieve. The application I am using has three phases, say A, B and C, which are executed in that order. Now, I am not concerned with the particular bindings in phases A and C. Phase A generates a lot of code that phase B has to execute. The execution of phase B has to start in a synchronized fashion, viz. all threads have to start executing their code simultaneously. After phase B is over, phase C can start validating the results of phase B and proceed in the same way as phase A. So, I can use TBB's templates, constructs etc. conveniently for phases A and C. Phase B is where I want to run certain tasks on certain cores in a synchronized fashion.

This makes me wonder if I can use just pthreads for phase B and TBB for phases A and C. Is that a good idea? Does that mean stopping TBB's scheduler after phase A and starting it again after phase B? I read in the documentation that we can use threads along with TBB. I haven't explored that fully yet. Otherwise, I would have to ensure that each task of phase B goes to the desired TBB queue, and that they start executing simultaneously.

The other problem is that phase B might not use all the threads. Say, phase A generates code to be run on 4 physical threads on an 8 core machine. Phase A uses all the cores for generation. Now, while phase B is executing on 4 cores (assuming I have achieved it somehow), I would want something else to run on the remaining cores. Possibly, the phase A of another instance of the application.

If I can summarize the problem, is there any way TBB would relinquish the control of certain physical threads in the middle of a program, (by possibly stealing the tasks slated to run on those threads and migrating them to other threads) and assume control again later when asked to? Or is there any way you can have direct control over the TBB queues if necessary for certain parts?

Seems like a very specific problem, and a rather long post for it. Any help/suggestions/workarounds are appreciated.

Regards,

Prasun

0 Kudos
Vivek_Rajagopalan
3,015 Views

Hi,

Instead, if I use pthread_setaffinity_np directly in the task, would that ensure that the task would run on the desired core? It doesn't seem likely, because if the thread is already bound to a core and is executing some code, it wouldn't be able to change the binding in the middle of execution, right? Unless, of course, it has a mechanism to facilitate this sort of context switch: save the state of the thread that is already running on the core to which you want to bind the current thread, resume it later, etc.



From the manpage : http://www.kernel.org/doc/man-pages/online/pages/man3/pthread_setaffinity_np.3.html

The pthread_setaffinity_np() sets the CPU affinity mask of the thread thread
to the CPU set pointed to by cpuset. If the call is successful, and the
thread is not currently running on one of the CPUs in cpuset, then it is
migrated to one of those CPUs.


I don't know if the migration is immediate or happens at the next scheduling interval for the process.


0 Kudos
robert-reed
Valued Contributor II
3,015 Views
Quoting - Gera Prasun Dineshkumar (Intel):
My original problem, if you might recall, was to be able to run certain tasks on certain cores.

Well, actually, that is the method you are using to try to solve an underlying problem that you have not explained.

Quoting - Gera Prasun Dineshkumar (Intel):
I suppose I should describe the entire problem to convey what I want to achieve. The application I am using has three phases, say A, B and C, which are executed in that order. Now, I am not concerned with the particular bindings in phases A and C. Phase A generates a lot of code that phase B has to execute. The execution of phase B has to start in a synchronized fashion, viz. all threads have to start executing their code simultaneously. After phase B is over, phase C can start validating the results of phase B and proceed in the same way as phase A. So, I can use TBB's templates, constructs etc. conveniently for phases A and C. Phase B is where I want to run certain tasks on certain cores in a synchronized fashion.

Some of the requirements described here send up little red flags. "All threads have to start executing the code simultaneously"? "...run certain tasks on certain cores in a synchronized fashion"? It sounds like you might be trying to set up some sort of producer-consumer pattern, and it's still not clear why certain tasks must be constrained to certain HW threads.

Quoting - Gera Prasun Dineshkumar (Intel):
This makes me wonder if I can use just pthreads for phase B and TBB for phases A and C. Is that a good idea?

Certainly you could create a pthreads pool and a TBB pool that could exist simultaneously, but you'd want to take care that only one pool is active (not sitting in an idle or a yield loop) at a time, or suffer from a thread thrashing problem that would limit performance. The question is whether this much hair is necessary.

Quoting - Gera Prasun Dineshkumar (Intel):
The other problem is that phase B might not use all the threads. Say, phase A generates code to be run on 4 physical threads on an 8 core machine. Phase A uses all the cores for generation. Now, while phase B is executing on 4 cores (assuming I have achieved it somehow), I would want something else to run on the remaining cores. Possibly, the phase A of another instance of the application.

Now the need for HW thread affinity becomes even more questionable to me. Heretofore I thought that phase A was intended to set up some data (which you called code) in the caches of particular HW threads that would be exploited at the start of phase B, and thus the constraint to stay on a particular thread. If the number of active threads changes at the start of B, any accumulated and cached data will evaporate as phase B proceeds.

Quoting - Gera Prasun Dineshkumar (Intel):
If I can summarize the problem, is there any way TBB would relinquish the control of certain physical threads in the middle of a program, (by possibly stealing the tasks slated to run on those threads and migrating them to other threads) and assume control again later when asked to? Or is there any way you can have direct control over the TBB queues if necessary for certain parts?

The simple answer for TBB regarding relinquishing control of threads in the middle of a program is "not yet." However, the TBB task scheduler is non-preemptive, so tasks scheduled to a particular thread will only give up control when they return (or when the preemptive OS process scheduler does a process switch). If it doesn't matter which specific HW thread a particular task gets assigned to (only that it stay there for the duration of the task), and you're not wedded to the notion that the number of tasks may vary between phases A and B, perhaps you could just meld B into the backside of A and define a rendezvous point (TBB doesn't provide a semaphore, but maybe you could construct one using TBB locks and a little bit of hackery) where threads finishing phase A wait until all are done, before being released to proceed with B.

I don't know if this is useful advice or not even close to the mark. Knowing more about the affinity constraint and its bearing on your algorithm, assuming you can share such details, would help.
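
A minimal sketch of such a rendezvous point between phases A and B, assuming standard C++ threads in place of TBB locks. It is a blocking, one-shot gate: threads that finish phase A early sleep instead of spinning, at the cost of some wake-up latency when the last one arrives. The class name Rendezvous and the helper phases_were_ordered are hypothetical, and the phase bodies are placeholders.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// One-shot gate: each of `n` participants calls arrive_and_wait() exactly once.
class Rendezvous {
public:
    explicit Rendezvous(int n) : waiting_for_(n) {}

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lk(m_);
        if (--waiting_for_ == 0) {   // last arrival releases everyone
            cv_.notify_all();
            return;
        }
        cv_.wait(lk, [this] { return waiting_for_ == 0; });
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int waiting_for_;
};

// Runs `n` threads through "phase A", the gate, then "phase B".
// Returns true if no thread entered phase B before all had finished phase A.
inline bool phases_were_ordered(int n) {
    Rendezvous gate(n);
    std::atomic<int> a_done{0};
    std::atomic<bool> ordered{true};
    std::vector<std::thread> pool;
    for (int i = 0; i < n; ++i) {
        pool.emplace_back([&] {
            a_done.fetch_add(1);                        // phase A (placeholder)
            gate.arrive_and_wait();
            if (a_done.load() != n) ordered = false;    // phase B precondition
        });
    }
    for (auto& t : pool) t.join();
    return ordered.load();
}
```

If the start of phase B has to be as tight as possible, a spinning barrier is probably the better fit; this blocking variant suits the case where phase A durations differ a lot and idle cores should not burn cycles.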

0 Kudos
Bartlomiej
New Contributor I
3,015 Views
If I can summarize the problem, is there any way TBB would relinquish the control of certain physical threads in the middle of a program, (by possibly stealing the tasks slated to run on those threads and migrating them to other threads) and assume control again later when asked to? Or is there any way you can have direct control over the TBB queues if necessary for certain parts?

From your first post I understood that setting affinity of threads to cores is sufficient. Setting affinity of tasks means (as far as I understand) that we have to prevent task stealing. In this case I can see two ways:

(i) Each task stores information about which thread should run it. Each thread has a dedicated queue (an array of concurrent queues in shared memory), initially empty. When starting a task, a thread checks whether the task belongs to it; if not, it enqueues the task to the queue of its owner. After finishing all its jobs, each thread checks its dedicated queue and executes the tasks from it. Hopefully task stealing won't occur often and there will be only a few such "lost" tasks.

(ii) Implement the task mechanism on your own: use, say, the tbb_thread class (or even POSIX threads) and a (concurrent?) queue for each thread. The previous phase enqueues dedicated tasks to the proper queues. No migration is done, as you simply don't implement it.

I hope this will be helpful. If not - possibly you should use the variant with this task scheduler observer - I had no time to read about this TBB component yet.

Best regards
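
Variant (i) above could be sketched roughly as follows, assuming a mutex-protected std::deque per thread instead of a TBB concurrent queue, and assuming each thread knows its own index. OwnedTask and OwnerQueues are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// A task tagged with the index of the thread that must execute it.
struct OwnedTask {
    std::size_t owner;
    std::function<void()> body;
};

class OwnerQueues {
public:
    explicit OwnerQueues(std::size_t n) : queues_(n), locks_(n) {}

    // Called by thread `self` when it picks up `t`.
    // Runs the task if `self` owns it, otherwise forwards it to the owner's queue.
    // Returns true if the task ran on the calling thread.
    bool run_or_forward(std::size_t self, OwnedTask t) {
        if (t.owner == self) {
            t.body();
            return true;
        }
        const std::size_t owner = t.owner;
        std::lock_guard<std::mutex> g(locks_[owner]);
        queues_[owner].push_back(std::move(t));
        return false;
    }

    // Called by thread `self` after its regular work is done:
    // executes everything that was forwarded to it. Returns the number of tasks run.
    std::size_t drain(std::size_t self) {
        std::size_t n = 0;
        for (;;) {
            OwnedTask t;
            {
                std::lock_guard<std::mutex> g(locks_[self]);
                if (queues_[self].empty()) break;
                t = std::move(queues_[self].front());
                queues_[self].pop_front();
            }
            t.body();   // run outside the lock so forwarding is never blocked
            ++n;
        }
        return n;
    }

private:
    std::vector<std::deque<OwnedTask>> queues_;
    std::vector<std::mutex> locks_;
};
```

One open point the sketch does not solve: a thread has to keep calling drain() until it knows no more tasks can be forwarded to it, so some termination signal between threads is still needed.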

0 Kudos
Prasun_Gera__Intel_
3,015 Views

I don't know if this is useful advice or not even close to the mark. Knowing more about the affinity constraint and its bearing on your algorithm, assuming you can share such details, would help.

Hi,

The application I am talking about is a system validation tool. When I say phase A generates code for phase B to execute, I mean that phase A creates a code layout and a data layout, and adds instructions to the code layout. For example, phase A can call a function called construct_load or construct_store which would add the instruction with its opcode, arguments etc at an address in the code layout. So when phase A is done with setting up everything, the IP would switch to the beginning of the code layout and phase B would commence, which is nothing but the execution of the instructions set up by phase A. After phase B finishes, phase C can validate the results.

I hope this makes things a bit clearer. As it is a validation tool, its main function is to make the system undergo a lot of different test cases and validate the results later. Hence, I want control in phase B. The purpose of choosing different hardware thread bindings is to generate different test scenarios and different kinds of traffic. And the actual execution (phase B) has to be synchronized, in that all the threads have to start execution at the same time, for example to test the system for something like memory ordering.

Hence, I can use TBB conveniently for phases A and C. However, I need more control for phase B. And also, as I said, for a particular test case, if I am using only a few physical threads in phase B, I would want the remaining ones to do some other work. That necessitates either a solution that lets my own threads work alongside TBB, or overriding TBB's functionality sufficiently to achieve a task-based implementation where the phase B tasks go to the desired TBB deques and their execution is synchronized.

Regards,

Prasun

0 Kudos
robert-reed
Valued Contributor II
3,016 Views
I hope this makes things a bit clearer. As it is a validation tool, its main function is to make the system undergo a lot of different test cases and validate the results later. Hence, I want control in phase B. The purpose of choosing different hardware thread bindings is to generate different test scenarios and different kinds of traffic. And the actual execution (phase B) has to be synchronized, in that all the threads have to start execution at the same time, for example to test the system for something like memory ordering.


Well, assuming these tests operate within the preemptive time slice of the controlling OS, you can guarantee that a test will run to completion on a particular HW thread by just not returning from the associated function until the test is done; that is, make the test a single, indivisible task. Getting a particular task to run on a particular HW thread might be challenging; perhaps easier would be to have the dispatched task identify which HW thread it is on and dispatch to a particular function once the task is executed.
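
A minimal sketch of that "find out where I am, then pick the function" idea, assuming Linux with glibc, where sched_getcpu() reports the logical CPU the caller is running on. The table layout (one test body per slot, indexed by CPU number modulo the table size) and the name dispatch_by_cpu are assumptions made for illustration.

```cpp
#include <sched.h>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Looks up the logical CPU the caller is currently on and runs the matching
// test body from `tests`. Returns the slot that was run, or -1 if the CPU
// could not be determined or the table is empty.
inline int dispatch_by_cpu(const std::vector<std::function<void()>>& tests) {
    const int cpu = sched_getcpu();
    if (cpu < 0 || tests.empty()) return -1;
    const std::size_t slot = static_cast<std::size_t>(cpu) % tests.size();
    tests[slot]();
    return static_cast<int>(slot);
}
```

Unless the thread has already been bound to a single CPU, the OS may migrate it between the sched_getcpu() call and the test body, so this lookup is only reliable in combination with thread affinity.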

The last issue is synchronization. Specifically, what you want is a barrier sync on the tasks that are staging themselves for phase B. You could probably hack one together using a TBB atomic integer to keep track of the threads as they arrive at the rendezvous:


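// tbb::atomic as in classic TBB; current oneTBB dropped it in favor of std::atomic
tbb::atomic<int> threads_left;
threads_left = number_of_threads();

...

// pass the functions themselves, not the results of calling them
tbb::parallel_invoke(test1, test2, test3, ...);


where each of the test functions has a preamble something like this


void test1() {

    --threads_left;               // announce arrival at the rendezvous
    while (threads_left > 0)      // spin until every test has arrived
        _mm_pause();

    // do the test ...
}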


This is a hack that I just threw together with possibly incorrect syntax, and I'm sure Dmitryi, Raf and others may find flaws in my memory fencing, but this conveys the general idea. You'll need to know the number of available threads (the number of tests to run simultaneously), represented here by number_of_threads(). Once the tasks are spawned (represented here by the parallel_invoke), each arrives at its barrier. Each indicates that it has arrived by decrementing the atomic and then spins until all have arrived. The _mm_pause() here is to represent that on Simultaneous Multi-Threading hardware you might want to use a pause instruction or intrinsic to give all the HW threads a chance to spin.

Note that this will cause a lot of thrashing on the cache line containing that variable, and since only one HW thread can own the cache line for writing at a time, there will be an inevitable cascade and slight latency in the resumption of threads that have to wait to own the cache line. But I think that's as good as you're going to get. The parallel_invoke includes its own barrier, so that may not be what you'd want to use in practice; rather, you may just have the stage B tasks dispatch the stage C tasks immediately upon completion and free up the pool threads to start on the next stage A.
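
For readers who want to try the barrier outside of TBB, here is a minimal, self-contained sketch using std::atomic and std::thread in place of tbb::atomic and parallel_invoke. SpinBarrier, cpu_relax and run_synchronized are hypothetical names; the memory orderings are one reasonable choice, not something taken from the post, and the "test" body is a placeholder check.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }              // pause hint, as in the post
#else
inline void cpu_relax() { std::this_thread::yield(); } // portable fallback
#endif

// One-shot spin barrier: every participant calls arrive_and_wait() exactly once.
class SpinBarrier {
public:
    explicit SpinBarrier(int participants) : left_(participants) {}

    void arrive_and_wait() {
        left_.fetch_sub(1, std::memory_order_acq_rel);      // announce arrival
        while (left_.load(std::memory_order_acquire) > 0)   // spin until all arrived
            cpu_relax();
    }

private:
    std::atomic<int> left_;
};

// Starts `n` threads that rendezvous at the barrier before running their "test".
// Returns how many threads saw that all `n` had arrived when they were released.
inline int run_synchronized(int n) {
    SpinBarrier barrier(n);
    std::atomic<int> arrived{0};
    std::atomic<int> saw_all{0};
    std::vector<std::thread> pool;
    for (int i = 0; i < n; ++i) {
        pool.emplace_back([&] {
            arrived.fetch_add(1);
            barrier.arrive_and_wait();
            if (arrived.load() == n) saw_all.fetch_add(1);  // "do the test ..."
        });
    }
    for (auto& t : pool) t.join();
    return saw_all.load();
}
```

The barrier only works if all participants can run at the same time. With more participants than HW threads, or inside a TBB pool that has fewer workers than tests, the spinning threads would starve the ones that have not arrived yet.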

0 Kudos