Solved: Thread-pool and affinity

Prasun_Gera__Intel_ · ‎09-04-2009

I started reading the TBB documentation a couple of days back to assess whether I can use TBB for my current project or not. I have a few questions. How does TBB maintain its internal thread pool? Is it completely transparent to the user? Is there any way of binding the TBB threads to the cores? I figured that we can set the affinity of the tasks relative to the threads in TBB. If I can set the affinity of the threads relative to the cores, I would be able to set the affinity of the tasks relative to the cores, which seems to be an important requirement for the project.

Regards,

Prasun

robert-reed · ‎09-22-2009

Quoting - Gera Prasun Dineshkumar (Intel)

I hope this makes things a bit clearer. As it is a validation tool, the main function of the tool is to make the system undergo a lot of different test cases and validate the results later. Hence, I want control in phase B. The purpose of choosing different hardware thread bindings is to generate different test scenarios and different kinds of traffic. And the actual execution (phase B) has to be synchronized in that all the threads have to start execution at the same time, for example to test the system for something like memory ordering.

Well, assuming these tests operate within the preemptive time slice of the controlling OS, you can guarantee that a test will run to completion on a particular HW thread by just not returning from the associated function until the test is done; that is, make the test a single, indivisible task. Getting a particular task to run on a particular HW thread might be challenging; perhaps easier would be to have the dispatched task identify which HW thread it is on and dispatch to a particular function once the task is executed.

The last issue is synchronization. Specifically, what you want is a barrier sync on the tasks that are staging themselves for phase B. You could probably hack one together using a TBB atomic integer to keep track of the threads as they arrive at the rendezvous:

tbb::atomic<int> threads_left;
threads_left = number_of_threads();

...

tbb::parallel_invoke(test1, test2, test3, ...);

where each of the test functions has a preamble something like this:

void test1() {

    --threads_left;              // announce arrival at the rendezvous
    while (threads_left > 0)     // spin until every thread has arrived
        _mm_pause();

    // do the test ...
}

This is a hack that I just threw together with possibly incorrect syntax, and I'm sure Dmitryi, Raf and others may find flaws in my memory fencing, but this conveys the general idea. You'll need to know the number of available threads (the number of tests to run simultaneously), represented here by number_of_threads(). Once the tasks are spawned (represented here by the parallel_invoke), each arrives at its barrier. Each indicates that it has arrived by decrementing the atomic and then spins until all have arrived. The _mm_pause() here is to represent that on Simultaneous Multi-Threading hardware you might want to use a pause instruction or intrinsic to provide an opportunity for all the HW threads to get a chance to spin. Note that this will cause a lot of thrashing on the cache line containing that variable, and since only one HW thread can own the cache line for writing at a time, there will be an inevitable cascade and slight latency in the resumption of threads that have to wait to own the cache line. But I think that's as good as you're going to get. The parallel_invoke includes its own barrier, so that may not be what you'd want to use in practice; rather, you may just have the stage B tasks dispatch the stage C tasks immediately upon completion and free up the pool threads to start on the next stage A.
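For anyone who wants to try the rendezvous idea outside of TBB first, here is a minimal sketch using only the standard library. It is an illustration under assumptions, not the code from the post: SpinBarrier and run_synchronized are hypothetical names, std::atomic stands in for the TBB atomic, std::thread stands in for the pool threads, and std::this_thread::yield() stands in for _mm_pause() so the sketch also behaves when there are more threads than cores.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper (not a TBB class): one-shot spin barrier.
// Each participant decrements the counter, then spins until it reaches zero.
class SpinBarrier {
public:
    explicit SpinBarrier(int participants) : threads_left_(participants) {}

    void arrive_and_wait() {
        // Announce arrival; release makes earlier writes visible to waiters.
        threads_left_.fetch_sub(1, std::memory_order_acq_rel);
        // Spin until every participant has arrived.
        while (threads_left_.load(std::memory_order_acquire) > 0) {
            std::this_thread::yield();  // stand-in for _mm_pause()
        }
    }

private:
    std::atomic<int> threads_left_;
};

// Starts n threads, lets each finish its "staging" work, and releases them
// into "phase B" together. Returns how many threads observed that all n
// had finished staging by the time they passed the barrier.
int run_synchronized(int n) {
    SpinBarrier barrier(n);
    std::atomic<int> staged{0};
    std::atomic<int> saw_everyone{0};

    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) {
        threads.emplace_back([&] {
            staged.fetch_add(1);        // staging for phase B is done
            barrier.arrive_and_wait();  // rendezvous
            if (staged.load() == n) {   // phase B would start here
                saw_everyone.fetch_add(1);
            }
        });
    }
    for (auto& t : threads) t.join();
    return saw_everyone.load();
}
```

Calling run_synchronized(4) should return 4, since no thread can leave the barrier before all four have decremented the counter. The barrier is single-use; reusing it for a second phase would need a reset or a generation counter.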

Bartlomiej · ‎09-04-2009

Is it completely transparent to the user?

The very idea of TBB is to be as transparent as possible.

Is there any way of binding the TBB threads to the cores?

Certainly there is no direct API for this, but IMHO there is a simple workaround, provided you have:
- a function to set the affinity of kernel threads - in Linux there is a pthread_setaffinity_np() function, on Windows there must be an equivalent,
- some API for per-thread data - and even if you don't know the API on your system, TBB gives you a great (in my opinion) enumerable_thread_specific class.

So now you can have per-thread data of the form:

struct tls {
    int thread_id;
    bool affinity_assigned;
    // ... possibly something more ...

    tls(int id) : thread_id(id), affinity_assigned(false) {}
};

Now set:
enumerable_thread_specific<tls> per_thread(tls(-1));  // exemplar copied for each thread; -1 = no id assigned yet

And in each task do:
tls& my = per_thread.local();
if (!my.affinity_assigned) {
    pthread_setaffinity_np(...);
    my.affinity_assigned = true;
}

IMHO this should work fine.
Best regards
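To make the "pin once per thread" idea concrete, here is a minimal Linux-only sketch. It is an illustration under assumptions rather than tested TBB code: a plain C++ thread_local variable stands in for enumerable_thread_specific, and ThreadState, first_allowed_cpu and pin_once are hypothetical names. The only system calls used are pthread_getaffinity_np() and pthread_setaffinity_np() from glibc.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the *_np calls; g++ usually defines it already
#endif
#include <pthread.h>
#include <sched.h>

// Per-thread bookkeeping, mirroring the tls struct above.
struct ThreadState {
    bool affinity_assigned = false;
    int pinned_cpu = -1;
};

// Stand-in for enumerable_thread_specific: one instance per thread.
thread_local ThreadState my_state;

// Returns the lowest-numbered CPU the calling thread is currently allowed
// to run on, or -1 if the affinity mask cannot be read.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}

// Pins the calling thread to `cpu` the first time it is called on that
// thread; later calls on the same thread are no-ops.
// Returns true if the thread is pinned after the call.
bool pin_once(int cpu) {
    if (my_state.affinity_assigned) return true;
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0) {
        return false;  // e.g. the CPU is not available to this process
    }
    my_state.affinity_assigned = true;
    my_state.pinned_cpu = cpu;
    return true;
}
```

In a task body one would call pin_once(some_cpu) at the top; how some_cpu is chosen per thread (for example from a shared counter) is left open here.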

Alexey-Kukanov · ‎09-04-2009

The other possibility to bind TBB threads to cores with a system-specific API is via the task_scheduler_observer class.
I believe it was discussed on the forum before, and there is the TBB Reference as a starting point. If you decide to try it and have specific questions, ask here, or write me a note.

Prasun_Gera__Intel_ · ‎09-05-2009

@Bartlomiej: Thanks for your suggestion. Will try it out and come back to the forum for further queries.

@Alexey: Thanks to you too. Will look into the task_scheduler_observer class and will get back to you for further queries.

pvonkaenel · ‎09-08-2009

Quoting - Alexey Kukanov (Intel)

The other possibility to bind TBB threads to cores with a system-specific API is via the task_scheduler_observer class.
I believe it was discussed on the forum before, and there is the TBB Reference as a starting point. If you decide to try it and have specific questions, ask here, or write me a note.

I'm coming up to something similar, and from reading the reference manual it looks like the task_scheduler_observer is a great way to modify internal behavior at the application level. One question though. It sounds like you need to create your observer instance before the threads are created so that on_scheduler_entry can be called. Is that correct?

pvonkaenel · ‎09-08-2009

Quoting - pvonkaenel

I'm coming up to something similar, and from reading the reference manual it looks like the task_scheduler_observer is a great way to modify internal behavior at the application level. One question though. It sounds like you need to create your observer instance before the threads are created so that on_scheduler_entry can be called. Is that correct?

Never mind, I should have read the documentation more carefully. You can create the observer either before or after the tasks have been created.

Prasun_Gera__Intel_ · ‎09-17-2009

Quoting - Bartlomiej

Hi,

I tried what you suggested and it seems to be working. I need to be clearer about certain things to make progress. This is what I have gathered so far. enumerable_thread_specific maintains thread-local data. Using enumerable_thread_specific I can bind pthreads to desired cores, and know which threads are bound to which cores. My original problem, if you recall, was to be able to run certain tasks on certain cores. After reading the documentation, I figured that the affinity of tasks to threads is just a hint, and not enforced strictly, in the sense that a task can still be stolen by other threads. Instead, if I use pthread_setaffinity_np directly in the task, would that ensure that the task would run on the desired core? It doesn't seem likely, because if the thread is already bound to a core and is executing some code, it wouldn't be able to change the binding in the middle of execution, right? Unless of course, it has a mechanism to facilitate this sort of a context switch: save the state of the thread that is already running on the core to which you want to bind the current thread, resume it later, etc.

I suppose I should describe the entire problem to convey what I want to achieve. The application I am using has three phases, say A, B and C, which are executed in that order. Now, I am not concerned with the particular bindings in phases A and C. Phase A generates a lot of code that phase B has to execute. The execution of phase B has to start in a synchronized fashion, viz. all threads have to start executing their code simultaneously. After phase B is over, phase C can start validating the results of phase B and proceed in the same way as phase A. So, I can use TBB's templates, constructs etc. conveniently for phases A and C. Phase B is where I want to run certain tasks on certain cores in a synchronized fashion.

This makes me wonder if I can use just pthreads for phase B and TBB for phases A and C. Is that a good idea? Does that mean stopping TBB's scheduler after phase A and starting it again after phase B? I read in the documentation that we can use threads along with TBB. I haven't explored that fully yet. Otherwise, I would have to ensure that each task of phase B goes to the desired TBB queue, and that they start executing simultaneously.

The other problem is that phase B might not use all the threads. Say, phase A generates code to be run on 4 physical threads on an 8-core machine. Phase A uses all the cores for generation. Now, while phase B is executing on 4 cores (assuming I have achieved it somehow), I would want something else to run on the remaining cores, possibly phase A of another instance of the application.

If I can summarize the problem, is there any way TBB would relinquish the control of certain physical threads in the middle of a program, (by possibly stealing the tasks slated to run on those threads and migrating them to other threads) and assume control again later when asked to? Or is there any way you can have direct control over the TBB queues if necessary for certain parts?

Seems like a very specific problem, and a rather long post for it. Any help/suggestions/workarounds are appreciated.

Regards,

Prasun

Vivek_Rajagopalan · ‎09-17-2009

Quoting - Gera Prasun Dineshkumar (Intel)

Hi,

Instead, if I use pthread_setaffinity_np directly in the task, would that ensure that the task would run on the desired core? It doesn't seem likely, because if the thread is already bound to a core and is executing some code, it wouldn't be able to change the binding in the middle of execution, right? Unless of course, it has a mechanism to facilitate this sort of a context switch: save the state of the thread that is already running on the core to which you want to bind the current thread, resume it later, etc.

From the man page: http://www.kernel.org/doc/man-pages/online/pages/man3/pthread_setaffinity_np.3.html

The pthread_setaffinity_np() sets the CPU affinity mask of the thread thread
to the CPU set pointed to by cpuset. If the call is successful, and the
thread is not currently running on one of the CPUs in cpuset, then it is
migrated to one of those CPUs.

I don't know if the migration is immediate or at the next scheduling interval for the process.
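One way to check this empirically on a given Linux system is to have a thread restrict itself to a single CPU and then ask where it is running. The sketch below is only that, a probe under assumptions: MigrationResult, first_allowed_cpu and migrate_self_to are hypothetical names, and it relies on the glibc calls pthread_setaffinity_np(), pthread_getaffinity_np() and sched_getcpu(). It makes no claim about what the kernel guarantees; it just reports what was observed right after the call.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the *_np calls and sched_getcpu()
#endif
#include <pthread.h>
#include <sched.h>

// What was observed immediately after changing the affinity mask.
struct MigrationResult {
    bool call_ok;       // pthread_setaffinity_np() returned 0
    bool mask_matches;  // mask read back contains exactly the requested CPU
    int cpu_after;      // sched_getcpu() right after the call (-1 on error)
};

// Returns the lowest-numbered CPU the calling thread may run on, or -1.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}

// Restricts the calling thread to `cpu`, then reads back the mask and the
// CPU the thread is currently executing on.
MigrationResult migrate_self_to(int cpu) {
    MigrationResult result{false, false, -1};
    if (cpu < 0 || cpu >= CPU_SETSIZE) return result;

    cpu_set_t wanted;
    CPU_ZERO(&wanted);
    CPU_SET(cpu, &wanted);
    if (pthread_setaffinity_np(pthread_self(), sizeof(wanted), &wanted) != 0) {
        return result;
    }
    result.call_ok = true;

    cpu_set_t got;
    CPU_ZERO(&got);
    if (pthread_getaffinity_np(pthread_self(), sizeof(got), &got) == 0) {
        result.mask_matches = (CPU_COUNT(&got) == 1) && CPU_ISSET(cpu, &got);
    }
    result.cpu_after = sched_getcpu();
    return result;
}
```

If cpu_after equals the requested CPU every time the probe is run from a thread that started elsewhere, that suggests the migration happens before the call returns on that system; a mismatch would suggest it is deferred.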

robert-reed · ‎09-18-2009