Intel® oneAPI Threading Building Blocks

Using core(-s) specification

IKlm

Hi. I'm a novice at parallel programming.

I would like to know whether there is a way to choose which specific cores of a multi-core processor my program runs on.

I have problems when I use fewer threads than there are logical cores on an Intel Xeon X5550, for example 5 threads. The execution time depends dramatically on which cores the OS runs the program on, so it would be good if I could control that somehow.

1 Solution
Dmitry_Vyukov
Quoting - IKlm
Hi. I'm a novice at parallel programming.

I would like to know whether there is a way to choose which specific cores of a multi-core processor my program runs on.

I have problems when I use fewer threads than there are logical cores on an Intel Xeon X5550, for example 5 threads. The execution time depends dramatically on which cores the OS runs the program on, so it would be good if I could control that somehow.


To limit TBB to 5 threads, pass 5 to the task_scheduler_init constructor.
Then create your own task_scheduler_observer and set the required affinity for the worker threads in task_scheduler_observer::on_scheduler_entry(). You need to use the platform-specific affinity API (SetThreadAffinityMask() on Windows, pthread_setaffinity_np() on Linux).
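
The affinity call mentioned above could look roughly like the following on Linux. This is only a sketch: pin_current_thread is a hypothetical helper name, not a TBB function, and it wraps pthread_setaffinity_np() for the calling thread. A class derived from tbb::task_scheduler_observer could call it from its on_scheduler_entry() override; that wiring is left out so the sketch compiles without TBB.

```cpp
// Linux-only sketch. g++ defines _GNU_SOURCE by default, which
// pthread_setaffinity_np() and the CPU_* macros require.
#include <pthread.h>
#include <sched.h>

// Hypothetical helper (not part of TBB): restrict the calling thread
// to a single logical CPU. Returns true on success.
bool pin_current_thread(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) {
        return false;  // outside the range a cpu_set_t can describe
    }
    cpu_set_t mask;
    CPU_ZERO(&mask);      // start with an empty CPU set
    CPU_SET(cpu, &mask);  // allow exactly one logical CPU
    // pthread_setaffinity_np() returns 0 on success, an error number otherwise.
    return pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask) == 0;
}
```

Which CPU number each thread receives is a policy decision for the caller; one option is to hand out numbers from a fixed list through an atomic counter, one per thread entering the scheduler.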

IKlm
Quoting - Dmitriy Vyukov

Then create your own task_scheduler_observer and set the required affinity for the worker threads in task_scheduler_observer::on_scheduler_entry(). You need to use the platform-specific affinity API (SetThreadAffinityMask(), pthread_setaffinity_np()).

Thank you. This works well.

IKlm
Now I have a strange situation:
I run 2 threads on an X5550 (which has 4 cores and Hyper-Threading) and get a speedup of about 2.0, as expected.
I run 4 threads on different logical cores (0123 or 0145 or 0157, for example; each digit is the ID of one of the 4 logical cores used), but I always get a speedup of about 2.5.
This is strange. Using 4 logical cores that map to 2 physical cores (that is, relying on HT) should give much more overhead than using 4 logical cores that map to 4 different physical cores. Or not?

I use parallel_for to run 2000 independent, identical tasks. The program's execution time is about 10 seconds.

Can somebody explain this?

P.S.
According to /proc/cpuinfo:
cpuId = 0,4 correspond to the 1st physical core
cpuId = 1,5 correspond to the 2nd physical core
cpuId = 2,6 correspond to the 3rd physical core
cpuId = 3,7 correspond to the 4th physical core
So I would expect a speedup of about 4 when I use 0123, and about 2.5 when I use 0145.
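
A mapping like the one listed above can be derived programmatically. The following is a rough sketch, assuming x86-style /proc/cpuinfo records with "processor", "physical id" and "core id" fields separated by blank lines; CoreMap and group_logical_cpus_by_core are hypothetical names introduced for illustration.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Key: (physical id, core id). Value: logical CPU numbers on that core.
using CoreMap = std::map<std::pair<int, int>, std::vector<int>>;

// Hypothetical helper: parse text in /proc/cpuinfo format and group
// logical CPUs that share a physical core.
CoreMap group_logical_cpus_by_core(std::istream& in) {
    CoreMap cores;
    int cpu = -1, package = 0, core = -1;
    // Store the record collected so far and reset for the next one.
    auto flush = [&] {
        if (cpu >= 0 && core >= 0) cores[{package, core}].push_back(cpu);
        cpu = -1; package = 0; core = -1;
    };
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) { flush(); continue; }  // blank line ends a record
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::string key = line.substr(0, colon);
        while (!key.empty() && (key.back() == ' ' || key.back() == '\t')) {
            key.pop_back();  // strip the padding before the colon
        }
        int value = 0;
        std::istringstream field(line.substr(colon + 1));
        if (!(field >> value)) continue;  // skip non-numeric fields
        if (key == "processor") cpu = value;
        else if (key == "physical id") package = value;
        else if (key == "core id") core = value;
    }
    flush();  // the last record may not end with a blank line
    return cores;
}
```

On a Linux machine the function could be fed a std::ifstream opened on /proc/cpuinfo. Two logical CPUs that end up in the same vector are Hyper-Threading siblings.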
IKlm

The problem is solved.

I had declared a timer inside the procedure that runs in parallel and forgot about it. This doubled the single-thread execution time and cost me speedup when I used many threads.

Sorry for the trouble.

IKlm
Quoting - Dmitriy Vyukov

To limit TBB to 5 threads, pass 5 to the task_scheduler_init constructor.
Then create your own task_scheduler_observer and set the required affinity for the worker threads in task_scheduler_observer::on_scheduler_entry(). You need to use the platform-specific affinity API (SetThreadAffinityMask(), pthread_setaffinity_np()).

Hello.

Dmitriy, maybe you or somebody else can help me again. I need to run specific, different tasks on specific cores. But from within task_scheduler_observer::on_scheduler_entry() I can't determine which task the current thread will run. I can get the thread ID, but this is the first time the thread has this ID (right?), so I can't establish a correspondence between tasks and thread IDs before the observer runs.

I tried dropping the observer and moving its code to the beginning of operator() of the corresponding task, but I lost 50% of the speed. (I would also be interested to know why the observer is faster.)

So, can somebody help me with this problem?

Prasun_Gera__Intel_

I had a somewhat similar requirement. You can find the related thread at
http://software.intel.com/en-us/forums/showthread.php?t=68135

As far as I know, there is no direct way to bind a task to a core. You can use affinity as a hint for the scheduler, but because of work stealing there is no guarantee that a particular task will execute on a particular core. Until a task actually starts executing, it can always be stolen by another thread. So, as Robert said in the post I referred to, getting a particular task to run on a particular HW thread might be challenging; it may be easier to have the dispatched task identify which HW thread it is running on and dispatch to a particular function once the task is executing.
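
The "identify the HW thread, then dispatch" idea could be sketched as follows on Linux. This assumes the worker threads are already pinned, since otherwise the CPU number can change right after it is read. CpuHandlers and run_for_current_cpu are hypothetical names; sched_getcpu() is the glibc call that reports the logical CPU the caller is currently running on.

```cpp
// Linux-only sketch; sched_getcpu() is a glibc extension.
#include <sched.h>

#include <functional>
#include <map>

// Maps a logical CPU number to the work that should run there.
using CpuHandlers = std::map<int, std::function<void()>>;

// Hypothetical helper: called at the start of a task body. Runs the
// handler registered for the CPU the calling thread is on, or the
// fallback when none is registered (or the CPU cannot be determined).
// Returns true if a CPU-specific handler ran.
bool run_for_current_cpu(const CpuHandlers& handlers,
                         const std::function<void()>& fallback) {
    const int cpu = sched_getcpu();  // -1 on error
    const auto it = handlers.find(cpu);
    if (it != handlers.end()) {
        it->second();
        return true;
    }
    fallback();
    return false;
}
```

With this shape the tasks themselves stay interchangeable, so it does not matter which thread picks up or steals which task.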
IKlm

Thanks for the link!

Task stealing is not a problem for me: I run only one task per thread, so tasks do not wait in the pool.

ARCH_R_Intel
Having only P tasks for P cores is usually not the best way to use TBB or any other task-stealing scheduler, because the scheduler cannot do load balancing. Ideally, the number of tasks should be much larger than the number of cores, so that the scheduler can balance the load. The primary consideration in picking the task size should not be the number of cores but amortizing scheduler overhead. About 10,000 instructions per task typically suffices.

Though it may appear trivial to write P tasks that each do 1/P of the work, it is not so on modern machines, because of page faults, cache misses, OS interruptions, etc. It's like sharing an apple. Don't try to cut it into equal pieces. Make applesauce.
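
To illustrate the "many small pieces" idea without depending on TBB, here is a sketch that swaps in plain std::thread and an atomic counter: threads claim small chunks of the iteration space until none are left, so a thread that gets delayed simply claims fewer chunks. chunked_for is a hypothetical helper; in TBB itself, parallel_for over a blocked_range with a grain size plays this role, and its work-stealing scheduler is more sophisticated than this shared counter.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: call body(i) once for every i in [0, n), using
// `threads` threads that claim chunks of `grain` iterations on demand.
template <typename Body>
void chunked_for(std::size_t n, std::size_t grain, unsigned threads, Body body) {
    if (n == 0) return;
    grain = std::max<std::size_t>(1, std::min(grain, n));
    threads = std::max(1u, threads);
    std::atomic<std::size_t> next{0};  // start of the next unclaimed chunk
    auto worker = [&] {
        for (;;) {
            const std::size_t begin = next.fetch_add(grain);
            if (begin >= n) break;  // nothing left to claim
            const std::size_t end = begin + std::min(grain, n - begin);
            for (std::size_t i = begin; i < end; ++i) body(i);
        }
    };
    std::vector<std::thread> pool;
    for (unsigned t = 1; t < threads; ++t) pool.emplace_back(worker);
    worker();  // the calling thread takes part as well
    for (auto& th : pool) th.join();
}
```

With n = 2000 and a grain of 16 there are 125 chunks to share among the threads, instead of one fixed slice per thread.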
