how to assign threads to each core in linux 2.6.31 kernel scheduler?

kewenpan · ‎04-08-2010

i have heard that recently linux 2.6.31 could support completly SMP architecture.
and then linux 2.6.31 also support CMP architeture ,but it can not make full use of multi-core performance.
there still would not be a mature os designed specially for multi-core arthitecture ,exspecially lack of perfect
scheduling algorithm.
i am uncertainty that how to assign threads to each core in modern multi-core architecture os,such as linux 2.6.31 .
that is what's the important basis of assigning threads to each core?
what's the basis of migrating threads when the scheduler launchs load balancing in multi-core architecture?
thank you for reading and looking forward to your advice.

jimdempseyatthecove · ‎04-08-2010

In a pthreads programming environment you typically have (in pseudo code)

launch thread A
launch thread B
...
launch thread Z
-----

thread A
initialization part of A
loop:
get item from message/queue (wait if necessary)
do work
if done or error then exit
goto loop

When you desire to add affinity pinning to the thread then the "initialization part of A" will require you to add code to determine what cores are available (what number of hardware threads are available), what cpi loads have already been allocated to each core, what cpi load functional code of A has, other factors (e.g. IO, FPU vs integer, etc..), then a determination of which core to use (hardware thead is made), then a system API is called to migrate the current software thread's execution to that core (or group of cores) and to restrict it to run on that (those) core(s). Then add to core(s) load the load value (CPI) for the functional code of the do work of A. This is repeated in each thread's initialization part (in a critical section)

Consult your system call API for functions relating to Affinity.

Jim Dempsey

TimP · ‎04-08-2010

Presumably, you still have available standard linux functionality such as numactl, taskset, and similar OpenMP or MPI environment settings, like GOMP_CPU_AFFINITY, KMP_AFFINITY, mpi_paffinity_alone, I_MPI_PIN_DOMAIN.
I don't see any publicized changes in pthread affinity functionality with this kernel version nor an indication that a major change in glibc is implied.

kewenpan · ‎04-08-2010

thank you ,Jim Dempsey
Your explantation on assigning threads to each core in multi-core architectureis perfect .
I learn more about this aspect.
I have another idea about load balancing in multi-core.
Could we considerthe averagecpi of core as a avaliable metric which reflect load situation in recent core when tick launch load balance ?
so that we change some threads from the average high cpi core to migrate to another core of low cpi(threadsset affinity are not migrated).
It's just my idea and i don'know whether it is feasible ?

kewenpan · ‎04-08-2010

thank you ,tim18
I have read some topics on cpu affinity.
I think it is a important method for improving performance and making full use of each core.
how to achieve affinity to improve performance in multi-core architecture will be a question.

jimdempseyatthecove · ‎04-08-2010

The problems you have are the thread cpi will vary depending on the code it is executing and the data upon which it processes combined together with the activities of the other threads on the system. So what ever you calculate for the cpi will differ. Then when you relocate a thread or groups of threads you will experience a temporaryshortperiod of low cpi value while caches get repopulated. Then afterwards the cpi values recalculated are different than before the relocation.When this means is your application needs a relatively long run time (minutes, hours, days...)and your load balancing would have to be spread out to frequencies ofmiliseconds, seconds, ...

As stated in first paragraph, the cpi's calculated under one set of threadassignments most likely will differ for a different set of thread assignments, therefore you might wish to keep a history of configuration verses cpi, then use that history to estimate optimal distribution. After distribution, sample until cpi's level off, then record into history. This is what you would expect the O/S to do when threads are not pinned. Back in the1980's when I designed and wrote a multi-processor operating system these considerations were factored into the scheduling. Apps were not multi-threaded but processes were scheduled and load balanced this way.

Your efforts of tuning may go astray should the load balance change throughout the run of the application. As recommended earlier, if this is a compute bound application, consider using a tasking based threading environment (e.g. TBB) as opposed to a traditional threading environment (pthreads). It is difficult to state a priori as to which technique will be better for a given application.

Jim Dempsey

kewenpan · ‎04-09-2010

Thank you ,Jim Dempsey.
Recently ihavestudied load balance algorithm in multi-core archtitecture.
It'swell known that static load balancingalgorithm will benotadapted for fine-grainedscheduling.
Static load balancing also can not make full use of thread level paralling and not achieve high level of balancingin multi-core architecture,exspecially not adjusting load balancing by specific environment when system load varies.
I want to study heuristic algorithm which can dynamically adjust load in the light of real environment to make load balance on each core .
Hope thateveryone can give me some advice.
thank you for your reading.

robert-reed · ‎04-09-2010

If you're interested in heuristic scheduling algorithms, you might want to take a look at the open source for Intel Threading Building Blocks which, with its task stealing structure and support for task subdivision in some frequently used parallel control constructs like parallel_for and parallel_reduce, offers an example of a dynamically load balancing system. Check out the TBB downloads page for access.

I'm less thrilled by the thought ofbasing process scheduling decisions in the kernel on past "CPI" performance. CPI or Cycles Per Instructions retired is a ratio often used to characterize performance, and almost as often misused. Too many people focus on the ratio itself rather than the components that comprise it. The classic counter-argument is to consider the vectorization of some floating point operations, for example with Intel's SSE instructions. In a typical case you might see the number of cycles required to complete a certain amount of work go down with vectorization, but the number of instructions retired go down even further (which makes sense, replacing individual floating point instructions with vector instructions that handle multiple floats simultaneously), but the net effect might be to see an improvement in performance coupled with an increase in CPI! Using CPI alone as input to a process scheduler seems to be fraught with such complications.

More interesting to me are efforts to look at other inputs to the scheduler.I've heard ofworkbeing done at Berkeley for example, to characterize the cache footprint of various processes and schedule those on the available HW threads based on those footprints: for example, processes with small cache footprint might be scheduled together to share the available cache while another process that is a cache hog would get its own, private HW thread, the idea being to reduce cache thrashing between processes. Another Berkeley scheduling idea is something called Harts and Lithe, which is a prototype for managing processor threads as a system-wide resource. This would get at the heart of load balance in a multi-model threading model, providing one means for example to share process threads between interoperating threading libraries like OpenMP and TBB. (Microsoft is doing its own work in this area, the first fruits of which should be appearing as part of Visual Studio 2010.)

kewenpan · ‎04-12-2010

Thank you,Rober Reed.

Your suggestions are very important for me to further study kernel scheduling.

I am studying on cache footprint for these days and would try to find some methods to charaterize cache

footprint.Maybe i am a little fuzzy about the concept of cache foorprint.

My understanding on cache footprint is as follows:

first, cache footprint is the frequency of accessing cache entry(TLB or L1 data cache) ????

for example ,Assure we access B object. we must fetch the pointer of B in A object before accessing B .

if both A and B are in the same cache page , we can directly find the pointer of B in TLB.The frequency of cache footprint is "1".if not ,the frequency is "2".

if both A and B is in the alignment space,we also can get the same result as the above only by accessing L1 data cache not TLB.

The smaller cache footprint of accessing cache entry in executing a task ,the higher the cache hit ratio ?

The smaller cache footprint of accessing cache entry in executing a task,the less the cache and the main memory interactive ?

The smaller cache footprint of accessing cache entry in executing a task,the faster the task executed ?

the above is my understanding about cache footprint,please correct my error.

I still have some questions which make me confused,please give me some explantation.

first. the sentence "processes with small cache footprint might be scheduled together to share the available cache while another process that is a cache hog would get its own, private HW thread, the idea can reduce cache thrashing between processes."?

what is the mean of cache hog?

It means cache hot or larger cache footprint or other means?

could you take a example about the above ?i don't exactly understand.

thank you for your reading and looking forward your reply.