How to simulate multi-core scheduling?

kewenpan · ‎04-07-2010

How do I simulate multi-core scheduling?

Where can I find information on multi-core simulation or simulators?

How do I use a multi-core simulator to run multi-core experiments?

Thank you for reading; I look forward to your advice.

Dmitry_Vyukov · ‎04-07-2010

It depends on what aspects you want to simulate.
You can see one example here:
http://groups.google.com/group/relacy
It's a simulator for a multi-core relaxed memory model and aggressive instruction interleaving.

kewenpan · ‎04-07-2010

I want to simulate how throughput changes with instruction execution order (instruction mix).
I want to verify whether changing the instruction execution order improves throughput on a multi-core architecture.
Thank you, Dmitriy Vyukov.

Dmitry_Vyukov · ‎04-07-2010

Ah, I see, that's not the type of simulation I've done. I believe processor vendors and academic researchers use the simulators you are talking about.
Try this:
http://lmgtfy.com/?q=processor+simulator

kewenpan · ‎04-07-2010

I am also curious about the view that performance could be improved by instruction mix.
However, I would like to ask you a question.
CPI reflects instruction delay to some degree, so I could use it as a first approximation for scheduling decisions. Is this idea feasible in theory?
I want to assign threads to different cores by CPI.
I think each core should run a mix of high-CPI and low-CPI threads, rather than having all the high-CPI threads on one core.
If all the threads on a core have high CPI, they incur long delays, and the throughput of that core drops.
I am not sure whether this is feasible, or how to simulate the idea.
Do you have any other advice on improving performance on multi-core architectures?
Thank you, Dmitriy Vyukov.

gaston-hillar · ‎04-07-2010

Hi kewenpan,

What's your goal with the simulation?
Are you trying to determine the best scheduler decisions?
Are you working with a specific programming language?

I think that if you provide more information about your objective, it will help people reading the forum to be more accurate.

Cheers,

jimdempseyatthecove · ‎04-07-2010

>>i want to assign the threads to different core by cpi.

CPI on multi-core, outside of a well-controlled environment such as an emulator, is a very slippery number, meaning it is very hard to pin down. The reason is that the activity on one hardware thread presents phase-varying interference to the other threads. As an example, consider one core with HyperThreading, each hardware thread running the same code but on different data.

run-1
0abcdefghijklmnopqrstuvwxyz
1 abcdefghijklmnopqrstuvwxyz

versus

run-2
0abcdefghijklmnopqrstuvwxyz
1 ..abcdefghijklmnopqrstuvwxyz

In run-1, where both threads are synchronized, the code may run slower than in run-2, where the code on thread-1 has a time-delay skew of ".."

Depending on the code and data, the above small skew can make a significant difference in execution time and thus CPI. IOW the CPI may vary significantly due to events beyond your control.

In non-HT situations you have similar situations with shared cache where the interference can help and/or hinder the threads sharing the cache.

In non-cache-sharing situations (e.g. separate L2 cache per core), memory bus interaction can be affected greatly by small skew differences between threads.
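
The skew effect described above can be illustrated with a toy model. This is only a sketch under made-up assumptions: two hardware threads run the same instruction stream, an 'M' marks a cycle in which a thread needs a shared resource (say, a memory port), and a cycle in which both threads need it counts as one conflict. The pattern string and the function name `count_conflicts` are hypothetical, and a real core behaves in far more complicated ways.

```python
def count_conflicts(pattern: str, skew: int) -> int:
    """Count cycles in which both threads want the shared resource.

    pattern: one character per cycle; 'M' means the thread needs the
             shared resource in that cycle, anything else means it does not.
    skew:    number of cycles thread 1 starts after thread 0.
    """
    conflicts = 0
    for cycle, op in enumerate(pattern):
        other = cycle - skew  # position of thread 1 in its own stream
        if 0 <= other < len(pattern):
            if op == "M" and pattern[other] == "M":
                conflicts += 1
    return conflicts


# Hypothetical instruction stream: one shared-resource access every fourth cycle.
stream = "M...M...M...M..."

for skew in (0, 2):
    print(f"skew={skew}: {count_conflicts(stream, skew)} conflicts")
```

In this toy stream the synchronized run collides on every access, while a two-cycle skew avoids all collisions, which is the kind of sensitivity that makes a measured CPI hard to pin down.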

I recently read a /. (slash dot) reference to an article
(http://www.google.com/patents/about?id=vYLJAAAAEBAJ&dq=refactoring+software)

Where IBM filed a patent on "A method for developing a computer program product includes: evaluating one or more refactoring actions to determine a performance attribute; associating the performance attribute with a refactoring action used in computer code; and undoing the refactoring action of the computer code based on the..."

To me, this is an invention already in the public domain, as it is a description of, or variation on, what genetic programming (algorithms) does. In genetic programming you use refactoring to produce code for a solution by randomly perturbing instruction sequences until the best solution is encountered. Best == correct .and. fastest. It is possible that this patent could be challenged based on prior art.

Setting aside the IBM patent issue, this automated way of refactoring instruction sequences could yield interesting results in performance improvements - assuming you have the computational resources and time to derive the best/better solution.
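
The perturb-and-keep-the-best idea can be sketched in a few lines. This is a plain random-swap hill climb, not a full genetic algorithm and not the method claimed in the patent. The cost function `stall_cost` is a made-up stand-in for a real timing measurement, the op letters are hypothetical, and the correctness check (respecting dependencies between instructions) is left out entirely.

```python
import random


def stall_cost(seq):
    """Toy cost: each pair of adjacent identical ops counts as one stall."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a == b)


def perturb_search(seq, steps=1000, seed=0):
    """Randomly swap two ops; keep the swap only if the cost does not get worse."""
    rng = random.Random(seed)
    best = list(seq)
    best_cost = stall_cost(best)
    for _ in range(steps):
        i, j = rng.sample(range(len(best)), 2)
        best[i], best[j] = best[j], best[i]
        cost = stall_cost(best)
        if cost <= best_cost:
            best_cost = cost
        else:
            best[i], best[j] = best[j], best[i]  # undo the swap
    return best, best_cost


start = list("LLLLAAAAMMMM")  # hypothetical ops: Load, Add, Multiply
result, cost = perturb_search(start)
print(stall_cost(start), "->", cost)
```

A real search would replace `stall_cost` with a measured run time and reject any ordering that changes the program's result.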

Jim Dempsey

kewenpan · ‎04-07-2010

Thank you, Jim Dempsey.
I have learned more about instruction sequences from your reply.
Suppose I have a multi-core architecture with four cores, each core running four threads. Each thread has a CPI value such as 1, 4, 8 or 12. Consider the following situations:
Threads are assigned by CPI value. If we assign threads using different combinations of CPI values, the performance gap will be large.

    core1        core2        core3        core4
a)  (1,4,8,12)   (1,4,8,12)   (1,4,8,12)   (1,4,8,12)
b)  (1,1,4,8)    (1,4,4,8)    (1,8,8,12)   (1,4,12,12)
c)  (1,1,1,4)    (1,4,4,4)    (8,8,8,8)    (12,12,12,12)
d)  (1,1,1,1)    (4,4,4,4)    (8,8,8,8)    (12,12,12,12)

I have read about a similar experiment in an article.
Threads are assigned to each core according to CPI value, and high-CPI and low-CPI threads are put together as much as possible. High-CPI threads have low pipeline requirements: they spend much of their time blocked on memory or long-latency instructions, so many functional units go unused. Low-CPI threads have high pipeline requirements. So the resources are better utilized when high-CPI and low-CPI threads are combined.
The throughput of a) and b) will be much larger than that of c) and d).
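
One simple way to build a mixed assignment like a) is to sort the threads by CPI and deal them out to the cores in turn. The sketch below assumes each thread's CPI is already known as a single number, which, as noted earlier in the thread, is hard to guarantee in practice. The function name `mix_by_cpi` is hypothetical, and the code only produces the assignment; it says nothing about whether throughput actually improves.

```python
def mix_by_cpi(cpis, n_cores):
    """Sort threads by CPI and deal them out round-robin, one per core
    in turn, so every core receives a spread of low and high CPI."""
    cores = [[] for _ in range(n_cores)]
    for i, cpi in enumerate(sorted(cpis)):
        cores[i % n_cores].append(cpi)
    return cores


# 16 threads: four each with CPI 1, 4, 8 and 12 (the values from the post).
threads = [1, 4, 8, 12] * 4
for core_id, assigned in enumerate(mix_by_cpi(threads, 4), start=1):
    print(f"core{core_id}: {assigned}")
```

With these inputs every core ends up with one thread of each CPI value, i.e. situation a).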

I want to find new methods of assigning threads to improve CPU performance.
I am not sure that CPI combination is a good method for improving performance, because, as Jim Dempsey said, CPI on multi-core is hard to pin down, and I agree.
Maybe it is very hard to find a viable metric for assigning threads in a scheduler.
I have recently been studying this aspect of scheduling on multi-core architectures.
I hope everyone can give me some advice on scheduling policies.
Thank you for reading; I look forward to your advice.

jimdempseyatthecove · ‎04-08-2010

Unless I misunderstand your description, I interpret your system description as

1 Processor
4 cores in processor
no Hyper Threading
4 software threads assigned to each core (through affinity pinning)

Please correct this misinterpretation should your system have

4 Processors
4 cores in each processor (16 cores total)
no Hyper Threading

Assuming 1 processor scenario:

What you are describing is an oversubscription situation (multiple threads per core). In a system with multiple processes (applications) you will have more threads than cores, and your scheduling options are somewhat beyond your control (inside the domain of the O/S thread scheduler).

In a system which is principally running one application, you, as the programmer, do have control over thread scheduling. And you have control over the threading model: many threads or tasking.

You have described a many-threads configuration where each core will run but one thread at a time, with threads being time-sliced and/or switched as a result of I/O delay. When the majority of the work to be performed is compute intensive, the many-threads-per-core approach incurs significant overhead in thread context switching (at each time-slice switch).

You can reduce this overhead by using a task model instead of a many-threads model. In a tasking system, the thread itself performs a relatively low-overhead switch from one task to the next.

Threading Building Blocks (TBB) and QuickThread (my product) are tasking systems. I recommend you consider a tasking system to improve load balance and reduce thread context switch overhead.

Because the task-to-task overhead is lower than a thread context switch, you can make your tasks finer grained, as well as use parallel loop constructs that are not available in a single-thread-per-functional-group situation.
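
The shape of a task model can be sketched with a fixed pool of worker threads that pull small tasks from a shared queue. This is only a structural illustration using Python's standard `concurrent.futures`; it is not the TBB or QuickThread API, the `work` function is a hypothetical placeholder, and in CPython the global interpreter lock means CPU-bound tasks like this one will not actually run in parallel.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def work(n):
    """Hypothetical small task: sum of squares below n."""
    return sum(i * i for i in range(n))


def run_tasks(sizes):
    # One worker thread per logical CPU instead of one thread per task.
    workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker takes the next task from a shared queue when it
        # finishes one, so no OS thread is created per task.
        return list(pool.map(work, sizes))


print(run_tasks([10, 100, 1000]))
```

The point of the design is that the number of OS threads stays fixed at the number of logical CPUs, however many tasks are submitted.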

The advice for how to best parallelize your programming challenge depends on your application's specific behavior. The experienced people responding to your queries can only provide you with general advice without knowing more about your application and system. If you have a pressing need, one of us would be available for consultation.

Jim Dempsey

kewenpan · ‎04-09-2010

Thank you, Jim Dempsey.
Your reply is exactly right.
My intended system is the following:
1 Processor
4 cores in processor
Hyper Threading
4 software threads assigned to each core (through affinity pinning)
But that matters less than your excellent explanation.
I am a newcomer and lack a good understanding of multi-core. I assure you your advice will be a great help going forward.
I will gradually learn more about the task model and the thread model in future study.
I know little about this area.
Could you recommend some material or a book about the task model and the thread model?

jimdempseyatthecove · ‎04-09-2010

I am sorry to say that I am not an avid reader of books and technical articles - I am more of a spontaneous code generation type of person. I suggest you try to solicit a list of good books from Dmitriy Vyukov, as I suspect he is one of the most avid readers of books on this subject. You will see Dmitriy on most of these forums and he may have a list of good books and articles for you to read.

Clay Breshears has a good book on parallel programming titled "The Art of Concurrency", published by O'Reilly (http://oreilly.com/catalog/9780596521530/). And check out James Reinders' book "Intel Threading Building Blocks", also O'Reilly. Although I haven't read Clay's book, the index indicates it is broader in subject material than Reinders', as Reinders' is structured as a tutorial for TBB.

Another thing to consider is your programming environment. Are you programming on a server with requirements of running other applications? Are you on a powerful workstation dedicated to your use? Are you on a general purpose desktop or notebook? Will the end application run on just your system or on a broad spectrum of systems? Will you have access to acceleration hardware (e.g. GPGPU)? Many of these factors will affect your approach to efficient programming.

My general suggestions (in addition to reading a few select books) are:

1) Learn how to code your serial (single-threaded) program to be as efficient as possible: it must be correct, and it ought to be fast.

2) Examine the serial code to see if it can take better advantage of the SIMD instruction sets.

3) Next, look at the program and data flow. There are two gross categories of programs: a) those that read through a long list of objects (files, records), processing each object once and then writing out results; and b) those that tend to have a group of objects that are iterated upon. a) might favor a parallel pipeline type of architecture, and b) might favor parallel loops. Each application is different, and you will find contrary results.
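
The pipeline shape mentioned for category a) can be sketched as a chain of stages connected by queues, each stage running on its own thread. This is a minimal illustration using only Python's standard library; the stage functions (`str.strip`, `str.upper`) and the helper names `stage` and `run_pipeline` are placeholders, and a real pipeline would use a tasking library and do far heavier work per record.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Apply func to every item from inbox and pass the result on."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(func(item))


def run_pipeline(records):
    q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
    stages = [
        threading.Thread(target=stage, args=(str.strip, q1, q2)),
        threading.Thread(target=stage, args=(str.upper, q2, q3)),
    ]
    for t in stages:
        t.start()
    for rec in records:
        q1.put(rec)
    q1.put(DONE)
    results = []
    while True:
        out = q3.get()
        if out is DONE:
            break
        results.append(out)
    for t in stages:
        t.join()
    return results


print(run_pipeline([" a ", " b "]))
```

Each record is processed once and flows through the stages in order, so while one record is in the second stage the next can already be in the first.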

Only after you progress through 1, 2, 3 should you then consider looking at augmenting the thread scheduling.

Jim Dempsey