When load balancing, we must migrate threads from the busiest core to other cores.
Which threads should be migrated to another core?
If we want a high cache hit ratio, we should migrate those threads which are least related to the other threads.
But how do we characterize threads, or how do we know that a thread is not related to the others?
If you are talking about your own scheduling library, then just ask the user for dependencies between threads. There is no other sane method for that.
I am talking about an OS.
It seems that current policies pick the threads to migrate to another core (or an idle core) essentially at random.
Are there good policies for scheduling threads when the OS performs load balancing?
1) Your software is an operating system
2) Your software is multi-process, with each process single-threaded
3) Your software is multi-process, with each process multi-threaded
4) Your software is single-process and multi-threaded
If your application is an MPI-like application then your situation is likely 3). In that case, is your system one SMP "box" or multiple "boxes" linked via some means (fiber, Ethernet, etc.)?
If your application is pthreads/Win32Threads/OpenMP/TBB/Cilk/etc.-like then your situation is likely 4). In that case you are running on one SMP "box".
Assuming your application is 4)
Then, 4.a) is your system running additional user applications?
Or 4.b) is your system dedicated to running your application?
Or 4.c) some blend of 4.a and 4.b that is beyond your control?
Assuming the simple case 4.b) (this is the circumstance in which you have the most control)
4.b.a) using affinity pinning
4.b.b) not using affinity pinning?
4.b.?.a) Oversubscribing threads? (typical for pthreads/Win32 threads)
4.b.?.b) Subscribing one thread per hardware thread? (Typical for OpenMP or TBB or Cilk)
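As an aside, the affinity pinning referred to in 4.b.a can be done from user space on Linux; here is a minimal sketch using Python's standard-library `os.sched_setaffinity` (the core number 0 is an arbitrary example):

```python
import os

# Minimal affinity-pinning helper (Linux-specific).
# Passing pid=0 applies the CPU mask to the calling thread/process.
def pin_to_cores(cores):
    os.sched_setaffinity(0, cores)

# Example: remember the original mask, pin ourselves to core 0,
# confirm the new mask took effect, then restore.
original = os.sched_getaffinity(0)
pin_to_cores({0})
assert os.sched_getaffinity(0) == {0}
pin_to_cores(original)
```

In a C/pthreads program the equivalent would be the (Linux, non-portable) `pthread_setaffinity_np` call, or `SetThreadAffinityMask` on Windows.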
From your prior posts, I assume you are oversubscribing threads (pthreads type of app).
Assuming oversubscription (pthreads type):
What means (API) do you have to determine which cores are under-utilized?
Can you read the null thread (or idle process) run-time _for_that_core?
Can you measure the temperature _for_that_core?
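On Linux, for instance, the idle-time question above can be answered by sampling the per-core idle counters in /proc/stat; a rough sketch (field layout per the proc(5) man page):

```python
# Parse per-core idle jiffies from /proc/stat (Linux-specific).
# Per proc(5), the per-cpu columns are:
#   user nice system idle iowait irq softirq ...
def per_core_idle():
    idle = {}
    with open("/proc/stat") as f:
        for line in f:
            # Per-core lines look like "cpu0 ...", "cpu1 ..."; skip the
            # aggregate "cpu " line and non-cpu lines.
            if line.startswith("cpu") and line[3].isdigit():
                fields = line.split()
                idle[fields[0]] = int(fields[4])  # 5th column = idle
    return idle

print(per_core_idle())
```

The core whose idle counter grows fastest between two samples is the most under-utilized one.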
What means (API) do you have, and how much effort (programming) are you willing to invest, to instrument your code to use the debug capabilities of the processor to read the cache hit/miss statistics?
Note, the time to instrument your app will generally exceed the time to run your app under test using VTune.
Once you have the capability of reading the cache hit/miss statistics, then you can run permutations of your thread pinnings and measure the effect on the cache hit/miss ratios. If you have a large number of threads it will be unworkable to run through all the permutations. In this case you will want to look at favorable and unfavorable pairings: i.e. which two threads see the best improvement from sharing the same cache, and which two threads see the worst effect from sharing the same cache.
You can do this by running through selections of two threads at a time (those two threads, and all the other threads). Pin all the other threads to one core. Pin the two threads under test to two hardware threads within:
a) L1 of a different socket if available
b) Different L1 of L2 of a different socket if available
c) Different L2 of L3 of a different socket if available
d) L1 of a different die within your one socket if available
e) Different L1 of L2 of a different die within your one socket if available
f) L1 of different L2 in same socket
After you get the timing for that pair of threads, select another pairing
Once you find the two threads with the best pairing (and the pairing placement), pin those two threads back to the same core as all the other threads, exclude them from further pairing, then repeat the favorable-pairing measurements for the remaining threads; repeat again with the next remaining set, until done.
Note, you are still not finished.
Now that you have the pairs and their placement restrictions, repeat the above process using the pairs of threads, as you did with the individual threads.
Then repeat on the quads of threads, and so on.
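The greedy pairing pass described above can be sketched as follows. Here `measure` is a stand-in for whatever you actually measure (run time, or cache-miss ratio from the hardware counters), and the placement candidates are the a)-f) options listed earlier; the toy cost function at the end is purely illustrative:

```python
import itertools

def best_pairings(items, placements, measure):
    """Greedy pass: repeatedly find the pair (and placement) with the
    lowest measured cost, lock that pair in, and remove it from the
    pool, until no pairs remain. measure(pair, placement) is a
    hypothetical user-supplied callback; lower return value is better."""
    remaining = list(items)
    chosen = []
    while len(remaining) >= 2:
        pair, place = min(
            ((p, pl)
             for p in itertools.combinations(remaining, 2)
             for pl in placements),
            key=lambda c: measure(c[0], c[1]),
        )
        chosen.append((pair, place))
        for t in pair:
            remaining.remove(t)
    return chosen

# Toy demo: label each thread with a number and pretend that pairing
# threads with similar labels is cheap (shared working set).
threads = {"A": 1, "B": 9, "C": 2, "D": 8}
cost = lambda pair, place: abs(threads[pair[0]] - threads[pair[1]])
print(best_pairings(list(threads), ["same-L2"], cost))
```

The pairs-then-quads repetition is then just calling the same function again with each chosen pair treated as a single unit (and a `measure` that pins both members of each unit together).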
After this is all done, you will have a reasonably close approximation to the best thread placement - but you have not tested all permutations.
As you see, this is a lot of work.
If you let the O/S do the scheduling with un-pinned threads, it can either benefit or harm your performance.
When your application varies its behavior over time, the complexity of the problem increases further.
The OS has little information on which to base well-founded scheduling decisions. At most it can assume that the threads of a process are dependent on each other. Techniques to extract more precise information are inaccurate and/or costly, so AFAIK it's not worth it.
If a user wants really smart scheduling, he must take scheduling into his own hands, i.e. do it manually in user space.
At least that's the current state of the art.