Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

How to distinguish threads that belong to different tasks?

kewenpan
Beginner

When load balancing, we must migrate threads from the busiest core to another core.

Which threads should be migrated?

If we want a high cache hit ratio, we should migrate the threads that are not related to most of the other threads.

But how do we characterize threads, or how do we know that a thread is not related to the others?

5 Replies
Dmitry_Vyukov
Valued Contributor I
If you are talking about an OS, then I believe current policies are mostly random, i.e. they migrate a random runnable thread from the busiest (or a random) core to an idle core.
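
A toy model of the "mostly random" policy described above might look like the following. It is only a sketch: the `run_queues` dictionary and both function names are made up for illustration, and it does not reflect how any particular kernel is implemented.

```python
import random


def pick_random_migration(run_queues, rng=random):
    """Return (thread, source_core, idle_core), or None if nothing to do.

    run_queues is a hypothetical structure that maps a core id to the
    list of runnable threads currently queued on that core.
    """
    idle_cores = [core for core, queue in run_queues.items() if not queue]
    if not idle_cores:
        return None
    busiest = max(run_queues, key=lambda core: len(run_queues[core]))
    if len(run_queues[busiest]) < 2:
        return None  # moving the only thread of a core gains nothing
    # No knowledge about thread relationships is used: the pick is random.
    thread = rng.choice(run_queues[busiest])
    return thread, busiest, rng.choice(idle_cores)


def migrate(run_queues, rng=random):
    """Apply one random migration in place; return True if a thread moved."""
    choice = pick_random_migration(run_queues, rng)
    if choice is None:
        return False
    thread, source, target = choice
    run_queues[source].remove(thread)
    run_queues[target].append(thread)
    return True
```

For example, with `{0: ["a", "b", "c"], 1: [], 2: ["d"]}` one call to `migrate` moves one of `a`, `b`, `c` to core 1, without regard to which of them share data.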

If you are talking about your own scheduling library, then just ask the user for the dependencies between threads. There are no other sane methods for that.
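
As a rough sketch of what "ask the user" could mean in a scheduling library: the user declares which threads are related, and the library migrates the thread with the fewest declared relations on the overloaded core. The class and method names here are hypothetical, not taken from any existing library.

```python
from collections import defaultdict


class DependencyRegistry:
    """Hypothetical registry of user-declared relations between threads."""

    def __init__(self):
        self._related = defaultdict(set)

    def declare(self, thread_a, thread_b):
        """Record that two threads share data (the relation is symmetric)."""
        self._related[thread_a].add(thread_b)
        self._related[thread_b].add(thread_a)

    def related_on_core(self, thread, core_threads):
        """Count the declared relatives of `thread` that run on the same core."""
        others = set(core_threads) - {thread}
        return len(self._related.get(thread, set()) & others)

    def least_related(self, core_threads):
        """Pick the migration candidate: fewest relatives on this core.

        Ties are resolved in favour of the earliest thread in core_threads.
        """
        return min(core_threads,
                   key=lambda t: self.related_on_core(t, core_threads))
```

With `declare("a", "b")` and `declare("b", "c")`, calling `least_related(["a", "b", "c", "d"])` picks `d`, because it has no declared relatives on that core.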
kewenpan
Beginner

I am talking about an OS.

It is true that current policies choose the threads to migrate to another core or an idle core more or less at random.

Are there any good policies for choosing which threads to migrate when the OS performs load balancing?

jimdempseyatthecove
Honored Contributor III
Are you speaking from the frame of reference of

1) Your software is an operating system
or
2) Your software is multi-process, with each process single-threaded
or
3) Your software is multi-process, with each process multi-threaded
or
4) Your software is single-process and multi-threaded

If your application is an MPI-like application, then your situation is likely 3). In that case, is your system one SMP "box" or multiple "boxes" linked via some means (fiber, Ethernet, etc.)?

If your application is pthreads/Win32 threads/OpenMP/TBB/Cilk/etc.-like, then your situation is likely 4), in which case you are running on one SMP "box".

Assuming your application is 4)

Then, 4.a) is your system running additional user applications?
Or 4.b) is your system dedicated to running your application?
Or 4.c) some blend of 4.a and 4.b that is beyond your control?

Assuming the simple case 4.b) (this is the circumstance in which you have the most control)

Are you
4.b.a) using affinity pinning
or
4.b.b) not using affinity pinning?

Are you
4.b.?.a) Oversubscribing threads? (typical for pthreads/Win32 threads)
or
4.b.?.b) Subscribing one thread per hardware thread? (Typical for OpenMP or TBB or Cilk)

From your prior posts, I assume you are oversubscribing threads (pthreads type of app).

Assuming oversubscription (pthreads type):

What means (API) do you have to determine which cores are under-utilized?
Can you read the null thread (or idle process) run-time _for_that_core?
Can you measure the temperature _for_that_core?
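
One possible way to get the per-core idle time, assuming a Linux system (the thread does not say which OS is in use), is to sample `/proc/stat` twice and compare the per-core counters. The sketch below only parses text in that format; the function names are illustrative.

```python
def parse_cpu_times(stat_text):
    """Return {core_id: (idle_ticks, total_ticks)} from /proc/stat-style text."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Per-core lines start with "cpu<N>"; skip the aggregate "cpu" line
        # and unrelated lines such as "intr" or "ctxt".
        if not fields or fields[0] == "cpu" or not fields[0].startswith("cpu"):
            continue
        values = [int(v) for v in fields[1:]]
        # Field order: user nice system idle iowait irq softirq steal ...
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        # Guest time is already included in user time, so sum only the
        # first eight fields to avoid counting it twice.
        times[int(fields[0][3:])] = (idle, sum(values[:8]))
    return times


def utilization(before, after):
    """Return {core_id: busy fraction in [0, 1]} between two snapshots."""
    result = {}
    for core, (idle_after, total_after) in after.items():
        idle_before, total_before = before.get(core, (0, 0))
        elapsed = total_after - total_before
        idle = idle_after - idle_before
        result[core] = 1.0 - idle / elapsed if elapsed > 0 else 0.0
    return result
```

On Linux, the two snapshots would come from reading `/proc/stat`, sleeping for the sampling interval, and reading it again; the cores with the lowest busy fraction are the under-utilized ones.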

What means (API) are you willing to use, and how much effort (programming) are you willing to spend, to instrument your code so that it uses the debug capabilities of the processor to read the cache hit/miss statistics?

Note, the time to instrument your app will generally exceed the time to run your app under test using VTune.

Once you have the capability of reading the cache hit/miss statistics, you can run permutations of your thread pinnings and measure the effect on the cache hit/miss ratios. If you have a large number of threads, it will be unworkable to run through all the permutations. In this case you will want to look at favorable and unfavorable pairings, i.e. which two threads see the best improvement from sharing the same cache, and which two threads see the worst effect from sharing the same cache.
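
To give a feel for the numbers, here is a small count under the simplifying assumption that each thread may be pinned independently to any hardware thread. It ignores the symmetry between equivalent cores, so it is an upper bound on the distinct placements, and the function names are illustrative only.

```python
from math import comb


def placement_count(n_threads, n_hw_threads):
    """Number of ways to pin each software thread to some hardware thread."""
    return n_hw_threads ** n_threads


def pair_count(n_threads):
    """Number of distinct two-thread pairings to measure."""
    return comb(n_threads, 2)
```

For 16 software threads on 8 hardware threads, `placement_count(16, 8)` is 8^16 = 281,474,976,710,656, while `pair_count(16)` is only 120, which is why measuring pairs is the workable route.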

You can do this by running through selections of two threads (the two threads under test, and all the other threads). Pin all the other threads to one core. Pin the two threads under test to two hardware threads within:

a) L1 of a different socket if available
b) Different L1 of L2 of a different socket if available
c) Different L2 of L3 of a different socket if available
d) L1 of a different die within your one socket if available
e) Different L1 of L2 of a different die within your one socket if available
f) L1 of different L2 in same socket

After you get the timing for that pair of threads, select another pairing.
Once you find the two threads with the best pairing (and the pairing placement), pin those two threads back to the same core as all the other threads and exclude them from further pairing. Then repeat the favorable pairing measurements for the remaining threads, and repeat again with the next remaining set, until done.
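
The selection loop above can be sketched as a greedy procedure. The `measure` callback is hypothetical: it stands for "pin this pair as described, run the test, and return a score where higher is better" (for example, the improvement in cache hit ratio). Nothing here does the actual pinning or measurement.

```python
from itertools import combinations


def greedy_pairs(threads, measure):
    """Repeatedly pick the best-scoring pair among the remaining threads.

    Returns (pairs, leftover): the chosen pairs in the order they were
    selected, and any thread left unpaired when the count is odd.
    """
    remaining = list(threads)
    pairs = []
    while len(remaining) >= 2:
        # Every candidate pair of the remaining threads is measured again;
        # a real harness would cache results, since each measurement is a
        # full test run.
        best = max(combinations(remaining, 2), key=lambda pair: measure(*pair))
        pairs.append(best)
        # Exclude the chosen pair from further pairing.
        remaining = [t for t in remaining if t not in best]
    return pairs, remaining
```

Because each round commits to the best pair it sees, the result is a heuristic and not guaranteed to be the best overall set of pairs.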

Note, you are still not finished.

Now you have the pairs and their placement restrictions.

Repeat the above process using the pairs of threads, as you had done with the individual threads.

Then repeat on the quads of threads.

Then ...

After this is all done, you will have a reasonably close approximation to the best thread placement, but you will not have tested all permutations.

As you see, this is a lot of work.

If you let the O/S do the scheduling with un-pinned threads, it can either benefit or harm your performance.

When your application varies its behavior over time, the complexity of the problem increases further.

Jim Dempsey

TimP
Honored Contributor III
This is the purpose of user-set affinities such as KMP_AFFINITY, which may be implemented via an OS scheduler facility such as taskset. Users are responsible for avoiding affinity conflicts among jobs and for leaving enough cores open for single-threaded jobs.
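
For illustration, setting an affinity from inside a program could look like the sketch below. It assumes Linux, where Python exposes `os.sched_setaffinity`, the same kernel affinity mechanism that `taskset` relies on. The helper name is made up, and on platforms without that call it simply returns `None`.

```python
import os


def pin_caller_to_cpu(cpu):
    """Pin the caller (pid 0) to a single CPU and return the previous mask.

    Returns None on platforms that lack os.sched_setaffinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    previous = os.sched_getaffinity(0)
    if cpu not in previous:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(previous)}")
    os.sched_setaffinity(0, {cpu})
    return previous
```

Keeping the returned mask lets the caller restore the original affinity later with `os.sched_setaffinity(0, previous)`.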
Dmitry_Vyukov
Valued Contributor I
Quoting kewenpan

I am talking about an OS.

It is true that current policies choose the threads to migrate to another core or an idle core more or less at random.

Are there any good policies for choosing which threads to migrate when the OS performs load balancing?

The OS has little information on which to base well-founded scheduling decisions. At most it can assume that the threads of a process are dependent. Techniques to extract more precise information are inaccurate and/or costly, so AFAIK it's not worth it.

If a user wants really smart scheduling, he must take scheduling into his own hands, i.e. do it manually in user space.

At least, that is the current state of the art.
