Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Application thread layout -- optimal binding of threads to CPUs (cores)

gallus2
Beginner
I'm looking for general rules for how threads should be assigned (affinitized) to processing units. Even though I can't cite any scientific background, I think I can safely state that when a high-performance, multithreaded application is run on a multicore machine, its threads should somehow be affinitized to the machine's cores. In my experience the OS scheduler is not capable of exploiting such a machine's full potential.

The following article:
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1430575
investigates a relatively simple scenario with 2 cores. Today, we have:
- multicore and multisocket architectures,
- multilevel caches with different levels shared between different cores,
- NUMA.

I'm looking for any clues on how to design and run a high-performance application on such complex architectures. Can anyone share?
14 Replies
Tudor
New Contributor I
Quoting - gallus2

As far as I know, the OS scheduler "does its best" to keep threads on the same processor when their turn comes to run, in order to exploit the caching of thread data.
You can also set the thread affinity in your application manually.
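For example, on Linux this can be done with pthread_setaffinity_np. A minimal sketch (the CPU index 2 and the helper name pin_current_thread are only illustrative choices, not from any particular library):

// Minimal sketch: pin the calling thread to one logical CPU on Linux (glibc).
// Compile with -pthread.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

int pin_current_thread(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);                 // allow this thread to run only on 'cpu'
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
    return rc;
}

int main()
{
    pin_current_thread(2);              // e.g. bind the main thread to logical CPU 2
    // ... the thread's work goes here ...
    return 0;
}

On Windows the equivalent call is SetThreadAffinityMask.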
gallus2
Beginner
Quoting - Tudor

My experiences with the OS scheduler are disappointing. You can try it yourself: seeing a thread that needs 80% of a CPU get assigned to a CPU that is already 80% busy is practically guaranteed. The scheduler behaves as if it didn't know what it's doing. AFAIK, in the HPC world setting affinity manually is done on a daily basis.

PS. My tests were quite simple. The scheduler would probably have done an even worse job if multisocket, multicore, and cache-sharing issues had been taken into account. I see the scheduler as an incompetent joke for professional high-performance work.

PS2. The OS was Linux, kernel version ~2.6.18.
Dmitry_Vyukov
Valued Contributor I
General guidelines are:
- place threads that communicate with each other as close as possible, preferably on cores that share an L1 cache (this reduces communication latency and cost)
- place threads that do independent work as far apart as possible, preferably on different NUMA nodes (this leaves more resources, particularly memory bandwidth, to each thread)
- place threads that work with the same data on a single NUMA node
- place threads that do different kinds of work (integer vs floating-point computations, or computation vs memory access) on HT-sibling logical processors (this allows better resource utilization)

Note that in real life threads usually do some communication, some independent work, some work with common data, some work with private data, some different kinds of work and some identical kinds of work, all at the same time.
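One way to act on these rules on Linux is to read the kernel's topology description under sysfs and build affinity masks from it. A minimal sketch; the sysfs path is Linux-specific and the helper name ht_siblings_of is mine:

// Minimal sketch (Linux-specific): find which logical CPUs are HT siblings of
// a given CPU by reading the kernel's sysfs topology files. The returned list
// (e.g. "0,4") can then be parsed into a cpu_set_t for affinity calls.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

std::string ht_siblings_of(int cpu)
{
    std::ostringstream path;
    path << "/sys/devices/system/cpu/cpu" << cpu
         << "/topology/thread_siblings_list";
    std::ifstream f(path.str().c_str());
    std::string list;
    std::getline(f, list);              // empty string if the file is missing
    return list;
}

int main()
{
    std::cout << "CPU 0 shares a core with: " << ht_siblings_of(0) << "\n";
    return 0;
}

The neighbouring core_siblings_list file and the node*/cpulist files under /sys/devices/system/node give the coarser package- and NUMA-level groupings needed for the other rules.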

Dmitry_Vyukov
Valued Contributor I
Quoting - Dmitriy Vyukov

However, the better way is to create one thread per logical processor and strictly bind each thread to the corresponding logical processor. That way the threads effectively represent processors. Then do manual scheduling of lighter-weight tasks/work items onto those threads according to the above rules. This gives you more flexibility.
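A minimal sketch of that scheme, assuming Linux with std::thread backed by pthreads; the names (Worker, post, and so on) are illustrative, not taken from any particular library:

// Sketch: one worker thread per logical CPU, each strictly bound to its CPU,
// pulling tasks from its own queue so the application decides on which
// processor each task runs. Compile with -pthread.
#include <pthread.h>
#include <sched.h>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Worker {
    int cpu = 0;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    std::thread th;

    void run() {
        cpu_set_t set;                       // bind this worker to its logical CPU
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return done || !tasks.empty(); });
                if (tasks.empty()) return;   // finished and drained
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();                          // runs on this worker's CPU
        }
    }
    void post(std::function<void()> t) {
        { std::lock_guard<std::mutex> lk(m); tasks.push(std::move(t)); }
        cv.notify_one();
    }
    void stop() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
        th.join();
    }
};

int main() {
    unsigned n = std::thread::hardware_concurrency();
    std::vector<Worker> workers(n);
    for (unsigned i = 0; i < n; ++i) {
        workers[i].cpu = static_cast<int>(i);
        workers[i].th = std::thread(&Worker::run, &workers[i]);
    }
    // The application now places work itself, e.g. co-locating two
    // communicating tasks on the same pinned worker:
    workers[0].post([] { /* produce */ });
    workers[0].post([] { /* consume, hot in the producer's cache */ });
    for (auto& w : workers) w.stop();
    return 0;
}

The placement decisions (which worker gets which task) then encode the rules above: tasks that share data go to workers pinned to cores sharing a cache, independent tasks go to workers on different NUMA nodes.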


TimP
Honored Contributor III
Quoting - gallus2
It's well known that Intel has been betting on Windows 7 and Server 2008 beta 2 for scheduling on Xeon 5500 and multi-core HyperThread. Even with the new Windows schedulers, there appears to be an expectation that thread yield time (KMP_BLOCKTIME in Intel OpenMP) will be set to a high value to maintain affinity, which of course prevents satisfactory use of more threads than cores and may interfere with power saving strategies.
If what you mean by "the OS" is an obsolete version of Windows, yes, that has been a sore point at least since the time of the Pentium D. Among Linux users, I don't think there's such a high expectation for obsolete kernels to suddenly learn how to optimize for CPUs introduced after the software was designed, nor for some kind of magic to substitute for features like taskset and numactl. The increasing support for Linux by hardware vendors was in part a result of the development model facilitating earlier software support of new hardware.
Schedulers can't figure out on their own which threads should be assigned to the same cache and memory banks. Thus, OpenMP and MPI implementations need affinity controls such as the Intel implementations provide with KMP_AFFINITY and I_MPI_PIN_DOMAIN. You may have noticed a renewed effort to expand the use of OpenMP beyond HPC.
Dmitry_Vyukov
Valued Contributor I
Quoting - tim18
It's well known that Intel has been betting on Windows 7 and Server 2008 beta 2 for scheduling on Xeon 5500 and multi-core HyperThread.

Tim,

Doesn't optimal/good scheduling have something to do with what the threads actually do (compute: what computations? access memory: what memory? communicate: with whom? etc.)?

IMHO the OS scheduler will never be able to figure that out, and thus will be limited to only "satisfactory" scheduling.

Well, yes, there is some academic research going on out there on using hardware performance counters to figure out what threads actually do (what memory they access, whom they communicate with). But I do not think that will have any relation to real life for at least the next decade.

TimP
Honored Contributor III
Quoting - Dmitriy Vyukov

Dmitriy,
Yes, that's the point of the affinity environment variables.
In the absence of hints from the application, such as those the environment variables provide, the scheduler is responsible for spreading the work out evenly, and for preferring to resume a thread on the same memory and cache bank. As you say, it can't optimize for data sharing among threads.
One might envision some kind of dynamic optimization according to cache-line sharing for an exceptionally well-behaved application. As you say, there is nothing close to that at present.
Tim
jimdempseyatthecove
Honored Contributor III
Quoting - tim18

Tim,

First, it was a pleasure to meet you at IDF. Unfortunately we did not get to spend enough time together; maybe next year.

From my experience, the O/S scheduler can only be cognizant of an application's thread affinity requirements at the grossest, or coarsest, level. And, for the time being, having the O/S scheduler redirect threads from core x to core y will almost always be less efficient than having a task-pool system where the application can hint at affinity preferences or requirements for each enqueued task. The application writer has foreknowledge of the data access patterns within the application. With this knowledge, the programmer can partition the application into tasks, sub-tasks, sub-sub-tasks, etc., and, given the right tool, provide affinity hints as to how best to distribute those tasks amongst the available cores/HW threads. Giving this level of control to the programmer was a principal design consideration in QuickThread (www.quickthreadprogramming.com).

With the proper threading tool, it is possible to handle situations, and state affinity preferences, such as:

"I have just finished a section of code and the data is present in my L1 cache; the next task has a preponderance of integer computations, and the loop data requirements are on the order of the size of L1. The requirements are such that, should my HT sibling be available, I would prefer to split the task between my thread and my HT sibling thread. Otherwise, should the HT sibling be busy, do not split and enqueue a task; let me make the call directly."

The above is rather a large mouthful of words to state: depending on the availability of my HT sibling, either split the domain and enqueue half to my sibling (and call the other half directly), or call the whole task directly. "Call" meaning a simple function call (i.e. no task enqueue and no task-enqueue overhead).

With QuickThread, an undirected (unqualified) parallel_for might look like

parallel_for(0, nBodies, Foo, AllBodies);

Somewhat like the parallel_for in TBB, except that QuickThread doesn't require that funky-looking iterator, nor an additional class object with an execute() function.

Whereas the parallel_for that depends on the availability of the HT sibling would look like

parallel_for(Waiting_L1$, 0, nBodies, Foo, AllBodies);

The associativity and disassociativity qualifications of the desired task affinity relationships have been distilled into a single (optional) attribute that can prefix (and qualify) the parallel_... templates.

As you would expect, HW thread selection/direction includes:

L0$ = To self
L1$ = Include all threads sharing my L1 cache (HT siblings, if present)
L2$ = Include all threads sharing my L2 (usually 2 or 4 with HT threads)
L3$ = Include all threads sharing my L3 (when L3 is present, else sharing L2)
M0$ = Include all threads sharing my NUMA node
M1$ = Include all threads sharing my NUMA node plus all threads 1 NUMA hop away
M2$ = ... 2 NUMA hops
M3$ = ... 3 NUMA hops

As seen in the parallel_for example, you can modify (flag) the qualifier with:

Waiting_, to select based on the availability of threads.

The Ln$ and Waiting_Ln$ forms are used for "hot in cache" situations, where the current task has just completed some work and the next phase (task) is to perform additional work in which all workers can take advantage of the data your task brought into cache.

There are other situations where the programmer knows that the data for the next task is not likely to be in a cache, or where the caching requirements are such that you are willing to suffer a single memory fetch for the benefit of having a larger team sharing a particular cache level. For this the programmer can use

parallel_for(NotInCache_L3$, 0, nBodies, Foo, AllBodies);

This indicates that the data for the slices of Foo is not expected to currently reside in a cache, and that it would be beneficial to run the task Foo with a team of threads using a single L3 cache, within the socket containing the most available threads sharing that L3 cache. Furthermore, you want to schedule all threads within the selected L3 in the event that the busy thread(s) within that L3 finish soon and become available to take on your next task.

Now, should you want to exclude busy threads in the socket whose L3 has the most available threads, you can do that too:

parallel_for(Waiting_NotInCache_L3$, 0, nBodies, Foo, AllBodies);

As Dmitriy stated, sometimes a thread team works best when the data is shared within a cache (level), and some tasks work best with a team operating in separate cache levels.

An example of distributing work amongst separate caches would be

parallel_for(OneEach_L2$, 0, nBodies, Foo, AllBodies);

The above will distribute slices of the array (container) AllBodies to the function Foo as one task scheduled per L2 cache on the system. On a system with 2 processors, each with 4 cores, 2 L2 caches per processor, and each core with HT:
The domain of the parallel_for (0:nBodies-1) is split 4 ways (one for each L2 cache on the system); 3 tasks are enqueued and one direct call is made to Foo (with one of the 4-way splits). Each of the 3 task enqueues has a preference for any one of the 4 HW threads sharing the respective L2 cache. And in the event that none of those HW threads can get to the task in time, threads sharing other L2 caches are permitted to cross over and steal the dangling task.

QuickThread provides a simple programming technique for specifying that work be kept close together or spread apart.

The selection capability of QuickThread includes additional qualifiers, such as:

Exclude the current thread from slicing; exclude the current thread and all threads sharing the current thread's L2 cache; etc.
Enqueue to an I/O-class thread as opposed to a compute-class thread.
Enqueue tasks in an explicitly FIFO manner (as opposed to the default quasi-LIFO).
Enqueue a task only upon completion of other task(s).
Enqueue a task to a specific thread number.
And other unusual selection requirements.

You might think that all of this selection capability would overburden the task scheduler.
Surprisingly, this is not the case. Task enqueuing in QuickThread is light-weight and more often than not shows superior performance when compared to other task-enqueuing toolkits.

Information on QuickThread can be found at www.quickthreadprogramming.com

Jim Dempsey
gallus2
Beginner
Interesting library. I'm not a professional whose opinion should carry significant weight. However, I haven't seen any library besides QTP that so bravely faces the reality we are currently in, a reality best described by the L0$..M3$ enumeration above.
Khaled_Abbad
Beginner
Hi all, I'm new here but I will try to be an active member here.

Many thanks, all.
gallus2
Beginner
Is there a library similar to QTP (in the area of memory and cache layout issues) for Java?

Dmitry_Vyukov
Valued Contributor I
Quoting - gallus2

To the best of my knowledge, there are no such libraries for Java. Actually, there are no such libraries for C/C++ either; Jim's QT must be the first.

Java's model is that you have an infinite number of processors, the topology is completely uniform, and there are no caches. So it's basically impossible to create such a library.

gallus2
Beginner
Quoting - Dmitriy Vyukov

The false sharing issue is also being investigated in the Java world, for example here. Maybe they would also benefit from a more hardware-aware library.
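As a side note, the effect itself is easy to demonstrate in C++. A minimal sketch of the usual padding workaround (the 64-byte line size is an assumption about the target CPU):

// Sketch: two counters updated by two threads. Packed back to back they would
// share one cache line and ping-pong between cores (false sharing); padding
// each counter out to a full cache line avoids that.
#include <atomic>
#include <cstddef>
#include <thread>

const std::size_t kCacheLine = 64;      // assumed line size of the target CPU

struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
    char pad[kCacheLine - sizeof(std::atomic<long>)];   // keep neighbours off this line
};

PaddedCounter counters[2];              // each counter now owns its own cache line

void bump(int i)
{
    for (long n = 0; n < 10000000L; ++n)
        counters[i].value.fetch_add(1, std::memory_order_relaxed);
}

int main()
{
    std::thread a(bump, 0), b(bump, 1);
    a.join();
    b.join();
    return 0;
}

In Java the same trick has to be approximated with spare fields, which is exactly the kind of hardware detail a more hardware-aware library could hide.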

PS. Isn't GCD the most general and final solution to this kind of problem?

Dmitry_Vyukov
Valued Contributor I
Quoting - gallus2


Does GCD offer anything new at all? I don't see anything so far...
