No, the TBB scheduler is not NUMA aware; it uses Cilk-style randomized work-stealing. Randomized work-stealing does not care about NUMA, but it does provide some theoretical guarantees on the time and space complexity of a parallel computation.
http://software.intel.com/en-us/blogs/2009/01/21/affinities-and-opportunistic-thread-scheduling/
http://www.quickthreadprogramming.com/
Hi, thanks for the tip about the QuickThread library. I'm reading the white paper and the manual, and will get back to you. However, I still find it unsettling that TBB doesn't take NUMA architectures into account. I suppose that trying to steal from the same NUMA node first shrinks the set of victim threads and might incur a penalty in the form of more unsuccessful steals, but that is largely problem dependent: the occupancy of the queues and the skew between nodes at any point in time vary from problem to problem, so the choice could be left to the user. Also, what kind of theoretical bounds does random stealing provide?
Regarding QuickThread, you can contact Jim Dempsey directly; you can find his email at http://www.quickthreadprogramming.com
Regarding NUMA and random stealing, I don't think I can retell all that material here, so please refer to "Space-Efficient Scheduling of Multithreaded Computations":
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.44.6193
and other Cilk papers:
http://supertech.csail.mit.edu/papers.html
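For reference, the bounds proved in those papers for randomized work-stealing of a fully strict computation are, roughly (with $T_1$ the total work, $T_\infty$ the critical-path length, $S_1$ the space of the serial execution, and $P$ the number of processors):

$$\mathbb{E}[T_P] \le \frac{T_1}{P} + O(T_\infty), \qquad S_P \le S_1 \cdot P.$$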
I agree with you that NUMA awareness would indeed help efficient scheduling, and not only NUMA awareness but also HyperThreading awareness, shared-cache awareness, etc. Actually, I incline towards the opinion that the only good thing about randomized stealing is that it lets one make statements like "Each thread is located independently at random, and hence, the random variable Zi has a binomial distribution with P lg P trials and success probability 1/P". IMVHO a more practical scheduler will tend to be locality/NUMA/cache/HT aware, with some supervision on top of that to prevent the possibility of very bad executions, so to say. I.e. a randomized scheduler penalizes all cases in the name of theoretical properties, whereas a practical scheduler should preserve locality in the common case and preserve the theoretical properties through some episodic low-overhead activity.
Interesting... If you don't mind a naive question (without first having read the available materials), what would you say to a heuristic that would first examine possible work from "close" threads in the hardware sense (cache layout), but limit how "far" the work can be in the software sense (task depth difference)? If no work is available from a random victim in the first candidate group within the eligible-depth window, the algorithm would examine another random victim but from both an enlarged candidate group and depth window. This may very well require experimentation and tuning (made more difficult because it would likely depend on what code is being executed), but some applications may warrant such an effort.
(Added) If it has been addressed in a reference, just stating that is OK for me.
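If it helps to make the idea concrete, here is a minimal sketch of such a widening search. Nothing below is TBB or Cilk API; the types, the grouping into "rounds" (HT siblings, then same socket, then everyone), and the depth window are made up for the example, and a real deque would of course need synchronization.
[cpp]#include <cstdlib>
#include <deque>
#include <vector>

// Minimal stand-ins for illustration only; these are not TBB or Cilk types.
struct Task { int depth = 0; };

struct Worker {
    std::deque<Task*> queue;                   // oldest task at the front
    std::vector<std::vector<Worker*>> groups;  // groups[0] = HT siblings,
                                               // groups[1] = same socket,
                                               // groups[2] = whole machine
    int current_depth = 0;

    // Steal the oldest task whose depth is within the given window.
    Task* try_steal(int thief_depth, int depth_window) {
        if (queue.empty()) return nullptr;
        Task* t = queue.front();
        if (std::abs(t->depth - thief_depth) > depth_window) return nullptr;
        queue.pop_front();
        return t;
    }
};

// Widening victim search: start with "close" workers and a narrow
// task-depth window, and relax both after each unsuccessful round.
Task* steal_with_widening(Worker& self, int base_depth_window) {
    for (std::size_t round = 0; round < self.groups.size(); ++round) {
        std::vector<Worker*>& group = self.groups[round];
        if (group.empty()) continue;
        int window = base_depth_window << round;             // widen eligibility
        Worker* victim = group[std::rand() % group.size()];  // random inside the group
        if (Task* t = victim->try_steal(self.current_depth, window))
            return t;
    }
    return nullptr;  // nothing found; the caller may retry or yield
}[/cpp]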
Raf,
this is one of the things that Keith Randall tried. He also tried things of the form: steal from ``close'' workers with probability X, and from ``far'' workers with probability (1-X). Tweak X. None of these attempts made much difference. He has a few comments at the end of his thesis, but academia being what it is, you cannot publish things that don't work, so everybody has to repeat the same mistakes all over again :-)
There are two fundamental problems, I think, that any scheme of this kind will incur. The first is that whatever work is in a queue close to you is not necessarily using data that is close to you, so there is no reason why stealing close work would be a good idea (except for reducing the cost of stealing itself). The second is that, even assuming that there is a correlation between the location of the work and the location of the data, this correlation is broken as soon as you perform one remote steal. Assume that worker W manages to steal a remote chunk of work. From that point on, every worker closer to W will prefer stealing from W, and will steal exactly the wrong thing (because W's work was using remote data to begin with).
I suspect that there is a thermodynamic analogy at play here, although I have no proof. If you drop ink in water, the ink diffuses as a Brownian motion, which is in the limit mathematically equivalent to randomized work-stealing from the nearest neighbors in a 3D grid. Over time, the ink diffuses into a maximum-entropy distribution that is indistinguishable from the distribution you would get by allowing Cilk-style stealing of ``work molecules'' (this would happen with ink molecules of infinite velocity). So the system always converges to the same equilibrium distribution; the only difference is that randomized work stealing converges to the equilibrium faster.
Ah?
"whatever work is in a queue close to you is not necessarily using data that is close to you"
Hmm...
Hi Matteo,
Well, I can't say that there is a practical way to produce a scheduler that will take advantage of NUMA, because I did not create one... ah, and have not even tried to create one.
My key point is: how can a *randomized* scheduler ever ensure that, or anything? It can avoid doing too badly, but it can't do too well either.
In what way? A random scheduler prefers small chunks just as much. I would say it actually tends to steal small chunks of work (I guess there are more of them), though sometimes, "with probability 1/P", it steals big chunks too. If one wants to steal big chunks, then steal big chunks; do not steal at random. What am I missing here?
Indeed. If I have 2 NUMA nodes and need to sort 2 independent arrays, I would prefer to dedicate the first NUMA node to the first array and the second NUMA node to the second array, and not to blend everything.
Regarding practical ways, what do you think of the following?
For HyperThreading (it somewhat applies to shared caches too): always try to steal from the HT sibling in FIFO order first. HT siblings do not have enough resources to work on separate tasks, so it's better for them to help each other with a single task. Steals between HT siblings are naturally cheap, so more frequent steals won't hurt.
Locality-aware stealing: steal in the following order: (1) prefer the nearest victim with a lot of work, (2) then any victim with a lot of work, (3) then the nearest victim, (4) then everything else. I.e. stealing is based on a distance/amount_of_work metric. Locality is good, stealing big chunks is good, right? So why not prefer that?
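A tiny sketch of what that distance/amount_of_work ranking might look like (purely illustrative names; in a real scheduler the advertised queue sizes would only be approximate, and an HT sibling with any work at all would be tried first, per the rule above):
[cpp]#include <cstddef>
#include <vector>

// Illustrative only: each worker advertises an approximate amount of queued
// work, and the runtime knows a topological distance to every other worker
// (HT sibling < shared cache < same NUMA node < remote node).
struct WorkerInfo {
    std::size_t approx_work;  // approximate number of queued tasks
    int distance;             // 0 = HT sibling, 1 = shared cache, 2 = same node, 3 = remote
};

// Prefer "a lot of work that is close": rank victims by
// amount_of_work / (distance + 1), so a near victim with a big queue wins.
int pick_victim(const std::vector<WorkerInfo>& workers, int self) {
    int best = -1;
    double best_score = 0.0;
    for (int i = 0; i < static_cast<int>(workers.size()); ++i) {
        if (i == self || workers[i].approx_work == 0) continue;
        double score = static_cast<double>(workers[i].approx_work)
                     / (workers[i].distance + 1);
        if (score > best_score) { best_score = score; best = i; }
    }
    return best;  // -1 means no visible work anywhere
}[/cpp]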
Work-first vs. help-first: try to balance between work-first and help-first scheduling for the following code (your example):
while (something)
cilk_spawn foo();
I.e. try to steal even more work in specific situations.
Actually "smart" scheduler may introduce some randomness too to not behave too bad in unexpected situations. For example, it can do "non-random" scheduling on even steals and do "random" scheduling on odd steals; that way probability of something good/bad is just 1/2P instead of 1/P for pure random scheduler; but who cares, it's just as we have 2 times more processors.
p.s. I still remember about the following conversation:
http://software.intel.com/en-us/forums/showthread.php?t=69681
Just do not have much time right now... and it's quite difficult to reverse-engineer the assembly (I've intercepted the Cilk++ "preprocessor" intermediate output file, which helps somewhat).
It's not surprising that it did not make much difference if he tested on a Fibonacci program :)
Well, there are a thousand and one ways a user may compromise performance; the runtime scheduler will never win that game.
However, the runtime scheduler may provide good performance for good programs. Don't you require busy leaves, etc.? For matrix multiplication, merge sort, and all that divide-and-conquer stuff, a close queue means close data. I think it's worth providing better performance for at least those programs.
After a remote steal, the stolen remote data becomes local data. So every worker close to W will do the right thing rather than the wrong one. What else can it do? It can't steal anything local (there is nothing local, since W did a remote steal). It can only steal even more remote work and cause even more misses; is that the right thing?
Since W has already stolen the remote piece of work X, W will have to fetch all of X sooner or later. So if workers close to W help with that, they do not increase the number of misses in any way (they do increase it if they do something else).
Well, ok, if we have the following distribution of work:
02111111111120
What is better:
1. Fill the first '0' by splitting the second '2', and fill the last '0' by splitting the last-but-one '2'.
2. Do it the other way around?
"If you drop ink in water, the ink diffuses as a brownian motion, which is in the limit mathematically equivalent to a randomized work-stealing from the nearest neighbors in a 3D grid."
It's not really ink so much as grains. And if you process grains faster in a localised setting (hyperthreaded core, cores sharing a cache) perhaps they locally disappear more quickly, counteracting the spread.
"So the system always converges with the same equilibrium distribution; the only difference is that randomized work stealing converges to the equilibrium faster."
But how do you know that that is not a negative thing...
I really should read some of that study material, though. Meanwhile, perhaps somebody who has actually tried to implement and exploit a NUMA-aware multithreading toolkit might decide to chime in?
Gera,
A distinction has to be made here between the numerical probability derived from the assumption of random selection within a subset and a deterministic selection within a scheduler. Depending on the circumstances (tasks in queue), the numerical probability may be significantly less than 1, whereas the deterministic selection within a scheduler may always yield 1.
When a programmer is given the means to provide an indication of locality preference for task(s) (don't care, preferred set, required set), as well as control over task teams, they can then pursue a coordinated effort at producing a solution.
The trick in doing this is providing a low-overhead means of performing the core and NUMA node selections. When I designed this capability into the QuickThread scheduler, special attention was given to making the overhead as low as possible. Most of the overhead consists of a table lookup and the write of a bitmask. Some table lookups take more time than others.
In the case where no preference is specified (e.g. parallel_for), the team selection is similar to TBB/Cilk: all threads, except that QT has a bias towards the current thread (the thread issuing the parallel_for). When you schedule under "hot in cache" situations, you would specify a team based on the threads sharing your thread's specified cache level (L1, L2, L3); this is a table lookup for a bitmask. For "not in cache" you might want to direct the scheduler to choose the L3 with the most idle threads. This incurs a little more overhead, but with the hope of payback in higher cache hits. Or you may have occasion for each object to contain a preferred or required thread selection. Or certain tasks may be restricted to particular cores (e.g. a task driving a GPU).
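Purely as an illustration of the "table lookup plus bitmask" idea (nothing below is QuickThread API; the names, table layout, and selection policies are invented for the sketch):
[cpp]#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-thread topology table, built once at startup: for each
// hardware thread, a bitmask of the threads sharing a given cache level.
enum CacheLevel { L1 = 0, L2 = 1, L3 = 2, kCacheLevels = 3 };

struct TopologyTable {
    // share_mask[level][thread] = bitmask of HW threads sharing that cache level.
    std::vector<std::uint64_t> share_mask[kCacheLevels];
};

// "Hot in cache": the team is the set of threads sharing the chosen cache
// level with the caller -- one table lookup and one bitmask.
std::uint64_t team_hot_in_cache(const TopologyTable& topo,
                                CacheLevel level, int self_thread) {
    return topo.share_mask[level][self_thread];
}

// "Not in cache": pick the L3 group with the most idle threads -- a little
// more work, in the hope of better cache hit rates later.
std::uint64_t team_not_in_cache(const TopologyTable& topo,
                                std::uint64_t idle_mask) {
    std::uint64_t best = 0;
    std::size_t best_idle = 0;
    for (std::uint64_t group : topo.share_mask[L3]) {
        std::size_t idle = std::bitset<64>(group & idle_mask).count();
        if (idle >= best_idle) { best_idle = idle; best = group; }
    }
    return best;
}[/cpp]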
If you or Matteo have an example program that _should_ exhibit improvement with cache- and NUMA-aware scheduling (but has not, due to the lack of a threading tool), I would appreciate you contacting me. A good sample program would be nice, but good sample programs are hard to come by. The sample program has to do real work.
Jim Dempsey
Jim, and what do you think regarding the possibility of locality/NUMA-aware scheduling *without* any hints from the programmer? It will definitely be less efficient than programmer-driven scheduling (like in QuickThread), but is there something the runtime scheduler can take advantage of automatically? Or is blind random scheduling the best practical solution?
When operating *without* any hints from the programmer, such as what is available with TBB, Cilk, and OpenMP, the designers of those schedulers can either take a "don't care" approach .OR. take an approach that has some potential for benefit on a NUMA platform. I haven't examined closely what TBB does, but I think the effect will be something along the lines of: when affinity pinning is in effect, parallel constructs tend to favor closer threads first.
While this technique works well in the innermost loops, the outermost loops (when you have multiple levels of nesting and multiple sockets/nodes) may tend to interfere with your cache utilization. In some cases you will want to scatter the outer loops amongst sockets/nodes but schedule the innermost loops within a socket/node. When you, as a programmer, have no means of stating your preference (TBB/Cilk), the task scheduler has only one selection criterion (don't care, or nearest/farthest). The programmer, though, has a priori knowledge of the data access patterns and should be able to supply a reasonable hint as to the most effective task scheduling strategy, provided they have a means of specifying this hint (e.g. QuickThread).
Although the following is not definitive:
http://www.quickthreadprogramming.com/Comparative%20analysis%20between%20QuickThread%20and%20Intel%20Threading%20Building%20Blocks%20009.pdf
page 25, "Throughput of pipeline"
you can see an example of the QuickThread parallel_pipeline run on a Dell R710 dual-socket Xeon 5570 NUMA-class system with a RAID10 disk controller. (Unfortunately I cannot paste the chart in here.)
The QuickThread parallel_pipeline is NUMA aware. The default settings are to allocate 4 buffers per hardware thread, distributed across the available NUMA nodes. 2 additional buffers are allocated as a provision for 2 additional I/O threads. This system has 16 HW threads (2 x 4-core with HT), so this QT pipeline runs with 18 threads. The 2 I/O threads (one for the input end and one for the output end of the pipeline) are not affinity pinned; the 16 compute threads are affinity pinned. When the input end (pipe) of the pipeline receives a buffer, it may come from any NUMA node (2 on this system). When the read is done, the buffer is passed from the I/O-class thread to a compute-class thread with a preference to run on the NUMA node from which the buffer was allocated.
The test program is relatively simple in that it performs a sequential read of a file, up-casing words, while writing the results out to an output file. The file was about 659 MB in size. This particular setup scheduled adjacent HW threads (e.g. core, core's HT sibling, next core, ...), although it would have been relatively easy to schedule core, core, core and then back-fill with HT siblings.
The scaling was linear up to the point where the compute threads and I/O threads were competing for HW thread resources (a little dip from linear scaling). The slope was roughly linear but not 1:1, as there appears to be a fixed overhead per I/O request to the operating system (Windows Server 2008 x64).
When all 16 compute-class threads were running, the I/O bandwidth was about 1.65 GB/s, meaning the computation throughput was also about 1.65 GB/s (fewer writes than reads, but each write within the same 8-byte word), for a total application memory throughput of about 3.3 GB/s.
Unfortunately I did not have access to a 4 socket NUMA system to push the test further.
This sample program, although using NUMA capabilities, is not what I would consider a suitable application to demonstrate the benefits of the NUMA capabilities of QuickThread. Sample programs would be welcome, as well as time on larger NUMA-class systems.
Jim Dempsey
[cpp]for (t = 0; t < T; ++t)
    cilk_for (i = 1; i < N-1; ++i)
        x[(t+1)%2][i] = .5 * (x[t%2][i+1] + x[t%2][i-1]);[/cpp]
(This is a pretty stupid solver of the 1D Poisson equation. The particular form of the right-hand side does not matter much for this discussion, as long as it depends on indices i-1, i, and i+1.)
A NUMA-aware scheduler that always does iterations 0..N/P on core 0, N/P..2N/P on core 1, etc., minimizes the shuffling of data. With Cilk and TBB, data moves around all the time.
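For concreteness, a minimal sketch of such a fixed partitioning using plain threads (thread-to-core/NUMA pinning and a persistent thread pool are omitted; this is not how Cilk or TBB schedule the loop):
[cpp]#include <thread>
#include <vector>

// Thread p always works on its own slice of the array, so the same data
// stays on the same NUMA node across time steps. A real implementation
// would pin threads and reuse a thread pool instead of respawning.
void relax_fixed_partition(std::vector<double> x[2], int N, int T, int P) {
    for (int t = 0; t < T; ++t) {
        std::vector<std::thread> workers;
        for (int p = 0; p < P; ++p) {
            int lo = 1 + p * (N - 2) / P;
            int hi = 1 + (p + 1) * (N - 2) / P;
            workers.emplace_back([&, t, lo, hi] {
                for (int i = lo; i < hi; ++i)
                    x[(t + 1) % 2][i] = 0.5 * (x[t % 2][i + 1] + x[t % 2][i - 1]);
            });
        }
        for (std::thread& w : workers) w.join();  // barrier between time steps
    }
}[/cpp]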
You could argue that you need NUMA-awareness in this case, but the problem is that the naive nested loop that I showed above is a bad algorithm even in the sequential case, because it incurs a zillion cache misses. My paper describes a much better algorithm that works better sequentially, scales nicely with Cilk, and does not benefit at all from NUMA awareness. It is also more complicated than the simple nested loop shown above. So the tradeoff is between the complexity of a better algorithm, and the complexity of NUMA awareness (which still does not beat the better algorithm). I am not sure what the right answer is; presumably different people will prefer different design points.
Dmitry,
don't be silly. We use Fibonacci as an example because it fits on a slide.
Of course we also did the usual matrix multiplication, LU, stencil, FFT, the usual boring stuff. We also tested Cilk with Cilkchess, at the time one of the best chess programs in the world (3rd place in 1994 International Computer Chess championship on 512 cores, 2nd place in 1995 on 1824 cores, 1st Place in the 1996 Dutch Open Computer Chess Championship on 12 cores, 3rd place in 1999 World Computer Chess Championships on 256 cores). We like having fun :-)
No. We are talking about NUMA here, not caching (although the two problems are related, of course).
After you steal, data that is in remote DRAM is still in *remote* DRAM, and *cached* in local cache. Once evicted from the cache, the data must be fetched again from *remote* DRAM.
And this is precisely my point. If your algorithm uses the cache well, it does not matter whether the first cache miss is local or remote. If your algorithm does not use the cache well, you are hosed anyway.
Fib is really a bad parallelization example. A good serial code is 15,000,000 times faster than that stupid recursive example (on a 4-core Q6600).
[cpp]// Serial non-recursive method to calculate Fibonacci series
// *** Note, several orders of magnitude faster than all
// *** recursive techniques.
long SerialFib2( long n )
{
    if( n < 2 )
        return n;
    long fib0 = 1;  // most recent
    long fib1 = 1;  // 1 prior to most recent
    long fib2 = 0;  // 2 prior to most recent
    long i;
    for(i=2; i < n; ++i)
    {
        fib2 = fib1;
        fib1 = fib0;
        fib0 = fib1 + fib2;
    }
    return fib0;
}[/cpp]
Jim Dempsey
[cpp]// *** Note, several orders of magnitude faster than all // *** recursive techniques.[/cpp] is not true. Recursion is part of a good algorithm for Fib, and would be a good study of parallelism. It's just that a good algorithm for Fib will not fit on a slide, because it involves parallel FFTs.
Here's what a good algorithm for Fib might look like. Use repeated squaring of a 2x2 matrix (as noted in a footnote in the TBB tutorial). The matrix is [[1 1][1 0]]. It only requires O(lg(n)) squarings and some matrix additions to compute Fib(n). The catch is that Fib(n) grows exponentially, so large multiprecision arithmetic is required to represent the result for large n. Fast algorithms for large multiprecision multiplication use FFTs. FFTs are best tackled by cache-oblivious algorithms, which are recursive and can be parallelized efficiently.
Asymptotic analysis: the precision [number of digits] of the values grows as O(lg(phi^n)), which is O(n) digits per numeral. Each FFT on an O(n)-digit numeral takes time O(n lg n). For O(lg n) squarings, that's time O(n lg^2 n). The non-recursive algorithm using summation requires O(n) additions, each requiring up to time O(n). That's time O(n^2). Hence the simple iterative algorithm is not faster than all recursive techniques.
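For what it's worth, a minimal sketch of the repeated-squaring idea in plain 64-bit arithmetic; it only reaches Fib(93) before overflowing, and the multiprecision/FFT machinery needed for large n is omitted:
[cpp]#include <cstdint>
#include <iostream>

// 2x2 matrix [[a b][c d]]; the Fibonacci matrix is [[1 1][1 0]], and
// [[1 1][1 0]]^n = [[F(n+1) F(n)][F(n) F(n-1)]].
struct Mat2 { std::uint64_t a, b, c, d; };

Mat2 mul(const Mat2& x, const Mat2& y) {
    return { x.a * y.a + x.b * y.c, x.a * y.b + x.b * y.d,
             x.c * y.a + x.d * y.c, x.c * y.b + x.d * y.d };
}

// O(lg n) squarings; overflows uint64_t past n = 93, which is where
// multiprecision arithmetic (and eventually FFT multiplication) takes over.
std::uint64_t fib(unsigned n) {
    Mat2 result = {1, 0, 0, 1};   // identity
    Mat2 base   = {1, 1, 1, 0};   // Fibonacci matrix
    for (; n > 0; n >>= 1) {
        if (n & 1) result = mul(result, base);
        base = mul(base, base);
    }
    return result.b;              // F(n)
}

int main() {
    std::cout << fib(90) << "\n"; // 2880067194370816120
}[/cpp]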
It would be fun to have a contest to see who could write the fastest program for computing fib(10^9) :-)
