Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Is the task stealer NUMA aware?

Prasun_Gera__Intel_
Is TBB's task stealing mechanism NUMA aware? That is, for example, assume that there are four sockets (NUMA nodes) in the system with a four-core chip on each of them, and each socket has its own low-latency local memory. When the task queue of a particular core runs out of tasks, will it first try to steal from the cores on the same socket and then from remote ones? Are there any other NUMA-related performance issues?
Dmitry_Vyukov
Valued Contributor I
Is TBB's task stealing mechanism NUMA aware? That is, for example, assume that there are four sockets (NUMA nodes) in the system with a four-core chip on each of them, and each socket has its own low-latency local memory. When the task queue of a particular core runs out of tasks, will it first try to steal from the cores on the same socket and then from remote ones? Are there any other NUMA-related performance issues?

No, the TBB scheduler is not NUMA aware; it uses Cilk-style randomized work stealing. Randomized work stealing does not care about NUMA, but it does provide some theoretical guarantees on the time and space complexity of the parallel computation.
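For readers unfamiliar with the term, the sketch below shows the basic idea of randomized work stealing: an idle worker picks a victim uniformly at random, with no regard for NUMA distance, and takes the oldest task from the victim's deque. The types and names are illustrative only; this is not the actual TBB scheduler code, and all synchronization is omitted.

[cpp]// Minimal sketch of randomized work stealing (illustrative only;
// not the TBB scheduler code; synchronization omitted).
#include <cstddef>
#include <deque>
#include <optional>
#include <random>
#include <vector>

struct Task { /* user work */ };

struct Worker {
    std::deque<Task> tasks;   // owner pushes/pops at the back (LIFO)
};

// Called by a worker whose own deque is empty.
std::optional<Task> try_steal(std::vector<Worker>& workers, std::size_t self,
                              std::mt19937& rng)
{
    std::uniform_int_distribution<std::size_t> pick(0, workers.size() - 1);
    for (int attempt = 0; attempt < 100; ++attempt) {
        std::size_t victim = pick(rng);       // uniformly random victim,
        if (victim == self) continue;         // regardless of NUMA distance
        Worker& v = workers[victim];
        if (!v.tasks.empty()) {
            Task t = v.tasks.front();         // take the oldest (typically largest) task
            v.tasks.pop_front();
            return t;
        }
    }
    return std::nullopt;                      // give up after repeated failures
}
[/cpp]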

Prasun_Gera__Intel_
Quoting - Dmitriy Vyukov

No, the TBB scheduler is not NUMA aware; it uses Cilk-style randomized work stealing. Randomized work stealing does not care about NUMA, but it does provide some theoretical guarantees on the time and space complexity of the parallel computation.


Hi, thanks for the tip about the QuickThread library. I'm reading the white paper and the manual, and will get back to you. However, I still find it unsettling that TBB doesn't take NUMA architectures into account. I suppose that trying to steal from the same NUMA node first reduces the set of victim threads and might incur a penalty in the form of more unsuccessful steals, but that is largely problem dependent: the occupancy of the queues and the skew between nodes at any point in time vary from problem to problem, and the choice could be left to the user. Also, what kind of theoretical bounds does random stealing provide?

Dmitry_Vyukov
Valued Contributor I
Hi, thanks for the tip about the QuickThread library. I'm reading the white paper and the manual, and will get back to you. However, I still find it unsettling that TBB doesn't take NUMA architectures into account. I suppose that trying to steal from the same NUMA node first reduces the set of victim threads and might incur a penalty in the form of more unsuccessful steals, but that is largely problem dependent: the occupancy of the queues and the skew between nodes at any point in time vary from problem to problem, and the choice could be left to the user. Also, what kind of theoretical bounds does random stealing provide?

Regarding QuickThread, you may contact Jim Dempsey directly; you can find his email at http://www.quickthreadprogramming.com

Regarding NUMA and random stealing, I think I will not be able to retell all that stuff, so please refer to "Space-Efficient Scheduling of Multithreaded Computations":
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.44.6193
and other Cilk papers:
http://supertech.csail.mit.edu/papers.html

I agree with you that NUMA awareness would indeed help efficient scheduling, and not only NUMA awareness but also HyperThreading awareness, shared-cache awareness, etc. Actually, I incline towards the opinion that the only good thing about randomized stealing is that it allows one to make statements like "Each thread is located independently at random, and hence, the random variable Zi has a binomial distribution with P lg P trials and success probability 1/P". IMVHO a more practical scheduler will tend to be locality/NUMA/cache/HT aware, and apply some supervision on top of that to prevent the possibility of very bad executions, so to say. I.e. a randomized scheduler penalizes all cases in the name of theoretical properties, while a practical scheduler must preserve locality in the common case and preserve the theoretical properties through some episodic low-overhead activity.

Matteo_Frigo
Beginner
Dmitriy,

it is unfair to say that a randomized scheduler penalizes all cases in the name of theoretical properties. It turns out that when you actually try policies like the one suggested by the original poster, they just don't work. For your information, we (the Cilk research group at MIT) tried locality-aware work stealing in 1997 and it did not make any difference. We tried on what you might consider the mother of all NUMA machines---a cluster of SMPs with shared memory over a network (I forgot which one---could even have been ethernet). Keith Randall has some discussion about this topic in his Ph.D. thesis. So the lack of NUMA-awareness in Cilk and TBB is not due to theoretical considerations, but to the fact that there seems to be no practical way to produce a scheduler that will take advantage of NUMA.

This problem is much more complicated than it appears at first sight. For example, every steal causes a disruption of the memory access pattern, in the worst case requiring a transfer of the whole cache of the ``victim'' processor to the ``thief'' processor. The steal operation itself costs nothing (a few hundred cycles), but the cache disruption costs you tens of thousands of cycles. This is a real phenomenon that you can observe in common algorithms, such as LU decomposition or stencils. From this perspective, you want to minimize the number of steals. Randomized work stealing does that; any attempt to bias the stealing activity will in general increase the number of steals, and thus of cache misses.

To make things even more complicated, there is the issue of what to steal. The Cilk/TBB policy is a pretty good one for a parallel machine with private (i.e., non-shared) caches, because in practice it tends to steal big chunks of work that are independent of each other. (This is not a theoretically necessary property, but in practice this is a common case.) However, the Cilk policy is bad if caches are shared between the thief and the victim, because it causes two big chunks of work to reside in the same cache, which tends to overflow the cache. Nobody knows what a good stealing policy would be on machines with both shared and private caches, i.e., any mainstream multicore processor sold today :-( There are some partial results on this topic: read my paper ``The cache complexity of multithreaded cache oblivious algorithms'' for a partial understanding of the private-cache case, and the referenced work by Blelloch and Gibbons for the case of shared caches. (Hyperthreading falls into the shared-cache category, sharing everything including L1.)

More complicated still, keep in mind that there is no such thing as ``NUMA'' or ``affinity'' once you take into account that your algorithm may be called in parallel multiple times---that is, any practical software must be usable as a subroutine in a larger context. E.g., you may try to place data and schedule things carefully so that the computation happens where the data is, but your careful work goes out of the window once the user of your program spawns something else in parallel with it. You must be careful in not evaluating an algorithm in isolation, without reference to the larger context.

In conclusion, the issue is complicated, but you can be sure that existing designs such as Cilk and TBB are more motivated by what works in practice than by any kind of theoretical consideration. It just happens that the Cilk theory is one of the few theories that work :-)
RafSchietekat
Valued Contributor III

Interesting... If you don't mind a naive question (without first having read the available materials), what would you say to a heuristic that would first examine possible work from "close" threads in the hardware sense (cache layout), but limit how "far" the work can be in the software sense (task depth difference)? If no work is available from a random victim in the first candidate group within the eligible-depth window, the algorithm would examine another random victim but from both an enlarged candidate group and depth window. This may very well require experimentation and tuning (made more difficult because it would likely depend on what code is being executed), but some applications may warrant such an effort.

(Added) If it has been addressed in a reference, just stating that is OK for me.

Matteo_Frigo
Beginner
Quoting - Raf Schietekat

Interesting... If you don't mind a naive question (without first having read the available materials), what would you say to a heuristic that would first examine possible work from "close" threads in the hardware sense (cache layout), but limit how "far" the work can be in the software sense (task depth difference)? If no work is available from a random victim in the first candidate group within the eligible-depth window, the algorithm would examine another random victim but from both an enlarged candidate group and depth window. This may very well require experimentation and tuning (made more difficult because it would likely depend on what code is being executed), but some applications may warrant such an effort.

(Added) If it has been addressed in a reference, just stating that is OK for me.


Raf,

this is one of the things that Keith Randall tried. He also tried things of the form: steal from ``close'' workers with probability X, and from ``far'' workers with probability (1-X). Tweak X. None of these attempts made much difference. He has a few comments at the end of his thesis, but academia being what it is, you cannot publish things that don't work, so everybody has to repeat the same mistakes all over again :-)

There are two fundamental problems, I think, that any scheme of this kind will incur. The first is that whatever work is in a queue close to you is not necessarily using data that is close to you, so there is no reason why stealing close work would be a good idea (except for reducing the cost of stealing itself). The second is that, even assuming that there is a correlation between the location of the work and the location of the data, this correlation is broken as soon as you perform one remote steal. Assume that worker W manages to steal a remote chunk of work. From that point on, every worker closer to W will prefer stealing from W, and will steal exactly the wrong thing (because W's work was using remote data to begin with).

I suspect that there is a thermodynamic analogy at play here, although I have no proof. If you drop ink in water, the ink diffuses as a brownian motion, which is in the limit mathematically equivalent to a randomized work-stealing from the nearest neighbors in a 3D grid. Over time, the ink diffuses into a maximum-entropy distribution that is indistinguishable from the distribution that you would get by allowing Cilk-style stealing of ``work molecules'' (this would happen with ink molecules of infinite velocity). So the system always converges with the same equilibrium distribution; the only difference is that randomized work stealing converges to the equilibrium faster.
RafSchietekat
Valued Contributor III
"academia being what it is, you cannot publish things that don't work"
Ah?

"whatever work is in a queue close to you is not necessarily using data that is close to you"
Hmm...
Dmitry_Vyukov
Valued Contributor I
it is unfair to say that a randomized scheduler penalizes all cases in the name of theoretical properties. It turns out that when you actually try policies like the one suggested by the original poster, they just don't work. For your information, we (the Cilk research group at MIT) tried locality-aware work stealing in 1997 and it did not make any difference. We tried on what you might consider the mother of all NUMA machines---a cluster of SMPs with shared memory over a network (I forgot which one---could even have been ethernet). Keith Randall has some discussion about this topic in his Ph.D. thesis. So the lack of NUMA-awareness in Cilk and TBB is not due to theoretical considerations, but to the fact that there seems to be no practical way to produce a scheduler that will take advantage of NUMA.

Hi Matteo,

Well, I can't say that there is a practical way to produce a scheduler that will take advantage of NUMA, because I did not create one... ah, and have not even tried to create one.


This problem is much more complicated than it appears at first sight. For example, every steal causes a disruption of the memory access pattern, in the worst case requiring a transfer of the whole cache of the ``victim'' processor to the ``thief'' processor. The steal operation itself costs nothing (a few hundred cycles), but the cache disruption costs you tens of thousands of cycles. This is a real phenomenon that you can observe in common algorithms, such as LU decomposition or stencils. From this perspective, you want to minimize the number of steals. Randomized work stealing does that; any attempt to bias the stealing activity will in general increase the number of steals, and thus of cache misses.

My key point is: how can a *randomized* scheduler ever ensure that, or anything at all? It may manage not to do too badly, but it can't do especially well either.


To make things even more complicated, there is the issue of what to steal. The Cilk/TBB policy is a pretty good one for a parallel machine with private (i.e., non-shared) caches, because in practice it tends to steal big chunks of work that are independent of each other.


In what way? A random scheduler is just as happy with small chunks. I would say that it actually tends to steal small chunks of work (I guess there are more of them), but sometimes, "with probability 1/P", it also steals big chunks. If one wants to steal big chunks -> steal big chunks, do not steal at random. What am I missing here?


More complicated still, keep in mind that there is no such thing as ``NUMA'' or ``affinity'' once you take into account that your algorithm may be called in parallel multiple times---that is, any practical software must be usable as a subroutine in a larger context. E.g., you may try to place data and schedule things carefully so that the computation happens where the data is, but your careful work goes out of the window once the user of your program spawns something else in parallel with it. You must be careful in not evaluating an algorithm in isolation, without reference to the larger context.

Indeed. If I have 2 NUMA nodes and need to sort 2 independent arrays, I would prefer to dedicate the first NUMA node to the first array and the second NUMA node to the second array, rather than blending everything.

Regarding practical ways, what do you think of the following?
For Hyperthreading (this somewhat applies to shared caches too): always try to steal from the HT sibling first, in FIFO order. HT siblings do not have enough resources to work on separate tasks, so it's better for them to help each other with a single task. Steals between HT siblings are naturally almost free, so more frequent steals won't hurt.
Locality-aware stealing: steal in the following order: (1) prefer the nearest victim with a lot of work, (2) then any victim with a lot of work, (3) then the nearest victim, (4) then everything else. I.e. stealing is based on a distance/amount_of_work metric (a rough sketch is given below). Locality is good and stealing big chunks is good, right? So why not prefer that?
Work-first vs. help-first: try to balance between work-first and help-first scheduling for the following code (your example):
while (something)
cilk_spawn foo();
I.e. try to steal even more work in specific situations.

Actually "smart" scheduler may introduce some randomness too to not behave too bad in unexpected situations. For example, it can do "non-random" scheduling on even steals and do "random" scheduling on odd steals; that way probability of something good/bad is just 1/2P instead of 1/P for pure random scheduler; but who cares, it's just as we have 2 times more processors.

P.S. I still remember the following conversation:
http://software.intel.com/en-us/forums/showthread.php?t=69681
I just don't have much time right now... and it's quite difficult to reverse-engineer the assembly (I've intercepted the Cilk++ "preprocessor" intermediate output file, which helps somewhat).

Dmitry_Vyukov
Valued Contributor I
this is one of the things that Keith Randall tried. He also tried things of the form: steal from ``close'' workers with probability X, and from ``far'' workers with probability (1-X). Tweak X. None of these attempts made much difference. He has a few comments at the end of his thesis, but academia being what it is, you cannot publish things that don't work, so everybody has to repeat the same mistakes all over again :-)


It's not surprising that it did not make much difference if he tested on a Fibonacci program :)


The first is that whatever work is in a queue close to you is not necessarily using data that is close to you, so there is no reason why stealing close work would be a good idea (except for reducing the cost of stealing itself).

Well, there are 1001 ways a user can compromise performance; the runtime scheduler will never win that game.
However, the runtime scheduler may provide good performance for good programs. Don't you require busy leaves, etc.? For matrix multiplication, merge sort, and all that divide-and-conquer stuff, a close queue means close data. I think it's worth providing better performance for at least those programs.

The second is that, even assuming that there is a correlation between the location of the work and the location of the data, this correlation is broken as soon as you perform one remote steal. Assume that worker W manages to steal a remote chunk of work. From that point on, every worker closer to W will prefer stealing from W, and will steal exactly the wrong thing (because W's work was using remote data to begin with).

After a remote steal, the stolen remote data becomes local data. So every worker close to W will be doing the right thing rather than the wrong one. What else can it do? It can't steal anything local (there is nothing local, since W did a remote steal). It can only steal even more remote work and cause even more misses; is that the right thing?
Since W has already stolen the remote piece of work X, W will have to fetch all of X sooner or later. So if workers close to W help with that, they do not increase the number of misses in any way (they do increase it if they do something else instead).



I suspect that there is a thermodynamic analogy at play here, although I have no proof. If you drop ink in water, the ink diffuses as a brownian motion, which is in the limit mathematically equivalent to a randomized work-stealing from the nearest neighbors in a 3D grid. Over time, the ink diffuses into a maximum-entropy distribution that is indistinguishable from the distribution that you would get by allowing Cilk-style stealing of ``work molecules'' (this would happen with ink molecules of infinite velocity). So the system always converges with the same equilibrium distribution; the only difference is that randomized work stealing converges to the equilibrium faster.

Well, ok, suppose we have the following distribution of work:

02111111111120

Which is better:
1. Fill the first '0' by splitting the '2' next to it (in the second position), and fill the last '0' by splitting the '2' next to it (in the next-to-last position).
2. Do the opposite, filling each '0' from the far '2'.


Prasun_Gera__Intel_

Raf,

this is one of the things that Keith Randall tried. He also tried things of the form: steal from ``close'' workers with probability X, and from ``far'' workers with probability (1-X). Tweak X. None of these attempts made much difference. He has a few comments at the end of his thesis, but academia being what it is, you cannot publish things that don't work, so everybody has to repeat the same mistakes all over again :-)

Hi,
I haven't read the thesis in detail, but yes, I read the last part where he talks about taking the ratio of local-access to remote-access latency into account for biasing the stealing, which is exactly what I meant. So when you say that the actual stealing takes only a few cycles: if the latency is included in the stealing cost, doesn't the disparity between local stealing and remote stealing manifest quite strongly in the case of NUMA?

There are two fundamental problems, I think, that any scheme of this kind will incur. The first is that whatever work is in a queue close to you is not necessarily using data that is close to you, so there is no reason why stealing close work would be a good idea (except for reducing the cost of stealing itself).

If we compare the probability of successfully stealing related ("data that is close to you") work within a node to the probability of successfully stealing related work from any core at random, this is the crude, oversimplified calculation that I could come up with:

Assumptions:
p cores per node
n such nodes
pn cores in all
For a given thief at any point in time, there are pn-1 potential victims.
m of them have related data (assume m > p for this case).

For a random steal, probability of success = m/(pn-1)

For a local steal, probability of success = (1/(p-1))*P(exactly 1 core in the node has related data) + (2/(p-1))*P(exactly 2 cores have related data) + ... + 1*P(all p-1 other cores have related data)

and P(k of the p-1 other cores in the node have related data) = C(p-1,k) * (m/(pn-1))^k * (1 - m/(pn-1))^(p-1-k)

So we can calculate the probability that a steal gets related data for both the randomized and the biased case.
Similarly, we can calculate the probability of a steal being successful for both cases.

So the expected time for a steal = t_success * P(success) + t_unsuccessful * P(failure),
where t_success = t_related * P(related) + t_unrelated * P(unrelated),

where we capture all sorts of benefits and penalties in the t terms.

So my question is: wouldn't there be cases where, for certain values of the variables in the above equations, t_biased would outperform t_random? As I was trying to say earlier, isn't it significantly dependent on the problem and/or the hardware topology?

The second is that, even assuming that there is a correlation between the location of the work and the location of the data, this correlation is broken as soon as you perform one remote steal. Assume that worker W manages to steal a remote chunk of work. From that point on, every worker closer to W will prefer stealing from W, and will steal exactly the wrong thing (because W's work was using remote data to begin with).

Can't the same argument about probabilities be applied to this too? I mean, if we look at the probabilities, I feel it's again about trying different things: for some problems one approach would work, and for others the other would.
RafSchietekat
Valued Contributor III
Perhaps a bit more articulate this time:

"If you drop ink in water, the ink diffuses as a brownian motion, which is in the limit mathematically equivalent to a randomized work-stealing from the nearest neighbors in a 3D grid."
It's not really ink so much as grains. And if you process grains faster in a localised setting (hyperthreaded core, cores sharing a cache) perhaps they locally disappear more quickly, counteracting the spread.

"So the system always converges with the same equilibrium distribution; the only difference is that randomized work stealing converges to the equilibrium faster."
But how do you know that that is not a negative thing...

I really should read some of that study material, though. Meanwhile, perhaps somebody who has actually tried to implement and exploit a NUMA-aware multithreading toolkit might decide to chime in?
jimdempseyatthecove
Honored Contributor III

Gera,

A distinction has to be made here between a numerical probability derived from the assumption of random selection within a subset and the deterministic selection within a scheduler. Depending on the circumstances (tasks in queue), the numerical probability may be significantly less than 1, whereas the deterministic selection within a scheduler may always yield 1.

When a programmer is given the means to provide an indication of locality preference for task(s) (don't care, preferred set, required set), as well as control over task teams, they can then pursue a coordinated effort at producing a solution.

The trick in doing this is providing a low-overhead means of performing the core and NUMA node selections. When I designed this capability into the QuickThread scheduler, special attention was given to making the overhead as low as possible. Most of the overhead consists of a table lookup and the write of a bitmask. Some table lookups take more time than others.

In the case where no preference is specified (e.g. parallel_for), the team selection is similar to TBB/Cilk: all threads; however, QT has a bias towards the current thread (the thread issuing the parallel_for). When you schedule under "hot in cache" situations, you would specify a team based upon the threads sharing your thread's specified cache level (L1, L2, L3); this is a table lookup for a bitmask. For "not in cache" you might want to direct the scheduler to choose the L3 with the most idle threads. This incurs a little more overhead, but with the hope of a payback in higher cache hits. Or you may have the occasion for each object to contain a preferred or required thread selection. Or certain tasks may be restricted to particular cores (e.g. a task driving a GPU).
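To make the "table lookup of a bitmask" idea concrete, here is a simplified illustration of how a cache-level team mask might be looked up. The structure and names are hypothetical and are not the QuickThread API.

[cpp]// Hypothetical illustration of cache-level team selection via a
// precomputed bitmask table (not the QuickThread API).
#include <cstdint>

enum CacheLevel { L1_SIBLINGS, L2_SIBLINGS, L3_SIBLINGS, ALL_THREADS };

struct Topology {
    // affinity[level][hw_thread] = bitmask of the HW threads sharing that
    // cache level with hw_thread; built once at startup from the CPU topology.
    std::uint64_t affinity[4][64];
};

// Selecting a "hot in cache" team is then a single table lookup.
inline std::uint64_t team_mask(const Topology& topo, CacheLevel level,
                               unsigned hw_thread)
{
    return topo.affinity[level][hw_thread];
}
[/cpp]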

If you or Matteo have an example program that _should_ exhibit improvement with cache and NUMA aware scheduling (but has not due to lack of threading tool) I would appreciate you contacting me. A good sample program would be nice. But good sample programs are hard to come by. The sample program has to do real work.

Jim Dempsey
Dmitry_Vyukov
Valued Contributor I
When a programmer is given the means to provide an indication of locality preference for task(s) (don't care, preferred set, required set), as well as control over task teams, they can then pursue a coordinated effort at producing a solution.



Jim, and what do you think regarding the possibility of locality/NUMA-aware scheduling *without* any hints from the programmer? It will definitely be less efficient than programmer-driven scheduling (as in QuickThread), but is there something a runtime scheduler can take advantage of automatically? Or is blind random scheduling the best practical solution?

jimdempseyatthecove
Honored Contributor III
Quoting - Dmitriy Vyukov


Jim, and what do you think regarding the possibility of locality/NUMA-aware scheduling *without* any hints from the programmer? It will definitely be less efficient than programmer-driven scheduling (as in QuickThread), but is there something a runtime scheduler can take advantage of automatically? Or is blind random scheduling the best practical solution?


When operating *without* any hints from the programmer, as is the case with TBB, Cilk, and OpenMP, the designers of those schedulers can either take a "don't care" approach .OR. take an approach that has some potential for benefit on a NUMA platform. I haven't examined closely what TBB does, but I think the effect will be something along the lines of: when affinity pinning is in effect, parallel constructs tend to favor closer threads first.

While this technique works well for the innermost loops, the outermost loops (when you have multiple levels of nesting and multiple sockets/nodes) may tend to interfere with your cache utilization. In some cases you will want to scatter the outer loops amongst sockets/nodes but schedule the innermost loops within the socket/node. When you have no means as a programmer to express your preference (TBB/Cilk), the task scheduler has only one selection criterion (don't care, or nearest/farthest). The programmer, though, will have a priori knowledge of the data access patterns and should be able to supply a reasonable hint as to the most effective task scheduling strategy, provided they have a means of specifying this hint (e.g. QuickThread).

Although the following is not definitive:

http://www.quickthreadprogramming.com/Comparative%20analysis%20between%20QuickThread%20and%20Intel%20Threading%20Building%20Blocks%20009.pdf

page 25, "Throughput of pipeline"

You can see an example of the QuickThread parallel_pipeline run on a Dell R710 dual-socket Xeon 5570 NUMA-class system with a RAID10 disk controller. (Unfortunately I cannot paste the chart in here.)

The QuickThread parallel_pipeline is NUMA aware. The default settings are to allocate 4 buffers per hardware thread, distributed across the available NUMA nodes. 2 additional buffers are allocated as a provision for 2 additional I/O threads. This system has 16 HW threads (2 x 4-core with HT). This QT pipeline therefore runs with 18 threads. The 2 I/O threads (one for the input end and one for the output end of the pipeline) are not affinity pinned; the 16 compute threads are affinity pinned. When the input end (pipe) of the pipeline receives a buffer, it may come from any NUMA node (2 on this system). When the read is done, the buffer is passed from the I/O-class thread to a compute-class thread with a preference to be run on the NUMA node from which it was allocated.

The test program is relatively simple in that it performs a sequential read of a file, up-casing words, while writing results out to an output file. The file was about 659MB in size. The particular setup scheduled adjacent HW threads (e.g. core, core's HT sibling, next core,...) although it would have been relatively easy to schedule core, core, core then back fill with HT siblings.

The scaling was linear up to the point where the compute threads and I/O threads were competing for HW thread resources (a little dip from linear scaling). The slope was ~linear but not 1:1, as there appears to be a fixed overhead per I/O request to the operating system (Windows Server 2008 x64).

When all 16 compute-class threads were running, the I/O bandwidth was about 1.65 GB/s, meaning the computation throughput was also ~1.65 GB/s (fewer writes than reads, but each write within the same 8-byte word), for a total application memory throughput of about 3.3 GB/s.

Unfortunately I did not have access to a 4 socket NUMA system to push the test further.

This sample program, although it uses NUMA capabilities, is not what I would consider a suitable application for demonstrating the benefit of the NUMA capabilities of QuickThread. Sample programs would be welcome, as well as time on larger NUMA-class systems.

Jim Dempsey




Matteo_Frigo
Beginner
The standard program that benefits from NUMA awareness/affinity is the naive three point stencil:

[cpp]for (int t = 0; t < T; ++t)
  cilk_for (int i = 1; i < N-1; ++i)
     x[(t+1)%2][i] = .5 * (x[t%2][i+1] + x[t%2][i-1]);[/cpp]


(This is a pretty stupid solver of the 1D Poisson equation. The particular form of the right-hand side does not matter much for this discussion, as long as it depends on indices i-1, i, and i+1.)

A NUMA-aware scheduler that always does iterations 0..N/P on core 0, N/P..2N/P on core 1, etc., minimizes the shuffling of data. With Cilk and TBB, data moves around all the time.
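For illustration, such a static, NUMA-friendly schedule of the same loop might look like the sketch below, where block p of the array is always handled by worker p. This is a hypothetical sketch using plain std::thread, not Cilk or TBB code; thread pinning and first-touch placement of x are omitted.

[cpp]// Hypothetical sketch of a static, block-partitioned schedule for the
// stencil above: block p of the array is always processed by worker p.
#include <algorithm>
#include <thread>
#include <vector>

void stencil_static(double* x[2], int N, int T, int P)
{
    for (int t = 0; t < T; ++t) {
        std::vector<std::thread> team;
        for (int p = 0; p < P; ++p) {
            int lo = std::max(1, p * N / P);
            int hi = std::min(N - 1, (p + 1) * N / P);
            team.emplace_back([=] {
                for (int i = lo; i < hi; ++i)
                    x[(t + 1) % 2][i] = .5 * (x[t % 2][i + 1] + x[t % 2][i - 1]);
            });
        }
        for (std::thread& th : team)
            th.join();                       // barrier between time steps
    }
}
[/cpp]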

You could argue that you need NUMA-awareness in this case, but the problem is that the naive nested loop that I showed above is a bad algorithm even in the sequential case, because it incurs a zillion cache misses. My paper describes a much better algorithm that works better sequentially, scales nicely with Cilk, and does not benefit at all from NUMA awareness. It is also more complicated than the simple nested loop shown above. So the tradeoff is between the complexity of a better algorithm, and the complexity of NUMA awareness (which still does not beat the better algorithm). I am not sure what the right answer is; presumably different people will prefer different design points.


Matteo_Frigo
Beginner
Quoting - Dmitriy Vyukov
It's not surprising that it did not make much difference if he tested on a Fibonacci program :)

Dmitry,

don't be silly. We use Fibonacci as an example because it fits on a slide.

Of course we also did the usual matrix multiplication, LU, stencil, FFT, the usual boring stuff. We also tested Cilk with Cilkchess, at the time one of the best chess programs in the world (3rd place in 1994 International Computer Chess championship on 512 cores, 2nd place in 1995 on 1824 cores, 1st Place in the 1996 Dutch Open Computer Chess Championship on 12 cores, 3rd place in 1999 World Computer Chess Championships on 256 cores). We like having fun :-)
Matteo_Frigo
Beginner
Quoting - Dmitriy Vyukov
After a remote steal, the stolen remote data becomes local data.

No. We are talking about NUMA here, not caching (although the two problems are related, of course).

After you steal, data that is in remote DRAM is still in *remote* DRAM, and *cached* in local cache. Once evicted from the cache, the data must be fetched again from *remote* DRAM.

And this is precisely my point. If your algorithm uses the cache well, it does not matter whether the first cache miss is local or remote. If your algorithm does not use the cache well, you are hosed anyway.

jimdempseyatthecove
Honored Contributor III

Dmitry,

don't be silly. We use Fibonacci as an example because it fits on a slide.

Of course we also did the usual matrix multiplication, LU, stencil, FFT, the usual boring stuff. We also tested Cilk with Cilkchess, at the time one of the best chess programs in the world (3rd place in 1994 International Computer Chess championship on 512 cores, 2nd place in 1995 on 1824 cores, 1st Place in the 1996 Dutch Open Computer Chess Championship on 12 cores, 3rd place in 1999 World Computer Chess Championships on 256 cores). We like having fun :-)

Fib is really a bad parallelization example. A good serial code is 15,000,000 times faster than that stupid recursive example (on a 4-core Q6600).

[cpp]// Serial non-recursive method to calculate Fibonacci series
// *** Note, several orders of magnitude faster than all
// *** recursive techniques.
long SerialFib2( long n )
{
	if( n < 2 )
		return n;
	long fib0 = 1;	// most recent
	long fib1 = 1;	// 1 prior to most recent
	long fib2 = 0;	// 2 prior to most recent
	long i;
	for(i=2; i < n; ++i)
	{
		fib2 = fib1;
		fib1 = fib0;
		fib0 = fib1 + fib2;
	}
	return fib0;
}
[/cpp]

Jim Dempsey
ARCH_R_Intel
Employee
The claim
[cpp]// *** Note, several orders of magnitude faster than all   
// *** recursive techniques.[/cpp]
is not true. Recursion is part of a good algorithm for Fib, and would be a good study of parallelism. It's just that a good algorithm for Fib will not fit on a slide, because it involves parallel FFTs.

Here's what a good algorithm for Fib might look like. Use repeated squaring of a 2x2 matrix (as noted in a footnote in the TBB tutorial). The matrix is [[1 1][1 0]]. It only requires O(lg(n)) squarings and some matrix additions to compute Fib(n). The catch is that Fib(n) grows exponentially, so large multiprecision arithmetic is required to represent the result for large n. Fast algorithms for large multiprecision multiplication use FFTs. FFTs are best tackled by cache-oblivious algorithms, which are recursive and can be parallelized efficiently.

Asymptotic analysis: the precision [number of digits] of the values grows as O(lg(phi^n)), which is O(n) digits per numeral. Each FFT on an O(n)-digit numeral takes time O(n lg n). For O(lg n) squarings, that's time O(n lg^2 n). The non-recursive algorithm using summation requires O(n) additions, each requiring up to time O(n). That's time O(n^2). Hence the simple iterative algorithm is not faster than all recursive techniques.
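A minimal, non-parallel sketch of the repeated-squaring idea is shown below. It uses built-in 64-bit integers, so it is only exact up to Fib(93); the asymptotically fast version would substitute FFT-based multiprecision numbers for uint64_t.

[cpp]// Fib(n) via O(lg n) squarings of the matrix [[1 1][1 0]].
// uint64_t overflows beyond Fib(93); a real large-n version would use
// FFT-based multiprecision arithmetic in place of uint64_t.
#include <cstdint>
#include <cstdio>

struct Mat2 { std::uint64_t a, b, c, d; };   // [[a b][c d]]

static Mat2 mul(const Mat2& x, const Mat2& y)
{
    return { x.a * y.a + x.b * y.c,  x.a * y.b + x.b * y.d,
             x.c * y.a + x.d * y.c,  x.c * y.b + x.d * y.d };
}

static std::uint64_t fib(unsigned n)
{
    Mat2 result = { 1, 0, 0, 1 };            // identity
    Mat2 base   = { 1, 1, 1, 0 };
    for (; n; n >>= 1) {                     // result = [[1 1][1 0]]^n
        if (n & 1) result = mul(result, base);
        base = mul(base, base);
    }
    return result.b;                         // [[1 1][1 0]]^n = [[F(n+1) F(n)][F(n) F(n-1)]]
}

int main()
{
    std::printf("%llu\n", (unsigned long long)fib(93));  // largest Fib that fits in 64 bits
    return 0;
}
[/cpp]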

It would be fun to have a contest to see who could write the fastest program for computing fib(10^9) :-)


