I'm looking for a TBB implementation of (dense 2D) matrix-matrix multiplication that scales on ccNUMA (e.g., QuickPath) machines. I tried the cache-oblivious version but was surprised to find it didn't scale much better than the naive N^3 (cache-terrible) version. I did try a whole spectrum of base-case sizes (the size of the sequential leaves of the recursive cache-oblivious algorithm) without success. On a 4x6-core machine I get good scaling up to 6 workers and then performance flattens out: ~1x with 1 worker, ~5x with 6 workers, ~7x with 24 workers (with exclusive use of the machine).
I am attempting to pin workers to cores (http://software.intel.com/en-us/forums/topic/365339), but I'm unsure whether that will help substantially even if I succeed.
Are you using the default auto_partitioner, or simple_partitioner with a grainsize?
(Added) As far as I know, NUMA is not yet supported, so your strategy should probably be to fill the local cache in the finest subdivision (using simple_partitioner and a grainsize), and to exploit the recursive structure to increase the odds that the next level(s) up are executed on the same thread/core and hence the same next-level cache. After that I don't know, but maybe somebody else will.
The cache-oblivious algorithm is recursive (it uses parallel_invoke with 2 functors), so partitioners don't come into play. It is supposed to be cache optimal, but perhaps worker migration is messing things up (although I don't expect that to be the problem).
For the naive approach I've tried simple and auto partitioners with different grainsizes.
I have several other benchmarks that scale well on my ccNUMA. But MM doesn't so I'm trying to figure out why.
Er, that wasn't very helpful (to put it mildly), sorry. I should have just mentioned that NUMA is not yet supported as far as I know. As long as no stealing is involved, the lowest subdivisions will probably execute on the same thread/core/local cache, but after that there's no built-in mechanism to marry the algorithm hierarchy to the cache hierarchy.
I'm now looking at a cache-oblivious algorithm in "Structured Parallel Programming" by McCool, Robison and Reinders for exactly this purpose. It's in pseudocode, and TBB is mentioned, but not NUMA, except for "cache structure" where it probably really only supports various values of "cache size". Perhaps I could compare your code to it, although this algorithm is probably quite standard and so would have the same issues about NUMA.
If that is the case, you may be better off with something other than TBB, I'm sorry to say. I don't know if MKL does any better? Perhaps you could roll your own special-purpose version, depending on the size of the project. I once wrote a simple one for Java that would just demo a Mandelbrot set, so maybe it's not that hard even with NUMA added to the mix. But that's still not something you should have to do.
(Added) I would certainly examine MKL or perhaps other such libraries first, because they may well prioritise performance at the cost of weird tricks to deal with NUMA even if TBB doesn't. Cache-oblivious for NUMA sounds funny to me, though: you can obviously have something recursive that will fit in cache up to a certain level and you wouldn't have to know, something that will benefit sequential algorithms as well as parallel ones, but how are you going to decide to steal intermediate levels without knowing more about cache hierarchy? It seems to me that you will want to be able to tell the scheduler to steal the highest levels as far away as possible, and lower levels closer by, but then you would probably want to know those highest dimensions before you start, to be able to let child tasks inherit a range or so. But I'm just guessing...
Thank you, you have given me good leads. I will have to think in more detail about what is happening that prevents scaling so drastically. I will also take a look at MKL, but my motivation for looking into this is research on flexible run-time systems (work stealing v2.0). With the cache-oblivious algorithm I'm using, I would expect to see good scaling from the simple strategy of always stealing the outermost task, because the sequential tile size is small enough to fit in L1 cache and 6 tiles should fit in the L2 of each hexacore. I suspect the lack of scalability in my implementation comes from the cells of the result array bouncing from chip to chip. I plan to try using some sort of reducer to keep these updates to a minimum. I'll post if I make some progress.
With the cache-oblivious algorithm I'm using, I would expect to see good scaling from the simple strategy of always stealing the outermost task, because the sequential tile size is small enough to fit in L1 cache and 6 tiles should fit in the L2 of each hexacore.
If you're doing research anyway, you might want to verify first that stealing from other sockets indeed has 3/4 probability in TBB's current implementation. Those seem like favourable odds, but the same odds also apply at lower levels of the recursion hierarchy, where they work against the locality that you are looking for, maybe more than enough to upset results, because on average only about (1-3/4)*(6-1) = 5/4 threads end up sharing the L2 cache (best I could do in a few seconds, probably close enough to the real value). Because communication trumps computation, you'd be a lot better off hard-coding 4 independent parts of the computation. With TBB, the easiest way to do that is probably to have 4 application threads, each of which will have an arena of workers that you can bind to cores using task_scheduler_observer. Maybe you can also explore the previewed direct arena API.
I suspect the lack of scalability in my implementation comes from the cells of the result array bouncing from chip to chip. I plan to try using some sort of reducer to keep these updates to a minimum.
I'm "not sure" (as they say) that I understand what you mean... It sounds to me somewhat as if you suspect that intermediate-level locality is all right and that the problems come from combining at a higher level. I don't see that (see above), so what's a reducer going to do?
(Added) Perhaps it's also a good thing to skew the decision about partitioning the problem somewhat in favour of partitioning at the very highest levels, even if this makes the parts less square than they could be, and less likely to fit in L2 cache. But you'll get there on further subdivisions anyway, and maybe you'll more than gain back what you lost. Just an idea, hopefully worth exploring...
Thanks for the feedback!
To answer the question about what I am hoping the reducer will help with: the current implementation of the cache-oblivious MM (taken from Cilk and adapted to TBB) does not parallelize the reduction operation. This means that at all times there is a single copy of the result array and all workers are updating it. A reducer would create multiple copies of the result array (e.g., one per worker), locally aggregating results with no synchronization, and finally combine the copies using synchronization. It is easy to use a reducer for the naive 3-nested-loop matrix multiply, but for the cache-oblivious algorithm it's not as straightforward. I haven't looked at it yet.
I admit that I'm not really familiar with the algorithm. I only know of the implementation in "Structured Parallel Programming", section "8.8 Cache Locality and Cache-Oblivious Algorithms", on pp. 227-230 (perhaps you are doing what is being referred to as the Strassen method instead?). I don't see a global reduction operation, but there's a non-obvious argument made against parallelising the multiplications before adding their results in the "else" clause, based on memory-bandwidth considerations, something where a NUMA machine without appropriate coding would seem especially vulnerable, so be careful with what your intuition tells you.