Intel® oneAPI Threading Building Blocks

Extending TBB to Clusters: My Summer Research Project

AJ13
New Contributor I
I've been busy constructing my cluster library, which currently involves wrapping the C RDMA API (librdmacm) in C++ constructs to get exception safety and, in general, to make my life easier by giving the library a C++ feel. Rather than giving a nice design document with diagrams, I thought I'd give a verbal outline of what I'm doing via this post.

RDMA allows you to transfer memory directly from one compute node to another (and with the right equipment, e.g. InfiniBand cards, that transfer involves no kernel handling or CPU time). With direct access to memory, I can transport a class from one system to another and only have to worry about endianness... and pass a message to say "you have something to do, here it is."

The principal concept of my cluster-TBB is to provide remote memory access. Remote memory objects are like a smart pointer, where dereferencing involves fetching the memory from a remote system and making a local copy. Memory objects must meet a defined "memory concept" which basically describes operations like equality and dereferencing. I provide several memory concept classes, including: local_memory, remote_memory, and cached_remote_memory. Obviously remote memory presents a massive potential bottleneck, and I'll get to that in a second. Remote memory objects contain within them everything required to access a remote memory location via RDMA, so they can easily be transported across the network without links to local information (as in, no pointers in them, just POD). These remote objects won't be invalidated if they are dereferenced on a different machine, because they know where to get their memory from. If "remote" memory is accessed by the same system that owns that memory, the RDMA calls should be bypassed. local_memory is just something that matches the memory concept and compiles down to pointer operations; it's there for generic algorithms. RDMA also has atomic operations, so there are also local_atomic and remote_atomic concepts.
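
To make that concrete, here is a rough sketch of the direction I have in mind (placeholder names and members, nothing final): local_memory is just a pointer wrapper, while remote_memory carries only a POD description of where the bytes live and fetches them on dereference.

#include <stdint.h>
#include <cstring>

// Stand-in for a real RDMA read (librdmacm/ibverbs); just a local memcpy here so the sketch compiles.
inline void rdma_read(uint32_t /*node*/, uint64_t addr, uint32_t /*rkey*/, void* dst, std::size_t len) {
    std::memcpy(dst, reinterpret_cast<const void*>(addr), len);
}

// local_memory: matches the memory concept but compiles down to plain pointer operations.
template <typename T>
class local_memory {
    T* p_;
public:
    explicit local_memory(T* p) : p_(p) {}
    T& operator*() const { return *p_; }
    bool operator==(const local_memory& o) const { return p_ == o.p_; }
};

// remote_memory: only POD fields describing the remote location, so the object itself
// can be shipped to another node without being invalidated.
template <typename T>
class remote_memory {
    uint32_t owner_node_;
    uint64_t remote_addr_;
    uint32_t rkey_;
public:
    remote_memory(uint32_t node, uint64_t addr, uint32_t rkey)
        : owner_node_(node), remote_addr_(addr), rkey_(rkey) {}

    // Dereferencing fetches a local copy over RDMA (bypassed when we own the memory).
    T operator*() const {
        T local;
        rdma_read(owner_node_, remote_addr_, rkey_, &local, sizeof(T));
        return local;
    }
    bool operator==(const remote_memory& o) const {
        return owner_node_ == o.owner_node_ && remote_addr_ == o.remote_addr_;
    }
};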

There are many ways to execute parallel algorithms. On a local machine, TBB should manage execution and the concept of remote memory maps to normal pointer operations. On a cluster, my cluster library should be used and remote memory should map to RDMA operations. This is where I love templates. I've defined an enum execution_method { LOCAL, CLUSTER }; which dictates how parallel execution should proceed. A const execution_method variable is set in a header file (parallel_execution.h, local_execution.h) and this drives template specialization under the hood... so that the memory templates are built properly. For local execution, tbb::task_scheduler_init is set up, and for cluster execution, connections to compute nodes are built. This allows the under-the-hood implementation to be completely specialized for the execution method, but the user never sees that. The user just has to think at a high level (with low-level details in mind). Also, this means that methods of parallel execution I haven't thought of can be handled: you just provide a new specialization for a new enum value.
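
As a rough illustration (placeholder names, and the connection setup omitted), the specialization could look something like this:

#include "tbb/task_scheduler_init.h"   // classic TBB API, matching the post

// The execution method is a compile-time constant; in the real layout it would live in
// parallel_execution.h or local_execution.h rather than being inlined here.
enum execution_method { LOCAL, CLUSTER };
const execution_method EXECUTION = LOCAL;

// Primary template left undefined: each execution method must supply its own setup.
template <execution_method M> struct runtime_init;

template <> struct runtime_init<LOCAL> {
    tbb::task_scheduler_init init;   // local build: just bring up the TBB scheduler
};

template <> struct runtime_init<CLUSTER> {
    runtime_init() { /* connect to the other compute nodes over RDMA */ }
};

int main() {
    runtime_init<EXECUTION> rt;      // user code looks identical for either method
    (void)rt;
    return 0;
}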

Now to talk about clusters. The TBB algorithm templates parallel_for and parallel_reduce already do a good job of expressing parallelism at a high level. I intend to steal these templates, put them in a new namespace, and then map them to particular executions based on the execution_method value defined. If it's local execution, then the template should compile down to just TBB calls. In the cluster case, things are a bit more complex. The goal is to have the same high-level feel of TBB, with the user being aware of the low-level concerns that will drive their performance. The other advantage of having a single source tree that compiles to TBB or cluster execution is that users can develop on multi-core machines before moving to the supercomputer. Also, issues that crop up on one execution system might not be there on the other, so the user can debug in two very different environments to see where issues with their code lie.
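
A sketch of that dispatch layer, with the cluster branch stubbed out (hypothetical names, nothing final):

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

enum execution_method { LOCAL, CLUSTER };
const execution_method EXECUTION = LOCAL;   // same constant as in the previous sketch

// One specialization per execution method; the primary template stays undefined.
template <execution_method M> struct parallel_for_impl;

template <> struct parallel_for_impl<LOCAL> {
    template <typename Range, typename Body>
    static void run(const Range& r, const Body& b) {
        tbb::parallel_for(r, b);   // compiles down to a plain TBB call
    }
};

template <> struct parallel_for_impl<CLUSTER> {
    template <typename Range, typename Body>
    static void run(const Range& r, const Body& b) {
        // Placeholder: a real version would split the range across nodes and ship
        // sub-ranges plus a serialized body; stubbed to local TBB so the sketch builds.
        tbb::parallel_for(r, b);
    }
};

namespace cluster {
    template <typename Range, typename Body>
    void parallel_for(const Range& r, const Body& b) {
        parallel_for_impl<EXECUTION>::run(r, b);
    }
}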

The role of my cluster library is only to feed TBB. tbb::tasks (or really a higher-level task that casts down to tbb::tasks, or something like that) are spread out amongst compute nodes for computation using the TBB partitioner concept, applied at the cluster level. Tasks are balanced across compute nodes (and hence across TBB instances, and hence cores) by work-stealing at the node level. When a TBB instance has no work, the cluster library can go out and steal work from a different TBB instance. The TBB Range concept is mapped to compute nodes. At a high level, the TBB templates are still used; they are just mapped in different ways.
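
To illustrate how the Range concept could map onto nodes (placeholder code, reusing the same splitting constructor that TBB's partitioners use):

#include <cstddef>
#include <vector>
#include "tbb/blocked_range.h"

// Recursively split a TBB Range into one chunk per node, exactly the way the
// partitioners split it across cores; each chunk would then feed one node's TBB instance.
template <typename Range>
void split_across_nodes(Range r, std::size_t nodes, std::vector<Range>& out) {
    if (nodes <= 1 || !r.is_divisible()) {
        out.push_back(r);
        return;
    }
    Range rhs(r, tbb::split());                    // Range concept: splitting constructor
    split_across_nodes(r,   nodes / 2,         out);
    split_across_nodes(rhs, nodes - nodes / 2, out);
}

// Usage: carve an index space into per-node chunks before any node-level stealing happens.
// std::vector<tbb::blocked_range<std::size_t> > chunks;
// split_across_nodes(tbb::blocked_range<std::size_t>(0, 1000000), 8, chunks);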

Tasks which use remote_memory operations are able to just go ahead and access memory as if it were local. This means that tasks can be easily serialized for transport... the contained memory objects know where the memory is and how to get it. Dereferencing calls will be mapped to RDMA operations, and the task will have the data it needs during execution. Unfortunately, users might have to use mutexes in some cases, so remote_atomic can be used to construct a cluster-level mutex. This is instant death for performance, but it's up to the user to be careful... there are some cases where a lightly contended mutex might be required for correct execution. I provide ways of reading memory, but there is no way of knowing if someone else changed it while we were using it. Really, multicore programming has the same concern. Users have to develop good parallel algorithms.
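
For example, a cluster-level mutex on top of remote_atomic could be sketched like this (the remote_atomic interface is assumed, with a local stub standing in for the RDMA atomic compare-and-swap):

#include <stdint.h>

// remote_atomic stand-in: a real version would issue an RDMA atomic compare-and-swap
// on the owning node's memory; this local stub just keeps the sketch self-contained.
struct remote_atomic_u64 {
    uint64_t value_;
    remote_atomic_u64() : value_(0) {}
    uint64_t compare_and_swap(uint64_t new_value, uint64_t comparand) {
        uint64_t old = value_;
        if (old == comparand) value_ = new_value;
        return old;
    }
    void store(uint64_t v) { value_ = v; }
};

class cluster_mutex {
    remote_atomic_u64 flag_;   // 0 = free, 1 = held; lives on one well-known node
public:
    void lock() {
        // Every failed attempt is a network round trip -- exactly the performance
        // death mentioned above, so this is strictly a last resort.
        while (flag_.compare_and_swap(1, 0) != 0) { /* back off and retry */ }
    }
    void unlock() { flag_.store(0); }
};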

Memory access will be slow. The performance of my library will be bound by memory access times. Because memory has been abstracted, it's possible to do something about this. I'm going to be looking at building something like the tbb::affinity_partitioner to try to map tasks to nodes where access will be local. The best performance right now is going to come from cluster tasks spawning many tbb::task objects which work on locally constructed objects. So the cluster task is just a package which is used to do a more serious calculation (for instance, the cluster task is an image... and local tbb::tasks are spawned to look at properties of that image).
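
For instance, the image example might look roughly like this (illustrative only; the cluster task is assumed to have already fetched its image into a local buffer):

#include <cstddef>
#include <vector>
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// Body for the on-node work: it operates on a plain local buffer, so every access
// is ordinary (cacheable) memory, not RDMA.
struct brighten {
    unsigned char* pixels;
    void operator()(const tbb::blocked_range<std::size_t>& r) const {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            pixels[i] = static_cast<unsigned char>(pixels[i] + 10);
    }
};

// The cluster-level task is assumed to have pulled the image across the network once;
// after that, all the fine-grained parallelism stays on-node via ordinary tbb::tasks.
void run_cluster_task(std::vector<unsigned char>& local_image) {
    brighten body;
    body.pixels = &local_image[0];
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, local_image.size()), body);
}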

There is another thing to mention with memory... remote_memory will never make it into the cache. This is a killer... because each dereference will wind up going out to the network. For that reason, I'm also providing a cached_remote_memory which will have a local copy of some memory... and that can make it into the cache.
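
A rough sketch of cached_remote_memory (placeholder names, with the RDMA read stubbed out as a local copy so it compiles):

#include <stdint.h>
#include <cstring>

// Same rdma_read stand-in as in the earlier sketch (a local memcpy for illustration).
inline void rdma_read(uint32_t /*node*/, uint64_t addr, uint32_t /*rkey*/, void* dst, std::size_t len) {
    std::memcpy(dst, reinterpret_cast<const void*>(addr), len);
}

// cached_remote_memory: the first dereference fetches a local copy; after that,
// dereferences hit the copy (which can sit in the CPU cache), and the user decides
// when staleness matters by calling refresh().
template <typename T>
class cached_remote_memory {
    uint32_t owner_node_;
    uint64_t remote_addr_;
    uint32_t rkey_;
    T        cache_;
    bool     valid_;
public:
    cached_remote_memory(uint32_t node, uint64_t addr, uint32_t rkey)
        : owner_node_(node), remote_addr_(addr), rkey_(rkey), valid_(false) {}

    const T& operator*() {
        if (!valid_) refresh();   // only the first access goes out to the network
        return cache_;
    }
    void refresh() {
        rdma_read(owner_node_, remote_addr_, rkey_, &cache_, sizeof(T));
        valid_ = true;
    }
};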

This is the general idea of my cluster library. I'd appreciate feedback, and I'll have code to post very shortly. The goal is to create as thin a layer as possible above TBB to enable cluster execution... and I've thought a lot about how to do that.

Thanks,

AJ
ARCH_R_Intel
Employee
How are you dealing with marshaling issues? E.g., if I have a std::list on one node and use RDMA to copy it to another node, the std::list makes no sense on the second node unless every bit was copied to the same address it had on the first node.
RafSchietekat
Valued Contributor III
Hello Adrien, nice to see you didn't explode or anything...

I think you already indicated that there is no marshalling (only self-contained PODs allowed), which is the whole point of RDMA, so you'll need the same layout rules at all participating nodes (probably by allowing only homogeneous nodes). But why RDMA instead of MPI, and how do you deal with the different orders of magnitude in the penalties involved for making unfortunate scheduling decisions? How would NUMA fit into the picture?

AJ13
New Contributor I
You're right, STL containers won't work correctly because they have internal links. Theoretically I could do something like remote_memory<std::list<T> >, but in reality that wouldn't work (according to Meyers, the allocator isn't always called for allocating objects). The reality is that this type of construct would have to go hand-in-hand with a redevelopment of the STL containers for parallelism... which is something looming ahead anyway.
AJ13
New Contributor I
Hey Raf,

Yah, I've been super busy thinking about my research. I've noticed you've done quite a bit with porting the atomics, thanks!

The layout rules would be accomplished by having the same binary executing on all involved nodes... that way the memory layout of classes should be the same... although I might be wrong on that... I'm not sure how much will depend upon the compiler.

I didn't like the idea of using MPI, because I'd have to use its types... which could make serializing something like a graph to MPI types difficult... plus the wasted effort and processor cycles. Yes, MPI-2 does have a remote memory access facility, but I felt I'd be better off just going ahead with raw RDMA, since that's the only part I wanted anyway. I don't remember whether MPI-2 defined atomics either.

As far as scheduling the cluster goes, this is where I hope that work stealing will save us, although random stealing might not work well when there are 60,000 cores involved. However, I think available nodes will broadcast their lack of work, and they can be grabbed by chugging processors... something like that.
RafSchietekat
Valued Contributor III
Hmm, how feasible would it be to do some kind of in-place marshalling (an object that keeps its own state in memory in a platform-independent way, ready for "zero-copy", while still offering a real C++ interface to its local users)...

I didn't mean to imply any marshalling, and apparently Wikipedia even suggests other alternatives for the "problems" with RDMA.

I'm not sure atomics are suitable for use on clusters. There have already been other attempts to pretend that everything is local, which have not turned out well; TBB itself is in large part about breaking away from that attitude, even on a supposedly symmetric multiprocessor system. The primary advantage of atomics (even above any speedup) is probably to avoid priority inversion, convoying, and deadlock because the current owner kicked the bucket, but that can also be served by hardware locking or by transactional memory (bold vs. shy); BTW, I view the primitives on Alpha/ARM/MIPS/POWER/PowerPC as primordial transactional memory operations and am looking forward to how Rock will turn out. On a cluster, you will probably be better off with a message (also implementing the memory semantics, non-global in this case) and by designing to be failure-resistant without the use of a distributed transaction if you can avoid one (I can't believe I'm writing this): it might very well scale better to just send tasks off into the world and reissue them if a timely response doesn't come back, rather than keeping tight control over them in distributed transactions, if the business allows that, and any distributed transaction should be a lot more interesting than a simple atomic.

Well, that's just my initial impression, not to be taken too seriously (I would be happy to be corrected!), because unfortunately I have not yet had the opportunity to play with this.

AJ13
New Contributor I
Thanks for the feedback Raf.

I don't propose to provide a shared memory view to the user, i.e. I don't intend to construct a virtual shared memory system where every variable is shared... the performance on such a beast would be dismal.

However, rather than using a network interface for sending/receiving objects, I felt RDMA would be a better candidate for that. Compute nodes will still be running a listening server, which will receive events much like MPI... the events will be simple, and will reference memory that the node has already received, so there is no polling... worst case, the compute node is blocked waiting for a message on the socket, at which point other processes on the node can take priority thanks to the OS scheduler. RDMA is only used for transporting objects, not for remote function calls or anything like that.
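
Roughly, that listener could look like this (plain sockets for the control channel, purely illustrative; dispatch_task is a placeholder for handing the work to TBB):

#include <stdint.h>
#include <unistd.h>

// Small, fixed-size control message: "here is a task, its data is already in your memory".
struct task_notice {
    uint64_t task_id;
    uint64_t buffer_addr;   // where an earlier RDMA write placed the payload
    uint64_t buffer_len;
};

// Placeholder: a real version would hand the work to the local TBB scheduler.
inline void dispatch_task(const task_notice&) {}

void listen_loop(int control_socket_fd) {
    task_notice n;
    for (;;) {
        // Blocks until a peer announces work -- no busy polling, so the OS scheduler
        // can run other processes on this node in the meantime.
        ssize_t got = read(control_socket_fd, &n, sizeof(n));
        if (got != static_cast<ssize_t>(sizeof(n)))
            break;              // peer closed the connection or an error occurred
        dispatch_task(n);       // the payload is already local, no extra copy here
    }
}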

If you think about multi-core systems, they also have remote and local memory. For a tbb::task, local memory would be thread-local storage... and remote memory would be main memory. In the cluster environment, local memory is main memory, and remote memory requires RDMA operations.

By the way, my definition of a cluster means compute nodes that are connected over low-latency networks. When you mentioned sending tasks out into the world and hoping they come back, that seemed to me more like SETI@home... in this case I know that the nodes are up. In the real world a node might be dead, and that case can be handled a bit later... but the point is that in this case I can tell.
RafSchietekat
Valued Contributor III
No shared memory was implied, just a way of being able to use a heterogeneous cluster. Thanks for specifying how you sidestepped the potential problems documented for RDMA. Sorry for the confusion about "into the world": I did mean a local cluster (nodes can fail).

AJ13
New Contributor I
I've been looking at some issues with RDMA... in particular the fact that for N nodes, each node will require N-1 connections. Supposedly this can take quite a bit of memory; I'm looking into that right now.

I'm not convinced that I have the right approach to a TBB-cluster interface. Certainly remote memory operations will be important somewhere, but the more I think about it, the more I think it's too low-level. I think that task parallelism is the right approach, and that work stealing will give a good balance on a cluster.

If we had a nice collection of containers that could be distributed by themselves, the user interface to those containers would hide remote memory completely... unless the user wanted to get into that level of detail. What I think is needed is some higher-level abstraction which will make things seem much simpler.