Temporary storage and "actual" concurrency

Chris_Newbold · ‎09-17-2007

I'm experimenting with TBB to replace some home-brew parallelization framework used for algorithms which can have the input subdivided and processed in independent blocks--- an obvious use-case for parallel_for.

In the existing implementation, the input is subdivided into blocks for two reasons: parallelism and memory utilization. In most cases, we need temporary storage equal in size to the input in order to run the algorithm in question; dividing the input into blocks not only opens the door to parallelism, it also puts an upper bound on the amount of temorary storage needed.

I found it relatively straight-forward to convert the existing code to TBB's parallel_for using a blocked_range and setting the grain size to be our former block size. The thing that still has me fumbling is how to handle the temporary storage. Here's a break-down of alternatives and my analysis of each:

Put the temporary storage on the stack in the body given to parallel_for. The amount of storage allocated will by bounded by actual parallelsim (good!) but the amount allocated may be "too much" to prudently put on the stack (possibly bad!)
Allocate heap storage from within the body given to parallel_for. The amount of temporary storage allocated will be bounded by actual parallelism (good!) but we'll pay the cost of repeatedly allocating and deallocating for each task (very bad!)
Allocate heap storage from the copy-constructor for the body. This is no good because we'll instantly pay in both time and space for temporary storage based on problem size divided by grain size (the number of tasks created by parallel_for)

The existing implementation uses threads directly and is able to create one temporary storage buffer for each thread and to do so once at the start of the calculation. This has the desirable properties of bounding temporary storage requirements by actual parallelism (the number of threads), and to amortize the cost of allocation across the entire calculation.

It seems like putting the temporary storage on the stack is most natural, and I suppose I could reduce the grain size to try to keep a lid on the amount used. I'd be concerned, however, that the resulting grain size might be too small and result in reduced efficiency on machines with limited or no extra processors.

Does anyone have any comments or suggestions on the "best" way to solve this sort of problem?

jimdempseyatthecove · ‎09-17-2007

Create a list of pointers to temporary storage for your process. The size of the list is dependent on the number of concurrent threads assigned to your process.

If you are using OpenMP and if only one team of threads will be active for the process at a time then the size of the list of pointers need only be that of the max number of threads allocated to the team. You may want to also keep the current size allocated just in case the current size is too small. You can use the OpenMP team member number as an index into the list of pointers.

On entry into the subroutine (to process a slice of your data) call omp_get_thread_num() to get the OpenMP team member number, test size available at current index (0=not allocated), if not allocated then allocate to size, if allocated but too small then deallocate and allocate to size.

You can determine when it is prudent to reclaim this memory

Jim Dempsey

Chris_Newbold · ‎09-17-2007

Thanks for the reply, but I'm trying to use Thread Building Blocks, not OpenMP....

ARCH_R_Intel · ‎09-18-2007

There are three tactics you might try. The best one depends on the particulars of your situation.

1. Using a bounded amount of stack space might actually be the best strategy, even on a uniprocessor! The reason is cache blocking. Reusing the same buffer, if it fits in cache, may yield improvements that outweigh the overheads of having more tasks than threads. The TBB scheduler is pretty good about this sort of thing.Because of the Cilk-like scheduling strategy used in TBB, thestorage will be bounded by the number of threads, even if you have many more grains of work than threads. Be sure to use the default simple_partitioner, which guarantees the subranges will by bounded by the grainsize. In contrast, auto_partitioner makes no such guarantee.

2. Use parallel_reduce instead of parallel_for, and allocate the buffer in the Body object. Make the join operation a noop. parallel_reduce has an implementation feature where it lazily copies the Body as necessary as threads steal work. On a uniprocessor, no stealing occurs, and hence no extra copies of the Body are generated. With p threads and n grains of work, you'll get somewhere between p bodies and n bodies, depending on how the load balancing goes. We've used this trick of using paralll_reduce in place of parallel_for when TBBifying a SpecOMP benchmark and found it effective when faced with a situation similar to yours.

3. Put the buffers in a tbb::concurrent_queue. Start with an empty queue. Each time you need a buffer, try to pop it from the queue using pop_if_present. If that fails, allocate a new buffer. When done with the buffer, return it to the queue. We have one user who reported success with this approach.

The one drawback of (3) is that there is no cache affinity. We've been thinkinga while about addressing this by introducing a new class concurrent_bag. It would be similar to concurrent_queue, but instead of having FIFO order, would have an order biased towards returning a item that is hot in cache.

I'd be interested in hearing about whether (1), (2), or (3) works for you, and which turned out to be best for your purpose.

- Arch Robison

robert-reed · ‎09-18-2007

Yes, you're right that the most natural place to put the temporary storage is on the thread stack, as realized by creating a member variable in the class that defines the functor body, but is as you say subject to concern about overflowing stack size limits. The normal way to handle that would be in heap allocations, which if regularly discarded as you suggest could result in unacceptable overhead.

So, one answer is to do something in between. Do the heap allocations you suggest, but reuse the buffers rather than discarding them. I used a similar method in some work with the TBB pipeline. I used aconcurrent_queue tohold the buffers in a thread-safe manner in between uses. The logistics of the pipeline are a little different, but Ithink it might still work. The way I think this might be employed is in the body constructor/destroyer:

MyBody::MyBody() { if (!Bpool.pop_if_present(m_buffer)) { m_buffer = new MyBuffer(...); } } MyBody::~MyBody() { Bpool.push(m_buffer); }

Now, every time TBB creates a functor-body object, if the queue fails to deliver, the new should supply it. Every time one of these objects is destroyed, the buffer is recovered in Bpool. Buffer count shouldshould not go higher thanthe number of simultaneous functor-body objects, which should be the same count as the number of threads in the TBB pool. Preserving the buffers avoids the allocate/free costs and the modest replacement cost of some concurrent queue management. You might also want a sanity check to verify the buffer got allocated before using it. Since this is not possible to add to the constructor, some auxilliary member function might be needed as a helper, but the gist of the idea is here. Does this help?

robert-reed · ‎09-18-2007

Arch beat me to the post! FYI, my solution fits under number 3 in Arch's exposition. Cache affinity problems might be an issue in this circumstance. When this technique was employed in a TBB pipeline experimentwith a token count matching the number of HW threads available, the buffer assignments were regularized to the point where each thread always ended up with the same buffer, though in a constructor/destroyer paradigm, your mileage may vary.

Chris_Newbold · ‎09-26-2007

Thanks for your suggestions, Arch.

In retrospect, it doesn't seem all that scary to use the stack (1) for temporary storage. With a rational grain size, the amount of temorary data can be reasonably bounded and certainly the "allocation" will be efficient. Alas the existing code which I'm attempting to adapt isn't quite flexible enough to do this easily. I will look into doing some refactoring to allow this in the future, though.

As for (3), the idea of having a queue of buffers did occur to me, but I didn't like the idea of having any sort synchronization (however lightweight it may be in concurrent_queue) in the body of my parallel_for.

Which brings me to (2). It looks like parallel_reduce has the semantics I had hoped to find in parallel_for-- doing lazy copies of the body. With that sort of visibility into stealing I can readily combine (2) and (3) so that there is no overhead for synchronization on a queue of buffers so long as stealing does not happen. The upper bound on the number of temporary buffers is no longer strictly the number of OS threads, but it is realistically going to be orders of magnitude smaller than the number of grains. This strategy also results in better cache affinity than acquiring a buffer for each grain.

Is there some reason why parallel_for does not use lazy copying for the body? It seems to me that is a very powerful tool.

-Chris