How to implement our cache pattern using TBB?

Chan__Eric · ‎11-30-2017

Hi,

I am revising an older code base's threading architecture to use TBB instead. Most of our existing patterns are pretty easy to migrate. However, we do have one common pattern that I'm having trouble adapting to TBB, and I'm hoping someone here can help. Here's the pattern.

In our software, we often have multiple threads working concurrently (say, reading from) an expensive image object. Because this object is expensive to compute, we use caching extensively. Here's how our cache works.

Consider an object A. Suppose two tasks running concurrently (on two different threads) both want to read from object A. The first time a task tries to access A, A doesn't exist yet, so we need to compute it. It's expensive to compute, so I don't want both tasks to compute A independently (which would mean building two separate objects A). Instead, I'd like to build A once, and cache it, and have the second task simply read the cached version. Finally, the algorithm to build A is MP-friendly, so we use TBB (parallel_for, flow graphs etc.) to build A.

Before the migration to TBB, our code would use a mutex to implement thread-safe access to the cache. The first thread that tries to access A from the cache would:

1. acquire the mutex
2. find that A doesn't exist yet
3. compute A (using MP-friendly algorithms)
4. store A in the cache
5. read whatever data it needs from A
6. release the mutex

The second thread that tries to access the cache would:

7. acquire the mutex
8. lookup A in the cache (it now exists)
9. read the data it needs from the (cached) A
10. release the mutex

If two threads try to read from the cache concurrently, then the second thread is blocked (waits on the mutex).

My understanding is that the above implementation won't work with TBB, because the algorithm to compute A (step 3) would be executing a TBB task graph while a mutex is held (step 1), which is bad (a TBB anti-pattern, right?). This has two problems. First, the TBB scheduler might end up re-entering an outer scope and try to re-acquire the same mutex, which then results in undefined behavior, such as deadlock. Second, other threads waiting on the mutex (step 7) cannot participate in TBB task execution; that is, thread pool is depleted of useful workers.

I considered releasing the mutex before step 3, and re-acquiring it after step 4. However, this means that multiple threads could end up computing A independently (and only the first one that finishes is kept), which seems wasteful.

Any suggestions?

Thanks,
Eric

Chan__Eric · ‎12-01-2017

I found this thread which seems related:

https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/703652

The "lazy init" pattern is similar, but the proposed solution there doesn't work directly for my case, because our cache isn't just a one-time initialization. It's a data structure that needs to be updated periodically, in a thread-safe fashion.

Another way to phrase the problem I'm trying to solve is that I want something that is similar in spirit to a mutex, but (1) allows any waiting threads to participate in the work pool, instead of spinning uselessly, and (2) prevents tasks from work-stealing an "outer" task that might then try to re-acquire the same lock.

(I understand that the TBB flow graph supports this idea of serialization quite elegantly ... unfortunately, it's not feasible to migrate some parts of our old code base directly to flow graph.)

Thanks,

Eric